Computer Chess Club Archives



Subject: Re: If 75 Games are not considered a Statistical proof, neither is the SSDF.

Author: Dann Corbit

Date: 17:40:03 01/30/01



On January 30, 2001 at 19:06:15, Dann Corbit wrote:

>On January 30, 2001 at 18:43:12, Bruce Moreland wrote:
>
>>On January 30, 2001 at 17:42:59, Dann Corbit wrote:
>>
>>>Additional measurements will not (in general) make the answer less accurate
>>>(unless something is wrong with the measurements).
>>
>>If A is 1000 Elo points stronger than B, you will probably have a more accurate
>>answer after 20 games than you will after 100 games if A is only 1 Elo point
>>stronger than B and you get close to the expected 50-50 result.
>
>This is simply not correct.  Let's choose a simpler model and something we know
>is about even:
>Heads or tails with a penny.
>
>Try ten flips, and have ten friends do the same.  Most of the 11 experiments
>will not have 5/5 divisions of heads/tails.
>
>Repeat the experiment with a larger number of flips until you get bored with it.
>As you get a larger and larger number of measurements, the probability that you
>get close to the right answer increases.  It does not decrease.  And if you
>average the results from all 11 experimenters, you will get an even better
>result (on average).

After re-reading your claims, I think I have a better picture of what you were
driving at.  I think what you are saying is that it is easier to show
differences when the differences are great, and this is clearly correct.

I only wanted to clarify that more data will always give greater confidence in
the answer.
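
To make that concrete, here is a minimal simulation of the coin-flip experiment
quoted above (a sketch in Python; the flip counts and the 11 "experimenters"
are just the numbers from the example):

  import random

  def heads_fraction(n_flips, rng):
      # Flip a fair coin n_flips times; return the observed fraction of heads.
      heads = sum(rng.random() < 0.5 for _ in range(n_flips))
      return heads / n_flips

  rng = random.Random(2001)
  for n in (10, 100, 1000, 10000):
      # 11 independent experimenters, as in the example above.
      results = [heads_fraction(n, rng) for _ in range(11)]
      spread = max(results) - min(results)
      pooled = sum(results) / len(results)
      print("%6d flips: spread %.3f, pooled estimate %.3f" % (n, spread, pooled))

The spread across experimenters shrinks as the number of flips grows, and the
pooled (averaged) estimate settles toward 0.5, which is exactly the point:
more measurements do not make the answer worse.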

Also, the example you used with a 1000 ELO difference will actually make it very
difficult to show anything except which program is stronger.  (In other words,
with a 1000 ELO difference you will not have any reliable ELO figures for either
program after the experiment.  For ELO _calculation_, this particular extreme is
just as bad as two programs that are identical.)  On the other hand, if you just
want to know which is stronger, then obviously you will know pretty quickly,
unless the difference in strength is small.  The simple analogy is asking which
of two rulers is longer, versus asking which is longer, a tiny rowboat or the
"Queen Mary".  To decide which ruler is longer (and one will surely be longer
than the other) will be very difficult, and we won't be very sure of our answer
even after measuring one thousand times.  But the ships we can tell apart at a
glance.
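
The ruler/ship point can be put in numbers.  Under the standard Elo model, the
expected score for the stronger side is E = 1 / (1 + 10^(-D/400)), where D is
the rating difference.  A quick sketch (same Python as above):

  def expected_score(elo_diff):
      # Expected score for the stronger side under the standard Elo model.
      return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

  for d in (1, 100, 400, 1000, 10000):
      print("%5d ELO difference -> expected score %.4f" % (d, expected_score(d)))

At D = 1 the expected score is about 0.5014, indistinguishable from a coin flip
over any reasonable match.  At D = 1000 it is about 0.9968, and at D = 10000 it
is 1.0000 for all practical purposes, which is why a string of 10-0 results can
tell you "stronger" but cannot tell you the size of the gap.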

>>It's not just the number of games; another major factor is the actual relative
>>strength, compared against the strength of the assertion you are trying to prove.
>
>This is correct.  You will also have problems if the strength difference is too
>great.  If (for instance) you have one program that is 200 ELO stronger and
>another that is 10000 ELO stronger, you might get 10-0 blankings 3 times in a
>row from both, even though there is an enormous difference in strength between
>those two programs.  But if you play the two weaker programs against each
>other, that will help a lot.
>
>>If you are trying to prove that A is no worse than 1000 Elo points worse than B,
>>it will almost certainly be very easy to confidently make this assertion after
>>20 games, if the two programs are the same strength.  If A really is about 1000
>>points worse than B, it will be harder.
>
>Another good comment that I agree with.
>
>>"A is stronger than B" can be a very weak claim, or it can be a very strong one.
>> That is why there is no fixed amount of games necessary to prove this, it
>>depends upon the actual Elo difference as measured by the match.
>
>The best way to get a really good number is to play a very large number of games
>against a pool of very diverse talent.
>
>>Of course, if you ran 500 games you could certainly make a claim that the
>>difference can't be too far from what you have measured.  If you get 252-248 you
>>can't declare that A is better than B, but you can certainly declare that A is
>>not likely to be much worse than B.
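
As an aside, the 252-248 case can be checked directly with a normal
approximation to the binomial (a sketch under simplifying assumptions:
independent games, draws folded into the score rather than modeled separately,
and z = 1.96 for a 95% window):

  import math

  def score_interval(points, games, z=1.96):
      # Approximate 95% interval for the true score fraction.
      p = float(points) / games
      se = math.sqrt(p * (1.0 - p) / games)
      return p - z * se, p + z * se

  lo, hi = score_interval(252, 500)
  print("252-248 in 500 games: true score in [%.3f, %.3f]" % (lo, hi))

The interval comes out to roughly [0.460, 0.548].  It straddles 0.5, so you
cannot declare A better than B, but a true score below 0.46 (about -28 ELO) is
already unlikely, which is exactly the "not likely to be much worse" claim.
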
>
>You can declare anything (of course) but what a set of experiments will tell you
>is an ELO strength +/- some given window.  Considering (again) the top two SSDF
>list entries:
>
>  Program     Hardware            Rating    +    -  Games  Won  Avg. opposition
>1 Fritz 6.0   128MB K6-2 450 MHz    2629  +25  -24    845  67%             2506
>2 Junior 6.0  128MB K6-2 450 MHz    2589  +23  -22   1027  65%             2483
>
>We see that within one standard deviation, Junior could be as strong as 2589 +
>23 = 2612, and Fritz could be as weak as 2629 - 24 = 2605.  So, even within one
>standard deviation, we are not really sure which program is stronger.
>
>If we played only these two engines against each other, it would be more
>difficult to get an accurate rating.  Even after a thousand games or so, we
>would not be at all sure which is stronger, and after ten games we would have
>no idea whatsoever.
>
>>At the risk of being repetitive, the difficulty of proving an assertion about
>>the strength of two programs seems to be very dependent upon the degree to
>>which the assertion rides the razor edge of truth and falsity.  If it's just
>>barely true, you may never prove it.
>
>This is undeniable.  However, more data does nothing to detract from the quality
>of the answer.  That was my only point.
>
>If I have two chess engines and I play one game, my error bar is infinite.
>
>If I play 100 games between them, the error bar is smaller, but still very
>large.
>
>If I play one trillion games between them, the error bar is very small (but
>still not zero).  However, with each additional measurement, confidence in the
>calculated strength difference rises.  The closer the two programs are in
>strength, the more difficult it becomes to find out which one is really
>stronger.  When programs are of approximately the same strength, it is virtually
>impossible to prove which one is stronger.
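
One way to see all of this at once is to compute the width of the ELO error bar
as a function of the number of games.  A sketch (same assumptions as before:
independent games, normal approximation, scores clamped away from 0 and 1 so
the conversion stays finite):

  import math

  def elo_width(games, score=0.5, z=1.96):
      # Approximate width of the 95% ELO interval for a measured score.
      se = z * math.sqrt(score * (1.0 - score) / games)
      lo = max(1e-12, score - se)
      hi = min(1.0 - 1e-12, score + se)
      def to_elo(p):
          # Convert a score fraction back to an ELO difference.
          return -400.0 * math.log10(1.0 / p - 1.0)
      return to_elo(hi) - to_elo(lo)

  for n in (1, 10, 100, 1000, 1000000):
      print("%8d games: ELO interval width ~ %.1f" % (n, elo_width(n)))

With one game the interval spans the whole clamped range (thousands of ELO,
effectively infinite), at 100 games it is still well over 100 ELO wide, and it
shrinks roughly as one over the square root of the number of games.  It never
reaches zero; it only narrows.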


