Computer Chess Club Archives

Subject: Re: If 75 Games are not considered a Statistical proof, neither is the SSDF.

Author: Dann Corbit
Date: 19:33:11 01/30/01

On January 30, 2001 at 22:18:15, Bruce Moreland wrote:

>On January 30, 2001 at 21:24:13, Robert Hyatt wrote:
>
>>On January 30, 2001 at 17:42:59, Dann Corbit wrote:
>
>>>Additional measurements will not (in general) make the answer less accurate
>>>(unless something is wrong with the measurements).  However, if two programs are
>>>about equal, you will [basically] never determine which is stronger by playing
>>>them against each other.  For anyone who would like to prove this to themselves,
>>>just play a program against itself 10 times, 50 times, 100 times and 1000 times.
>>
>>This is what most miss, statistics-wise.  For someone with the time, it would
>>be interesting to have them use xboard/winboard and play a bunch of 100 game
>>matches between the _same_ version of an engine.  The outcome can be absolutely
>>astounding.  IE out of 100 games, you might get 30 wins, 10 losses, 60 draws
>>by A.  The next time you get 60 wins, 20 losses, 20 draws by A.  You begin
>>to conclude A is better until you realize A and B are _identical_.  You run
>>the test again and B wins this time.
>
>Give me a match result, and the approximate rate at which the program draws
>games with a particular opponent, and I'll tell you how likely it is that the
>weaker program won the match.
>
>>I ran a _bunch_ of 100 game matches to convince me that no-recapture was better
>>than using recapture.  In my program.  But the first two 100 game matches had
>>me convinced that using recapture was better.  But the next 50 matches had
>>no-recapture winning almost all, although never by more than 10-20 points
>>total (out of 100 total).
>>
>>Pretty interesting stuff.  If I take that first 100 game match, and extract
>>any consecutive N games I choose, I can produce any outcome I would want.
>>
>>IE N-0-0, 0-N-0, 0-0-N, etc.  Which means that a small number of games is
>>hardly more than a crap-shoot.  Which makes you wonder what any tournament
>>winner means at all.  :)
>
>No, this is not true.  Here is an example.  Let's take two equally strong
>programs and discount the possibility of draw results (all games are won or
>lost).  The odds that a particular side will win 10 times in a row are 1 / 2^10.
> That's a relatively small number of games in the match, and the odds of getting
>that result with equal programs are quite low.  But if you do a long enough
>series, you will eventually find 10 wins in a row, even if the programs are the
>same strength.
>
>This does not mean that 10 wins in a row means nothing.  It means that it's
>improbable that the outcome is due to chance.  Obviously that doesn't mean that
>the outcome is impossible, and if you do a long enough run, it's going to happen
>*somewhere in the run*.
>
>If you *start* your run with two unknown programs, and encounter 10 wins in a
>row right at the start, what does that tell you?  It's a very unlikely outcome.
>Not impossible, but unlikely.
>
>You have to make a conclusion.  Does it make sense to say, "This could have been
>chance, I'll throw it out"?  It doesn't make sense.  Of course it could have
>been chance, but if you aren't willing to accept that it is much more likely
>that one of the programs is stronger than the other one, you may as well not
>play matches.
>
>Of course, if you *know* that they are the same strength, because they are the
>same engine, all you are doing is a probability exercise.  You are just flipping
>a coin a bunch of times and going "ooh" if you get a weird outcome.  And if you
>get a lot of insane outcomes your test is probably broken.
>
>The most important point I'd like to make with this, is that any match result
>can be achieved due to chance, assuming some hypothetical Elo delta between the
>two programs.

That is exactly why a large number of runs is needed.  What they will give you
is a window of probability.  In other words, there is a 2/3 chance that the
strength of A is x +/- windowA and the strength of B is y +/- windowB.  There is
a 95% chance that the strength of A is x +/- windowA2 and the strength of B is
y +/- windowB2.  As you run more and more tests, the windows get smaller and
smaller.  If you run a huge number of tests, the windows will be pinholes.  If
you flip a coin one trillion times and get 47% heads, then you can show
statistically that it is NOT a fair coin.
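
As a rough sketch of how the windows shrink, the following uses the normal
approximation to the binomial for a match with no draws (a simplifying
assumption).  The function name score_window and the z-values 1.0 and 1.96 for
the roughly 2/3 and 95% windows are choices made here for illustration, and the
windows are in score percentage, not Elo.

```python
import math

def score_window(games, z, p=0.5):
    """Half-width of the window around the expected score fraction p.

    Normal approximation to the binomial; assumes no draws and
    independent games (both are simplifying assumptions).
    """
    return z * math.sqrt(p * (1.0 - p) / games)

for games in (10, 100, 1000, 10_000, 1_000_000):
    w68 = score_window(games, 1.0)    # roughly a 2/3 window
    w95 = score_window(games, 1.96)   # roughly a 95% window
    print(f"{games:>9} games: 50% +/- {100 * w68:.2f}% (2/3), "
          f"+/- {100 * w95:.2f}% (95%)")

# The trillion-flip coin: how many standard deviations is 47% from 50%?
flips = 10**12
sigmas = (0.50 - 0.47) / math.sqrt(0.25 / flips)
print(f"47% heads in {flips} flips is about {sigmas:.0f} "
      f"standard deviations from a fair coin")
```

The window narrows with the square root of the number of games, so ten times
the precision costs a hundred times the games.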

>Once again discounting the possibility of draws, a result of +40 -60 or worse
>for you will come up by chance almost 3% of the time, assuming the two programs
>are of equal strength.
>
>Why does this inspire more confidence than a result of +0 -10, which will come
>up by chance only about 0.1% of the time?
>
>You have two outcomes.  One is very unlikely, one is ridiculously unlikely.  Why
>are people willing to believe that there is a higher probability that the second
>one is due to chance?
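
The two figures quoted above can be checked exactly.  This is a minimal sketch,
assuming equal programs and no draws as in the quoted text; the helper name
prob_at_most is made up for illustration.

```python
from math import comb

def prob_at_most(wins, games, p=0.5):
    """Exact probability of scoring `wins` or fewer in `games` decisive games."""
    return sum(comb(games, k) * p**k * (1 - p)**(games - k)
               for k in range(wins + 1))

p_40_60 = prob_at_most(40, 100)   # +40 -60 or worse
p_0_10 = prob_at_most(0, 10)      # +0 -10
print(f"+40 -60 or worse: {100 * p_40_60:.2f}%")
print(f"+0 -10:           {100 * p_0_10:.2f}%")
```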

For two series of ten measurements, you will clearly get different results.
Keeping it simple with the fair coin, we will have a red penny and a black penny
and flip each twice.  Here are the possible outcomes:

f1,f2
-- --
RH,RH
RH,RT
RT,RH
RT,RT

BH,BH
BH,BT
BT,BH
BT,BT

So, counting all four flips together, there are 4 x 4 = 16 equally likely
combined outcomes: one way to get 4 heads, one way to get 4 tails, and 14 ways
to get something else.  We therefore have the astonishing odds of 1/8 for a
whitewash (for one side or the other) with a completely even match.  Now, expand
the experiment to include one trillion flips and we still have only two ways to
achieve a whitewash but as you can see, the probability [just by counting] of a
whitewash has become vanishingly small by this time.
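
One way to check the counting is brute force.  The sketch below treats all the
flips as a single combined sequence (an assumption about how to count) and
enumerates every equally likely heads/tails sequence; the function name
whitewash_count is hypothetical.

```python
from itertools import product

def whitewash_count(flips):
    """Count all-heads or all-tails sequences among every sequence of fair flips.

    Returns (whitewashes, total_sequences).
    """
    total = 0
    hits = 0
    for outcome in product("HT", repeat=flips):
        total += 1
        if len(set(outcome)) == 1:   # every flip landed the same way
            hits += 1
    return hits, total

for flips in (2, 4, 10, 16):
    hits, total = whitewash_count(flips)
    print(f"{flips:>2} flips: {hits} of {total} sequences = {hits / total:.6f}")
```

The number of whitewashes stays at two while the number of sequences doubles
with every extra flip.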

>The only real gotcha with the +0 -10 result is that some fool might do a bunch
>of 10-game runs until they get that result, or that they'll pick a +0 -10 run
>out of a larger run and claim that it means something, which it doesn't.
>
>If you start two programs playing, and achieve a +0 -10, you've all but proven
>that the second is better than the first.

The SSDF has had results that started out like this and the "worst" program won!

>If you do a fixed 100-game match and
>get +40 -60, the odds of an erroneous conclusion are *much higher*, and so
>should be looked upon with more skepticism.

Back to class.
;-)

>>> The figure *should* [obviously] hover around 50% points scored for each side.
>>>It is very unlikely that the ten game match will be close to 50%.  The 100 game
>>>match will probably be fairly close.  It is rather unlikely that the 1000 game
>>>match will be far from 50%, but it is very unlikely it will be exactly 50%.  In
>>>fact, if it should be exactly 50%, the Chi-Squared Test will reject it!  It
>>>throws out both things that don't seem to fit the model and also things that fit
>>>so perfectly something looks fishy.
>>
>>
>>I think that 1000 game match might well be way off from 50%.  IE it is not
>>unlikely that one side will pull ahead a significant amount, and then they
>>start playing equally for the rest of the match.  But there is nothing in
>>statistics that says after you flip a coin and get 100 consecutive heads,
>>that sometime later you will get 100 consecutive tails to offset them.  It
>>is more likely that this series will simply end with heads being ahead 100
>>counts...
>
>In a 1000 game match, the odds that your program will score between 462
>and 538 are over 98%, assuming that it's exactly as strong as its opponent, and
>assuming that draws are not possible.
>
>With a draw percentage of 35%, you have a 98% chance of being between 470 and
>530.  About half the time you'll be between 491 and 509.
>
>With 100 games and 35% draws, about 98% of the time you'll be somewhere between
>41 and 59, and half the time you'll be between 47 and 53.
>
>So you have quite a bit less variation with 1000 than you do with 100.

Now here we agree.  I don't understand why you think fewer trials give better
data, since you seem to understand what is going on here.
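
For anyone who wants to reproduce ranges like the ones quoted above, here is a
hedged sketch using the normal approximation with a continuity correction.  It
assumes equal programs, independent games, and a fixed draw rate; the function
name score_range_probability is made up, and this is not necessarily how the
quoted figures were computed.

```python
import math

def score_range_probability(games, draw_rate, low, high):
    """Approximate chance that an evenly matched program scores low..high points.

    Wins count 1, draws 0.5, losses 0.  Normal approximation with a
    continuity correction of half the score step.
    """
    win_rate = (1.0 - draw_rate) / 2.0
    # Per-game variance of the score around its mean of 0.5:
    # wins and losses are each 0.5 away from the mean, draws are 0 away.
    variance = 2.0 * win_rate * 0.25
    sigma = math.sqrt(games * variance)
    mean = games / 2.0
    step = 0.5 if draw_rate > 0 else 1.0   # totals move in half points with draws

    def cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    z_hi = (high + step / 2.0 - mean) / sigma
    z_lo = (low - step / 2.0 - mean) / sigma
    return cdf(z_hi) - cdf(z_lo)

cases = [
    (1000, 0.00, 462, 538),
    (1000, 0.35, 470, 530),
    (1000, 0.35, 491, 509),
    (100, 0.35, 41, 59),
    (100, 0.35, 47, 53),
]
for games, draws, low, high in cases:
    p = score_range_probability(games, draws, low, high)
    print(f"{games:>4} games, {draws:.0%} draws, {low}-{high}: {100 * p:.1f}%")
```

A higher draw rate lowers the per-game variance, which is why the 98% range is
tighter with 35% draws than with none.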
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.