Computer Chess Club Archives


Subject: Re: If 75 Games are not considered a Statistical proof, neither is the SSDF.

Author: Bruce Moreland

Date: 19:18:15 01/30/01


On January 30, 2001 at 21:24:13, Robert Hyatt wrote:

>On January 30, 2001 at 17:42:59, Dann Corbit wrote:

>>Additional measurements will not (in general) make the answer less accurate
>>(unless something is wrong with the measurements).  However, if two programs are
>>about equal, you will [basically] never determine which is stronger by playing
>>them against each other.  For anyone who would like to prove this to themselves,
>>just play a program against itself 10 times, 50 times, 100 times and 1000 times.
>
>This is what most miss, statistics-wise.  For someone with the time, it would
>be interesting to have them use xboard/winboard and play a bunch of 100 game
>matches between the _same_ version of an engine.  The outcome can be absolutely
>astounding.  IE out of 100 games, you might get 30 wins, 10 losses, 60 draws
>by A.  The next time you get 60 wins, 20 losses, 20 draws by A.  You begin
>to conclude A is better until you realize A and B are _identical_.  You run
>the test again and B wins this time.

Give me a match result, and the approximate rate at which the program draws
games with a particular opponent, and I'll tell you how likely it is that the
weaker program won the match.

>I ran a _bunch_ of 100 game matches to convince me that no-recapture was better
>than using recapture.  In my program.  But the first two 100 game matches had
>me convinced that using recapture was better.  But the next 50 matches had
>no-recapture winning almost all, although never by more than 10-20 points
>total (out of 100 total).
>
>Pretty interesting stuff.  If I take that first 100 game match, and extract
>any consecutive N games I choose, I can produce any outcome I would want.
>
>IE N-0-0, 0-N-0, 0-0-N, etc.  Which means that a small number of games is
>hardly more than a crap-shoot.  Which makes you wonder what any tournament
>winner means at all.  :)

No, this is not true.  Here is an example.  Let's take two equally strong
programs and discount the possibility of draw results (all games are won or
lost).  The odds that a particular side will win 10 times in a row are 1 / 2^10,
or about 0.1%.  That's a relatively small number of games in the match, and the odds of getting
that result with equal programs are quite low.  But if you do a long enough
series, you will eventually find 10 wins in a row, even if the programs are the
same strength.

This does not mean that 10 wins in a row means nothing.  It means that it's
improbable that the outcome is due to chance.  Obviously that doesn't mean that
the outcome is impossible, and if you do a long enough run, it's going to happen
*somewhere in the run*.
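
To put rough numbers on the streak argument, here is a minimal Python sketch. It assumes what the example above assumes: equal programs and no draws. The function name prob_streak and its parameters are mine, made up for illustration. It tracks the current winning streak of one particular side game by game and adds up the probability that the streak ever reaches the target length.

```python
def prob_streak(n_games, streak, p_win=0.5):
    """Probability that one particular side wins `streak` games in a row
    at least once during `n_games` decisive games (no draws assumed)."""
    # state[i] = probability that the current winning streak is i games long
    # and the full streak has not been seen yet.
    state = [0.0] * streak
    state[0] = 1.0
    hit = 0.0  # probability that the full streak has already occurred
    for _ in range(n_games):
        new = [0.0] * streak
        for length, prob in enumerate(state):
            new[0] += prob * (1 - p_win)      # a loss resets the streak
            if length + 1 == streak:
                hit += prob * p_win           # this win completes the streak
            else:
                new[length + 1] += prob * p_win
        state = new
    return hit


if __name__ == "__main__":
    # Streak right at the start of a 10-game match: exactly 1 / 2^10.
    print(prob_streak(10, 10))
    # Streak somewhere inside longer runs: far more likely.
    for n in (100, 1000, 10000):
        print(n, prob_streak(n, 10))
```

The first number is the "10 wins right at the start" case; the others show how the same streak becomes likely once you are allowed to look for it anywhere in a long run.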

If you *start* your run with two unknown programs, and encounter 10 wins in a
row right at the start, what does that tell you?  It's a very unlikely outcome.
Not impossible, but unlikely.

You have to make a conclusion.  Does it make sense to say, "This could have been
chance, I'll throw it out"?  It doesn't make sense.  Of course it could have
been chance, but if you aren't willing to accept that it is much more likely
that one of the programs is stronger than the other one, you may as well not
play matches.

Of course, if you *know* that they are the same strength, because they are the
same engine, all you are doing is a probability exercise.  You are just flipping
a coin a bunch of times and going "ooh" if you get a weird outcome.  And if you
get a lot of insane outcomes your test is probably broken.
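
For anyone who wants to do the coin-flipping without tying up an engine, here is a small simulation sketch in Python. It does not play chess. It just draws each game result at random for two identical "programs", assuming a 35% draw rate (an arbitrary choice) and an even split of the decisive games. The name play_match and the seed are mine.

```python
import random


def play_match(rng, games=100, draw_rate=0.35):
    """Simulate one match between two identical programs.

    Each game is a draw with probability `draw_rate` (assumed value);
    otherwise side A wins or loses with equal probability.
    Returns (wins, losses, draws) from A's point of view.
    """
    wins = losses = draws = 0
    for _ in range(games):
        r = rng.random()
        if r < draw_rate:
            draws += 1
        elif r < draw_rate + (1 - draw_rate) / 2:
            wins += 1
        else:
            losses += 1
    return wins, losses, draws


# Fixed seed only so that the run is repeatable.
rng = random.Random(2001)
matches = [play_match(rng) for _ in range(20)]

if __name__ == "__main__":
    for w, l, d in matches:
        print(f"+{w} -{l} ={d}")
```

Change the seed, the number of matches, or the draw rate and watch how far the individual 100-game results wander even though nothing distinguishes the two sides.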

The most important point I'd like to make with this is that any match result
can be achieved due to chance, assuming some hypothetical Elo delta between the
two programs.

Once again discounting the possibility of draws, a result of +40 -60 or worse
for you will come up by chance almost 3% of the time, assuming the two programs
are of equal strength.

Why does this inspire more confidence than a result of +0 -10, which will come
up by chance only about 0.1% of the time?
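
Both figures are plain binomial tail sums, and can be checked with a few lines of Python. This is a sketch under the same assumptions as above (equal strength, no draws); the function name tail_at_most is mine.

```python
from math import comb


def tail_at_most(wins, games, p_win=0.5):
    """Probability of scoring `wins` or fewer wins in `games` decisive games."""
    return sum(
        comb(games, k) * p_win**k * (1 - p_win) ** (games - k)
        for k in range(wins + 1)
    )


if __name__ == "__main__":
    print("+40 -60 or worse:", tail_at_most(40, 100))
    print("+0 -10:          ", tail_at_most(0, 10))
```

The second value is exactly 1 / 2^10, the same number as in the 10-in-a-row example.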

You have two outcomes.  One is very unlikely, one is ridiculously unlikely.  Why
are people willing to believe that there is a higher probability that the second
one is due to chance?

The only real gotcha with the +0 -10 result is that some fool might do a bunch
of 10-game runs until they get that result, or that they'll pick a +0 -10 run
out of a larger run and claim that it means something, which it doesn't.

If you start two programs playing, and achieve a +0 -10, you've all but proven
that the second is better than the first.  If you do a fixed 100-game match and
get +40 -60, the odds of an erroneous conclusion are *much higher*, and so that
result should be looked upon with more skepticism.

>> The figure *should* [obviously] hover around 50% points scored for each side.
>>It is very unlikely that the ten game match will be close to 50%.  The 100 game
>>match will probably be fairly close.  It is rather unlikely that the 1000 game
>>match will be far from 50%, but it is very unlikely it will be exactly 50%.  In
>>fact, if it should be exactly 50%, the Chi-Squared Test will reject it!  It
>>throws out both things that don't seem to fit the model and also things that fit
>>so perfectly something looks fishy.
>
>
>I think that 1000 game match might well be way off from 50%.  IE it is not
>unlikely that one side will pull ahead a significant amount, and then they
>start playing equally for the rest of the match.  But there is nothing in
>statistics that says after you flip a coin and get 100 consecutive heads,
>that sometime later you will get 100 consecutive tails to offset them.  It
>is more likely that this series will simply end with heads being ahead 100
>counts...

In a 1000 game match, the odds that your program will score between 462
and 538 are over 98%, assuming that it's exactly as strong as its opponent, and
assuming that draws are not possible.

With a draw percentage of 35%, you have a 98% chance of being between 470 and
530.  About half the time you'll be between 491 and 509.

With 100 games and 35% draws, about 98% of the time you'll be somewhere between
41 and 59, and half the time you'll be between 47 and 53.

So you have quite a bit less variation with 1000 than you do with 100.
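
Here is one way to reproduce ranges like these, as a sketch rather than the method actually used for the figures above. It builds the exact distribution of the match score by adding one game at a time, counting in half-points so that draws fit in. It assumes equal programs, so the non-drawn games split evenly. The names score_distribution and prob_score_between are mine.

```python
def score_distribution(games, draw_rate):
    """Exact distribution of the match score for two equal programs.

    Returns a list `dist` where dist[h] is the probability of scoring
    h half-points (loss = 0, draw = 1, win = 2 half-points per game).
    """
    p_win = p_loss = (1 - draw_rate) / 2
    dist = [1.0]  # before any game: 0 half-points with certainty
    for _ in range(games):
        new = [0.0] * (len(dist) + 2)
        for h, prob in enumerate(dist):
            new[h] += prob * p_loss
            new[h + 1] += prob * draw_rate
            new[h + 2] += prob * p_win
        dist = new
    return dist


def prob_score_between(games, draw_rate, low, high):
    """Probability that the score in points lies in [low, high], inclusive."""
    dist = score_distribution(games, draw_rate)
    return sum(dist[2 * low : 2 * high + 1])


if __name__ == "__main__":
    print(prob_score_between(1000, 0.00, 462, 538))
    print(prob_score_between(1000, 0.35, 470, 530))
    print(prob_score_between(1000, 0.35, 491, 509))
    print(prob_score_between(100, 0.35, 41, 59))
    print(prob_score_between(100, 0.35, 47, 53))
```

Whether the endpoints are counted as inside the range moves the narrow "about half the time" figures by a few percent, so treat those as approximate.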

bruce



Last modified: Thu, 15 Apr 21 08:11:13 -0700
