Computer Chess Club Archives


Subject: Re: If 75 Games are not considered a Statistical proof, neither is the SSDF.

Author: Robert Hyatt

Date: 20:12:30 01/30/01


On January 30, 2001 at 22:18:15, Bruce Moreland wrote:

>On January 30, 2001 at 21:24:13, Robert Hyatt wrote:
>
>>On January 30, 2001 at 17:42:59, Dann Corbit wrote:
>
>>>Additional measurements will not (in general) make the answer less accurate
>>>(unless something is wrong with the measurements).  However, if two programs are
>>>about equal, you will [basically] never determine which is stronger by playing
>>>them against each other.  For anyone who would like to prove this to themselves,
>>>just play a program against itself 10 times, 50 times, 100 times and 1000 times.
>>
>>This is what most miss, statistics-wise.  For someone with the time, it would
>>be interesting to have them use xboard/winboard and play a bunch of 100 game
>>matches between the _same_ version of an engine.  The outcome can be absolutely
>>astounding.  IE out of 100 games, you might get 30 wins, 10 losses, 60 draws
>>by A.  The next time you get 60 wins, 20 losses, 20 draws by A.  You begin
>>to conclude A is better until you realize A and B are _identical_.  You run
>>the test again and B wins this time.
>
>Give me a match result, and the approximate rate at which the program draws
>games with a particular opponent, and I'll tell you how likely it is that the
>weaker program won the match.
>
>>I ran a _bunch_ of 100 game matches to convince me that no-recapture was better
>>than using recapture.  In my program.  But the first two 100 game matches had
>>me convinced that using recapture was better.  But the next 50 matches had
>>no-recapture winning almost all, although never by more than 10-20 points
>>total (out of 100 total).
>>
>>Pretty interesting stuff.  If I take that first 100 game match, and extract
>>any consecutive N games I choose, I can produce any outcome I would want.
>>
>>IE N-0-0, 0-N-0, 0-0-N, etc.  Which means that a small number of games is
>>hardly more than a crap-shoot.  Which makes you wonder what any tournament
>>winner means at all.  :)
>
>No, this is not true.  Here is an example.  Let's take two equally strong
>programs and discount the possibility of draw results (all games are won or
>lost).  The odds that a particular side will win 10 times in a row are 1 / 2^10.
> That's a relatively small number of games in the match, and the odds of getting
>that result with equal programs are quite low.  But if you do a long enough
>series, you will eventually find 10 wins in a row, even if the programs are the
>same strength.
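
The streak arithmetic quoted above can be checked with a short Python sketch. It assumes independent games, no draws, and a fixed win probability; the function name prob_streak and its parameters are made up for illustration and do not come from either poster. It computes the chance that one particular side wins at least 10 in a row somewhere in a series of a given length, which for a 10-game series reduces to the 1 / 2^10 figure.

```python
def prob_streak(n_games: int, streak: int, p_win: float = 0.5) -> float:
    """Chance that one particular side wins at least `streak` games in a row
    somewhere in `n_games` games (assumed independent and drawless)."""
    # state[j] = probability that the current winning run has length j
    # and no run of length `streak` has been completed so far.
    state = [0.0] * streak
    state[0] = 1.0
    hit = 0.0  # probability that a full streak has already occurred
    for _ in range(n_games):
        new = [0.0] * streak
        for j, pr in enumerate(state):
            if pr == 0.0:
                continue
            new[0] += pr * (1.0 - p_win)      # a loss resets the run
            if j + 1 == streak:
                hit += pr * p_win             # this win completes the streak
            else:
                new[j + 1] += pr * p_win      # this win extends the run
        state = new
    return hit


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, "games:", prob_streak(n, 10))
```

The longer the series, the larger the returned probability, which is the "somewhere in the run" effect: a streak that is rare at the start of a match becomes likely if you are allowed to look for it anywhere in a long one.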

In the case I first studied, the recapture version lost the first game,
then won the next 11 in a row, then lost the next 6.  After the first 18
games, there were only 12 decisive results in the remaining 82 games.

A caveat of course:  the opening book plays a role in some of these games.
Which is difficult to eliminate unless we go for Nunn-type matches which I
don't like.

Another interesting result:  in one match, there were 17 draws.  In another
match, there were 71 draws.  I would think that these results were based on
some sort of random sampling had I not run the tests myself.

This has happened to me too often to count.  IE new idea, implement it, test
vs old, after 20 games new idea is ahead 10-5-5.  Come back in the morning and
it lost 22-35-43.

I agree that if I play games against old GNU chess versions, I don't see that
kind of nonsense very often, and that most of this is highlighted by the fact
that the two versions I am testing are _very_ close to each other, making the
test very difficult to interpret.
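
For anyone who wants to see this kind of spread without tying up a machine for days, here is a minimal Monte Carlo sketch in Python. It is only a model: it assumes every game is independent, that the two sides are identical, and that the draw rate is a fixed 35% (the figure used elsewhere in this thread). The names play_match and simulate are hypothetical, and real engine matches involve books and other effects that this ignores.

```python
import random


def play_match(n_games, p_draw, rng):
    """Play one simulated match between identical engines.
    Returns (wins, losses, draws) from side A's point of view."""
    wins = losses = draws = 0
    p_win = (1.0 - p_draw) / 2.0  # identical engines: wins and losses equally likely
    for _ in range(n_games):
        r = rng.random()
        if r < p_draw:
            draws += 1
        elif r < p_draw + p_win:
            wins += 1
        else:
            losses += 1
    return wins, losses, draws


def simulate(n_matches=50, n_games=100, p_draw=0.35, seed=1):
    """Run several independent matches with a seeded generator."""
    rng = random.Random(seed)
    return [play_match(n_games, p_draw, rng) for _ in range(n_matches)]


if __name__ == "__main__":
    results = simulate()
    margins = [w - l for w, l, d in results]
    print("best margin for A: ", max(margins))
    print("worst margin for A:", min(margins))
    print("draw counts range: ", min(d for _, _, d in results),
          "to", max(d for _, _, d in results))
```

Under this simple model the win/loss margin wanders quite a bit from match to match, but draw counts as far apart as 17 and 71 would be extremely unlikely, which suggests something beyond independent sampling (the opening book, for instance) is at work in the real matches.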



>
>This does not mean that 10 wins in a row means nothing.  It means that it's
>improbable that the outcome is due to chance.  Obviously that doesn't mean that
>the outcome is impossible, and if you do a long enough run, it's going to happen
>*somewhere in the run*.
>
>If you *start* your run with two unknown programs, and encounter 10 wins in a
>row right at the start, what does that tell you?  It's a very unlikely outcome.
>Not impossible, but unlikely.
>
>You have to make a conclusion.  Does it make sense to say, "This could have been
>chance, I'll throw it out"?  It doesn't make sense.  Of course it could have
>been chance, but if you aren't willing to accept that it is much more likely
>that one of the programs is stronger than the other one, you may as well not
>play matches.
>
>Of course, if you *know* that they are the same strength, because they are the
>same engine, all you are doing is a probability exercise.  You are just flipping
>a coin a bunch of times and going "ooh" if you get a weird outcome.  And if you
>get a lot of insane outcomes your test is probably broken.
>
>The most important point I'd like to make with this, is that any match result
>can be achieved due to chance, assuming some hypothetical Elo delta between the
>two programs.
>
>Once again discounting the possibility of draws, a result of +40 -60 or worse
>for you will come up by chance almost 3% of the time, assuming the two programs
>are of equal strength.
>
>Why does this inspire more confidence than a result of +0 -10, which will come
>up by chance only about 0.1% of the time?
>
>You have two outcomes.  One is very unlikely, one is ridiculously unlikely.  Why
>are people willing to believe that there is a higher probability that the second
>one is due to chance?
>
>The only real gotcha with the +0 -10 result is that some fool might do a bunch
>of 10-game runs until they get that result, or that they'll pick a +0 -10 run
>out of a larger run and claim that it means something, which it doesn't.
>
>If you start two programs playing, and achieve a +0 -10, you've all but proven
>that the second is better than the first.  If you do a fixed 100-game match and
>get +40 -60, the odds of an erroneous conclusion are *much higher*, and so
>should be looked upon with more skepticism.
>
>>> The figure *should* [obviously] hover around 50% points scored for each side.
>>>It is very unlikely that the ten game match will be close to 50%.  The 100 game
>>>match will probably be fairly close.  It is rather unlikely that the 1000 game
>>>match will be far from 50%, but it is very unlikely it will be exactly 50%.  In
>>>fact, if it should be exactly 50%, the Chi-Squared Test will reject it!  It
>>>throws out both things that don't seem to fit the model and also things that fit
>>>so perfectly something looks fishy.
>>
>>
>>I think that 1000 game match might well be way off from 50%.  IE it is not
>>unlikely that one side will pull ahead a significant amount, and then they
>>start playing equally for the rest of the match.  But there is nothing in
>>statistics that says after you flip a coin and get 100 consecutive heads,
>>that sometime later you will get 100 consecutive tails to offset them.  It
>>is more likely that this series will simply end with heads being ahead 100
>>counts...
>
>In a 1000 game match, the odds that your program's score will be between 462
>and 538 are over 98%, assuming that it's exactly as strong as its opponent, and
>assuming that draws are not possible.
>
>With a draw percentage of 35%, you have a 98% chance of being between 470 and
>530.  About half the time you'll be between 491 and 509.
>
>With 100 games and 35% draws, about 98% of the time you'll be somewhere between
>41 and 59, and half the time you'll be between 47 and 53.
>
>So you have quite a bit less variation with 1000 than you do with 100.
>
>bruce
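
The score ranges quoted above can be recomputed exactly with a short Python sketch. It assumes independent games between equal programs with a fixed draw probability, counts a draw as half a point, and treats the quoted ranges as inclusive; the function names score_distribution and prob_score_between are hypothetical.

```python
def score_distribution(n_games, p_draw):
    """Return a list `dist` where dist[h] is the probability that the total
    score equals h half-points (h / 2 points) after `n_games` games."""
    p_win = p_loss = (1.0 - p_draw) / 2.0  # equal programs
    dist = [1.0]  # before any game: score 0 with certainty
    for _ in range(n_games):
        new = [0.0] * (len(dist) + 2)
        for h, pr in enumerate(dist):
            new[h] += pr * p_loss      # loss adds 0 half-points
            new[h + 1] += pr * p_draw  # draw adds 1 half-point
            new[h + 2] += pr * p_win   # win adds 2 half-points
        dist = new
    return dist


def prob_score_between(n_games, p_draw, low, high):
    """Probability that the score lies between `low` and `high` points,
    inclusive (both given as whole points)."""
    dist = score_distribution(n_games, p_draw)
    return sum(dist[2 * low : 2 * high + 1])


if __name__ == "__main__":
    print("1000 games, no draws,  462..538:", prob_score_between(1000, 0.0, 462, 538))
    print("1000 games, 35% draws, 470..530:", prob_score_between(1000, 0.35, 470, 530))
    print("1000 games, 35% draws, 491..509:", prob_score_between(1000, 0.35, 491, 509))
    print(" 100 games, 35% draws,  41..59: ", prob_score_between(100, 0.35, 41, 59))
    print(" 100 games, 35% draws,  47..53: ", prob_score_between(100, 0.35, 47, 53))
```

A higher draw rate narrows the spread because each drawn game contributes no variance to the score, which is why the 35% draw ranges are tighter than the drawless one for the same number of games.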



Last modified: Thu, 15 Apr 21 08:11:13 -0700
