Computer Chess Club Archives



Subject: Re: If 75 Games are not considered a Statistical proof, neither is the SSDF.

Author: Robert Hyatt

Date: 18:24:13 01/30/01



On January 30, 2001 at 17:42:59, Dann Corbit wrote:

>On January 30, 2001 at 15:57:25, Bruce Moreland wrote:
>
>>On January 30, 2001 at 09:06:09, Jorge Pichard wrote:
>>
>>>Ever since I matched Nimzo 8 vs Junior 6 using my AMD K6-2 500 MHz and also
>>>matched them using my Athlon 800 MHz at G/60 and got different scores, some
>>>people have argued that those games were not statistically significant enough
>>>to prove anything at all. Then we must disregard the SSDF rating list, since each
>>>chess program only plays 40 games against each opponent and not 200 games.
>>>
>>>PS: I am still convinced that Nimzo 8 is one of the few programs just like
>>>Gandalf 4.32 that benefit the most by using the best hardware available. And
>>>they are not programmed specifically to outperform Fritz 6 on a particular
>>>hardware such as the AMD K6-2 450 MHz.
>>>
>>>Pichard.
>>
>>When you play a match you get some information.  You can see how the programs
>>played against each other, you know who won the match, and you know the score.
>>
>>It's perfectly valid to say that A beat B, if A won the match.  That's a simple
>>fact.  And you can look at the match and say that A played better than B.
>>That's more subjective, but perhaps you are expert enough that what you say is
>>true.
>>
>>You also have the score of the match.  If A beats B in a three-game match by a
>>score of 2-1, you have some data that you can use to make a judgement about how
>>A compares with B.  That A won the match is beyond question, but it is still an
>>open issue as to whether A is better than B in the completely true sense.
>>
>>You can assert that A is better than B, but you can also assert that B is better
>>than A.  In this particular case, there is not a lot of difference between these
>>two assertions.  If two programs are equal, and they are known to draw 30% of
>>the time, the odds that one would beat the other by a score of 2-1 are almost
>>40%.  So it is more likely than not that A is better than B, assuming that A is
>>at least the tiniest bit better than B, but there's also a 40% chance that B is
>>at least a tiny bit better than A.
>>
>>As you play more games, it is possible that you can make your "A is better than
>>B" conclusion with a higher chance of accuracy, but this is not necessarily
>>true.
>>
>>If you play some games, and A completely wipes out B in terms of match score,
>>you can assert that A is better than B, with fairly good reliability, but if you
>>play a lot more games, and the score is close, your chance of accuracy may
>>actually be less.
>>
>>In order to make a good claim that A is better than B, A needs to beat B by such
>>a score that the odds of the score being due to chance are quite low.  In
>>practice, this takes a lot of games, unless the match is a blowout.
>>
>>The closer two programs are to each other in terms of strength, the more games
>>will probably be necessary in order to prove with reasonable accuracy that one
>>is at least a little bit better than the other.
>>
>>There is no rule of thumb about how many games is enough, it depends completely
>>upon the score of the match.
>>
>>When you talk about the SSDF list, that's a different thing.  The games are
>>played as a series of matches, but your score in the match doesn't determine
>>your position on the list, your total score against all opponents does.  It
>>would probably actually be better if there were many more opponents and the
>>matches were shorter, in terms of figuring out who deserves the top spot on the
>>list, if what you are trying to do is measure general chess strength.
>>
>
>Additional measurements will not (in general) make the answer less accurate
>(unless something is wrong with the measurements).  However, if two programs are
>about equal, you will [basically] never determine which is stronger by playing
>them against each other.  For anyone who would like to prove this to themselves,
>just play a program against itself 10 times, 50 times, 100 times and 1000 times.

This is what most miss, statistics-wise.  For someone with the time, it would
be interesting to have them use xboard/winboard and play a bunch of 100 game
matches between the _same_ version of an engine.  The outcome can be absolutely
astounding.  IE out of 100 games, you might get 30 wins, 10 losses, 60 draws
by A.  The next time you get 60 wins, 20 losses, 20 draws by A.  You begin
to conclude A is better until you realize A and B are _identical_.  You run
the test again and B wins this time.
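
The experiment can also be approximated without an engine. The following is a
minimal Python sketch, not the xboard/winboard procedure itself. It assumes every
game is independent, that a draw occurs with a fixed probability (the 30% default
is borrowed from the draw rate quoted earlier in the thread), and that the
remaining games split evenly between the two identical sides. The name
`play_match` and the seed are illustrative only.

```python
import random

def play_match(games=100, draw_rate=0.3, rng=None):
    """Simulate one match between two identical engines.

    Assumption: games are independent; each is a draw with probability
    draw_rate, otherwise A or B wins with equal probability.
    Returns (wins, losses, draws) from A's point of view.
    """
    rng = rng or random.Random()
    wins = losses = draws = 0
    for _ in range(games):
        r = rng.random()
        if r < draw_rate:
            draws += 1
        elif r < draw_rate + (1.0 - draw_rate) / 2.0:
            wins += 1
        else:
            losses += 1
    return wins, losses, draws

if __name__ == "__main__":
    rng = random.Random(2001)  # arbitrary seed so the run is repeatable
    for i in range(10):
        w, l, d = play_match(rng=rng)
        print(f"match {i + 1}: A +{w} -{l} ={d}  score {w + 0.5 * d:.1f}/100")
```

Real engine matches need not satisfy the independence assumption (opening books
and learning can correlate games), so the spread this prints is only a baseline.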

I ran a _bunch_ of 100-game matches to convince myself that no-recapture was
better than using recapture in my program.  The first two 100-game matches had
me convinced that using recapture was better.  But the next 50 matches had
no-recapture winning almost all of them, although never by more than 10-20
points total (out of 100 total).
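
One way to see why a long series of matches is more persuasive than the first
two is a simple sign test: if the two versions were really equal, each decisive
match would be a coin flip. The sketch below computes that tail probability
exactly. The post says only that no-recapture won "almost all" of the next 50
matches, so the figure of 45 used here is a hypothetical stand-in, not a number
from the experiment. It also treats the matches as independent.

```python
from math import comb

def binomial_tail(n, k, p=0.5):
    """Probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

if __name__ == "__main__":
    matches = 50
    hypothetical_wins = 45  # assumption: "almost all" is not quantified in the post
    tail = binomial_tail(matches, hypothetical_wins)
    print(f"P(>= {hypothetical_wins} of {matches} match wins by luck) = {tail:.3g}")
```

The test ignores the margin of each match and only counts who won it, which is
why it is crude but hard to fool.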

Pretty interesting stuff.  If I take that first 100 game match, and extract
any consecutive N games I choose, I can produce any outcome I would want.

IE N-0-0, 0-N-0, 0-0-N, etc.  Which means that a small number of games is
hardly more than a crap-shoot.  Which makes you wonder what any tournament
winner means at all.  :)
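
A hedged illustration of the point about hand-picked stretches: the sketch below
generates one match under an assumed model (independent games, a fixed draw
rate, two equal sides; none of this comes from the actual games) and reports the
longest consecutive run of wins, losses, and draws. `simulate_results` and
`longest_runs` are illustrative names.

```python
import random
from itertools import groupby

def simulate_results(games=100, draw_rate=0.3, rng=None):
    """Return a list of 'W', 'L', 'D' results for two equal engines (assumed model)."""
    rng = rng or random.Random()
    side = (1.0 - draw_rate) / 2.0
    return rng.choices("WLD", weights=[side, side, draw_rate], k=games)

def longest_runs(results):
    """Return the longest consecutive run of each result in a game sequence."""
    best = {"W": 0, "L": 0, "D": 0}
    for outcome, group in groupby(results):
        best[outcome] = max(best[outcome], len(list(group)))
    return best

if __name__ == "__main__":
    results = simulate_results(rng=random.Random(42))  # arbitrary seed
    print("".join(results))
    print(longest_runs(results))
```

Any run of length N in the output is a window of N consecutive games that, taken
alone, reads as N-0-0, 0-N-0, or 0-0-N.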





> The figure *should* [obviously] hover around 50% points scored for each side.
>It is very unlikely that the ten game match will be close to 50%.  The 100 game
>match will probably be fairly close.  It is rather unlikely that the 1000 game
>match will be far from 50%, but it is very unlikely it will be exactly 50%.  In
>fact, if it should be exactly 50%, the Chi-Squared Test will reject it!  It
>throws out both things that don't seem to fit the model and also things that fit
>so perfectly something looks fishy.


I think that 1000 game match might well be way off from 50%.  IE it is not
unlikely that one side will pull ahead a significant amount, and then they
start playing equally for the rest of the match.  But there is nothing in
statistics that says after you flip a coin and get 100 consecutive heads,
that sometime later you will get 100 consecutive tails to offset them.  It
is more likely that this series will simply end with heads being ahead 100
counts...
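
The coin argument can be checked numerically. This sketch assumes a fair coin
and independent flips: it starts with heads already 100 ahead, flips many more
times, and averages the final lead over repeated trials. The function names,
seed, and trial counts are arbitrary choices for illustration.

```python
import random

def final_lead(head_start, flips, rng):
    """Lead of heads over tails after `flips` more fair, independent flips."""
    lead = head_start
    for _ in range(flips):
        lead += 1 if rng.random() < 0.5 else -1
    return lead

def average_final_lead(head_start=100, flips=1000, trials=500, seed=0):
    """Average the final lead over many repeated trials."""
    rng = random.Random(seed)
    total = sum(final_lead(head_start, flips, rng) for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    print(f"average lead after the extra flips: {average_final_lead():.1f}")
```

Under these assumptions the expected lead stays at the head start. Later flips
dilute the percentage toward 50%, but they do not pay back the count.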




>;-)


