Computer Chess Club Archives


Subject: Re: If 75 Games are not considered a Statistical proof, neither is the SSDF.

Author: Dann Corbit

Date: 14:42:59 01/30/01


On January 30, 2001 at 15:57:25, Bruce Moreland wrote:

>On January 30, 2001 at 09:06:09, Jorge Pichard wrote:
>
>>Ever since I matched Nimzo 8 vs Junior 6 using my AMD K6-2 500 MHz and also
>>matched them using my Athlon 800 MHz at G/60 and got different scores, some
>>people have argued that those games were not statistically significant enough
>>to prove anything at all. Then we must also disregard the SSDF rating list,
>>since each pair of chess programs plays only 40 games against each other, not
>>200 games.
>>
>>PS: I am still convinced that Nimzo 8 is one of the few programs just like
>>Gandalf 4.32 that benefit the most by using the best hardware available. And
>>they are not programmed specifically to outperform Fritz 6 on a particular
>>hardware such as the AMD K6-2 450 MHz.
>>
>>Pichard.
>
>When you play a match you get some information.  You can see how the programs
>played against each other, you know who won the match, and you know the score.
>
>It's perfectly valid to say that A beat B, if A won the match.  That's a simple
>fact.  And you can look at the match and say that A played better than B.
>That's more subjective, but perhaps you are expert enough that what you say is
>true.
>
>You also have the score of the match.  If A beats B in a three-game match by a
>score of 2-1, you have some data that you can use to make a judgement about how
>A compares with B.  That A won the match is beyond question, but it is still an
>open issue as to whether A is better than B in the completely true sense.
>
>You can assert that A is better than B, but you can also assert that B is better
>than A.  In this particular case, there is not a lot of difference between these
>two assertions.  If two programs are equal, and they are known to draw 30% of
>the time, the odds that one would beat the other by a score of 2-1 are almost
>40%.  So it is more likely than not that A is better than B, assuming that A is
>at least the tiniest bit better than B, but there's also a 40% chance that B is
>at least a tiny bit better than A.
>
>As you play more games, it is possible that you can make your "A is better than
>B" conclusion with a higher chance of accuracy, but this is not necessarily
>true.
>
>If you play some games, and A completely wipes out B in terms of match score,
>you can assert that A is better than B, with fairly good reliability, but if you
>play a lot more games, and the score is close, your chance of accuracy may
>actually be less.
>
>In order to make a good claim that A is better than B, A needs to beat B by such
>a score that the odds of the score being due to chance are quite low.  In
>practice, this takes a lot of games, unless the match is a blowout.
>
>The closer two programs are to each other in terms of strength, the more games
>will probably be necessary in order to prove with reasonable accuracy that one
>is at least a little bit better than the other.
>
>There is no rule of thumb about how many games are enough; it depends completely
>upon the score of the match.
>
>When you talk about the SSDF list, that's a different thing.  The games are
>played as a series of matches, but your score in the match doesn't determine
>your position on the list, your total score against all opponents does.  It
>would probably actually be better if there were many more opponents and the
>matches were shorter, in terms of figuring out who deserves the top spot on the
>list, if what you are trying to do is measure general chess strength.
>
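
To put numbers on the three-game example quoted above, here is a minimal sketch that works out the exact score distribution for two equal programs. It assumes, as in the quoted text, a 30% draw rate, and (my assumption) that the remaining 70% splits evenly into wins and losses and that games are independent. The names P, POINTS, score_distribution and prob_score_at_least are made up for illustration. Under those assumptions a specific side wins exactly 2-1 about 22% of the time, either side wins 2-1 about 45% of the time, and a specific side wins the match by any margin about 38% of the time. The quoted "almost 40%" does not say which of these it means.

```python
from fractions import Fraction

# Assumed per-game probabilities for two equal programs (30% draws as quoted;
# the even win/loss split and independence between games are assumptions).
P = {"win": Fraction(35, 100), "draw": Fraction(30, 100), "loss": Fraction(35, 100)}
POINTS = {"win": Fraction(1), "draw": Fraction(1, 2), "loss": Fraction(0)}


def score_distribution(n_games):
    """Exact distribution of side A's total points over n_games games."""
    dist = {Fraction(0): Fraction(1)}
    for _ in range(n_games):
        nxt = {}
        # Add one more game to every total reached so far.
        for total, prob in dist.items():
            for result, p in P.items():
                key = total + POINTS[result]
                nxt[key] = nxt.get(key, Fraction(0)) + prob * p
        dist = nxt
    return dist


def prob_score_at_least(n_games, points):
    """Chance that A scores at least `points` purely by luck against an equal opponent."""
    return sum(p for s, p in score_distribution(n_games).items() if s >= points)


dist3 = score_distribution(3)
a_wins_2_1 = dist3[Fraction(2)]                    # A scores exactly 2 of 3
either_wins_2_1 = dist3[Fraction(2)] + dist3[Fraction(1)]
a_wins_match = prob_score_at_least(3, 2)           # A scores 2, 2.5 or 3

print("A wins exactly 2-1:     ", float(a_wins_2_1))
print("either side wins 2-1:   ", float(either_wins_2_1))
print("A wins the 3-game match:", float(a_wins_match))
# A 40-game match of the kind mentioned for the SSDF list, scored 24-16:
print("A scores 24+ of 40:     ", float(prob_score_at_least(40, 24)))
```

The same prob_score_at_least call can be pointed at any match length and score to see how likely that result, or a more lopsided one, would be between equal programs.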

Additional measurements will not (in general) make the answer less accurate
(unless something is wrong with the measurements).  However, if two programs are
about equal, you will [basically] never determine which is stronger by playing
them against each other.  For anyone who would like to prove this to themselves,
just play a program against itself 10 times, 50 times, 100 times and 1000 times.
The figure *should* [obviously] hover around 50% of the points for each side.
The ten-game match can easily land well away from 50%.  The 100-game
match will probably be fairly close.  It is rather unlikely that the 1000-game
match will be far from 50%, but it is very unlikely it will be exactly 50%.  In
fact, if it should be exactly 50%, a Chi-Squared test that also checks the lower
tail will flag it!  Used that way, it throws out both things that don't seem to
fit the model and also things that fit so perfectly that something looks fishy.
;-)
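
For anyone without the patience to play 1000 games, the self-play experiment can be roughed out in a simulation. This is only a sketch under assumptions that are not from any real engine: each game is an independent draw with a 35% win, 30% draw, 35% loss split (P_WIN and P_DRAW are made-up constants), and the helper names are hypothetical. It prints one simulated match per length, the average distance from 50% over many such matches, and the theoretical standard deviation of the score fraction, sqrt((1 - P_DRAW) / 4 / n).

```python
import math
import random

# Assumed per-game probabilities for a program playing itself (hypothetical).
P_WIN, P_DRAW = 0.35, 0.30   # the remaining 0.35 is the loss probability


def match_score_fraction(n_games, rng):
    """Simulate n_games games; return the fraction of points scored by side A."""
    points = 0.0
    for _ in range(n_games):
        r = rng.random()
        if r < P_WIN:
            points += 1.0
        elif r < P_WIN + P_DRAW:
            points += 0.5
    return points / n_games


def mean_abs_deviation(n_games, trials, rng):
    """Average distance from 50% over many simulated matches of the same length."""
    total = sum(abs(match_score_fraction(n_games, rng) - 0.5) for _ in range(trials))
    return total / trials


def theoretical_sd(n_games):
    """Standard deviation of the score fraction for equal opponents."""
    return math.sqrt((1.0 - P_DRAW) / 4.0 / n_games)


if __name__ == "__main__":
    rng = random.Random(2001)  # fixed seed so reruns give the same numbers
    for n in (10, 50, 100, 1000):
        one = match_score_fraction(n, rng)
        print(f"{n:5d} games: one match scored {one:.1%}, "
              f"average miss {mean_abs_deviation(n, 300, rng):.1%}, "
              f"theoretical sd {theoretical_sd(n):.1%}")
```

The spread shrinks only with the square root of the number of games, which is why two nearly equal programs need so many games before the score says anything about which one is stronger.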



Last modified: Thu, 15 Apr 21 08:11:13 -0700
