Author: Christophe Theron
Date: 11:40:35 01/28/00
On January 28, 2000 at 07:27:54, Enrique Irazoqui wrote:
>There is a degree of uncertainty, but I don't think you need 1000 matches of 200
>games each to have an idea of who is best.
>
>Fischer became a chess legend for the games he played from his comeback in
>1970 to the Spassky match of 1972. In this period he played 157 games that
>proved to all of us, without a hint of a doubt, that he was the very best
>chess player of those times.
>
>Kasparov has been the undisputed best for many years. From 1984 until now, he
>has played a total of 772 rated games. He needed fewer than half of these
>games to convince everyone who the best chess player is.
>
>This makes more sense to me than the probability stuff of your Qbasic program.
>Otherwise we would reach the absurd conclusion that all the rankings in the
>history of chess are meaningless, and that Capablanca, Fischer and Kasparov
>just had long streaks of luck.
>
>You must have thought along these lines too when you proposed the matches
>Tiger-Diep and Tiger-Crafty as being meaningful, in spite of not being 200,000
>games long.
>
>Enrique
Enrique, I'm not sure you understand me.
What my little QBasic program will tell you, if you try it, is that when the two
programs are very close in strength you need an incredible number of games in
order to determine which one is best.
And when the elo difference between the programs is high enough, a small number
of games is enough.
From my RNDMATCH program, I have derived the following table:
Reliability of chess matches (the table holds with an 80% confidence):

    Games   Margin over 50%   Elo difference
       10        14.0%            105 pts
       20        11.0%             77 pts
       30         9.0%             63 pts
       40         8.0%             56 pts
       50         7.0%             49 pts
      100         5.0%             35 pts
      200         3.5%             25 pts
      400         2.5%             18 pts
      600         2.2%             15 pts
I hope others will have a critical look at my table and correct my maths if
needed.
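To make checking easier, here is a small Monte Carlo sketch of the idea behind
RNDMATCH. I don't have the QBasic listing at hand, so this Python version is my
reconstruction, not the original program: the draw rate is a parameter I made
up, the function names are mine, and the exact margins it prints depend on the
draw model and on how the 20% tail is measured, so expect some differences from
my table.

    import random

    def match_score(games, draw_rate=0.0):
        # Points fraction scored by one of two EQUAL programs.
        # draw_rate is my assumption, not a value from the real RNDMATCH.
        score = 0.0
        for _ in range(games):
            r = random.random()
            if r < draw_rate:
                score += 0.5                              # draw
            elif r < draw_rate + (1.0 - draw_rate) / 2.0:
                score += 1.0                              # win
        return score / games

    def margin_80(games, trials=20000):
        # Margin over 50% that an equal opponent still reaches about
        # 20% of the time: the 80th percentile of the score distribution.
        scores = sorted(match_score(games) for _ in range(trials))
        return scores[int(0.80 * trials)] - 0.5

    for n in (10, 20, 40, 100, 200, 400):
        print("%4d games: margin over 50%% = %.1f%%" % (n, 100 * margin_80(n)))

Whatever the draw model, the margin shrinks roughly like 1/sqrt(games): to cut
it in half you must play four times as many games. That is the whole reason why
tiny elo differences need so many games.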
What this table tells you is that with a 10-game match you can say that one
program is better ONLY if it gets a result above 64% (50 + 14.0). In that case
you can say, with an 80% chance of being right, that this program is at least
105 elo points better than its opponent.
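For reference, the conversion from a winning percentage to elo points follows
from the usual elo expectancy formula. A quick sketch (this uses the logistic
formula, and the function name is mine; depending on the exact conversion used,
the figures can differ by a few points from my table):

    from math import log10

    def elo_diff(score):
        # Rating difference implied by an expected score (logistic model).
        return 400.0 * log10(score / (1.0 - score))

    print(elo_diff(0.64))    # ~100 pts, close to the 105 in the table
    print(elo_diff(0.525))   # ~17 pts, close to the 18 pts for 400 games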
Note that you still have a 20% chance of being wrong. But for practical use I
think it's enough.
I don't think this result sounds counterintuitive to most of us here.
Now if you play 20 games you can detect, with an 80% confidence, whether one
program is at least 77 elo points better than its opponent. No revolution here,
I think.
Play 40 games and you can, with the same 80% confidence, detect a difference of
56 elo points.
What's very important, and I think overlooked by most testers, is that when the
elo difference between two programs is tiny, the number of games to play becomes
tremendous.
For example, if the programs are separated by only 18 elo points, you need to
play 400 GAMES! If you don't, you CANNOT DRAW ANY CONCLUSION.
The right methodology for a match between two programs is this: you must play
on until the winning percentage of one of the programs becomes decisive (a
small sketch of this stopping rule follows the list below).
After 10 games, if neither program scores 64.0% or more => play on
After 20 games, if neither program scores 61.0% or more => play on
After 40 games, if neither program scores 58.0% or more => play on
After 100 games, if neither program scores 55.0% or more => play on
After 200 games, if neither program scores 53.5% or more => play on
And so on.
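In code, the stopping rule can be sketched like this (the checkpoints and
thresholds are the ones listed above; values in between would have to be
interpolated, and the names are mine, just for illustration):

    # Thresholds from the list above (80% confidence checkpoints).
    THRESHOLDS = {10: 0.640, 20: 0.610, 40: 0.580, 100: 0.550, 200: 0.535}

    def decisive(games_played, leader_points):
        # True if the leading program's score crosses the threshold
        # at one of the checkpoints; otherwise keep playing.
        t = THRESHOLDS.get(games_played)
        if t is None:
            return False              # between checkpoints: play on
        return leader_points / float(games_played) >= t

    # Example: after 40 games the leader has 24.5 points (61.25%),
    # which is above 58.0%, so the match is decisive.
    print(decisive(40, 24.5))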
If you play two identical programs, you are likely to play on forever. That
sounds strange, but it's only logical.
And to answer your question, I thought that playing 40 games between Tiger and
Diep and 40 games between Tiger and Crafty would be enough, because I think the
difference between Tiger and these programs is above 56 elo points.
Christophe