Computer Chess Club Archives

Subject: Re: Shredder crushing Chess Tiger.

Author: Andrew Dados
Date: 12:26:11 12/15/03
On December 15, 2003 at 12:39:26, Christophe Theron wrote:

>On December 15, 2003 at 10:24:45, Andrew Dados wrote:
>
>>On December 15, 2003 at 01:25:40, Christophe Theron wrote:
>>
>>>On December 14, 2003 at 19:26:30, J F wrote:
>>>
>>>>Christophe, how many games do you recommend playing before you can draw a
>>>>conclusion?
>>>
>>>
>>>
>>>I think you are not going to like the answer. :)
>>>
>>>It depends on:
>>>* the reliability you want (do you want a 70% reliability? 80%? 90%? 95%?)
>>>* the elo difference between the programs
>>>
>>>If you want a very good reliability in the result (for example 95%) and the two
>>>programs are very close in elo, then you might need several thousand games.
>>>
>>>There is no simple answer to your question. However, I know that there exists a
>>>program called "whoisbetter" that can, given a match result, tell you if one
>>>program can be considered better than its opponent.
>>>
>>>The very important thing to remember is that in order to know which of the top
>>>PC chess programs is better, you will definitely need several thousand games,
>>>believe it or not. So it's always funny to see somebody giving an opinion
>>>after 5 games.
>>>
>>>
>>>Below is a table that can be used to get an idea of the number of games to play
>>>to get a given error margin (in winning percentage and in elo difference) for a
>>>given reliability (percentage of confidence).
>>>
>>>The tables say that, for example, if you want to know with 90% reliability which
>>>opponent is better you will have to play 1000 games if their elo difference is
>>>15 points. If their elo difference is below 10 points, you will have to play
>>>more than 2000 games...
>>>
>>>Reliability of chess matches
>>>
>>>90% confidence
>>>Games    %err+/-    elo+/-
>>>    10     20        140pts
>>>    20     15        105pts
>>>    25     14         98pts
>>>    30     12         84pts
>>>    40     10         70pts
>>>    50      9         56pts
>>>   100      6.5       35pts
>>>   200      4.72      33pts
>>>   400      3.34      23pts
>>>   600      2.66      19pts
>>>   800      2.39      17pts
>>>  1000      2.12      15pts
>>>  1200      2.00      14pts
>>>  1400      1.81      13pts
>>>  1600      1.66      12pts
>>>  2000     ~1.50      11pts
>>>
>>>80% confidence
>>>Games    %err+/-    elo+/-
>>>    10     15        105pts
>>>    20     11         77pts
>>>    25     10         70pts
>>>    30      9         63pts
>>>    40      8         56pts
>>>    50      7         49pts
>>>   100      5.0       35pts
>>>   200      3.75      26pts
>>>   400      2.60      18pts
>>>   600      2.15      15pts
>>>   800      1.86      13pts
>>>  1000      1.66      12pts
>>>  1200      1.46      10pts
>>>  1400      1.40      10pts
>>>  1600      1.34       9pts
>>>
>>>70% confidence
>>>Games    %err+/-    elo+/-
>>>    10     15        105pts
>>>    20     10         70pts
>>>    25      8         56pts
>>>    30      8         56pts
>>>    40      6.3       44pts
>>>    50      6.0       42pts
>>>   100      4.0       28pts
>>>   200      3.0       21pts
>>>   400      2.2       15pts
>>>   600      1.7       12pts
>>>   800      1.5       11pts
>>>  1000      1.3        9pts
>>>  1200      1.24       9pts
>>>  1400      1.14       8pts
>>>  1600      1.04       7pts
>>>
>>>
>>>
>>>    Christophe
>>
>>I always wondered how those tables are calculated. Since we have no model that
>>includes draw scores and draw probabilities in any satisfactory way, all those
>>tables are just guessed (or, most likely, draw probabilities are just ignored).
>>
>>If draws and their chances are ignored, dividing the games column by 2 is the
>>best guess: each chess game has 3 outcomes, not 2, so every game is worth
>>roughly 2 coin tosses, not one. (Roughly, because the draw percentage depends on
>>opponent strength, and this is the problem here: we don't know the expected
>>percentage of drawn games.)
>>
>>whoisbetter is one example of a statistic that ignores one of the 3 possible
>>outcomes (an extreme case), and thus produces incorrect probabilities.
>>
>>-Andrew-
>
>
>
>I'm not good enough at statistics to have produced these tables from a formula.
>
>I have built these tables empirically, with a program producing random outcomes
>of chess games with the chances to win, draw or lose being equal. This is where
>my logic is biased: the chances to win for white seem to be higher than just
>1/3.
>
>The tables have been produced by generating a very high number of simulated
>matches and then crunching the numbers.
>
>I expect my results to be close to theoretical results. I have published these
>tables several times and I have always asked for somebody to give me better
>estimates. I'm still waiting.
>
>
>
>    Christophe

Ok, now it makes more sense to me. Still, the same question remains:
 If one program is better by 100 elo, what is the chance of a draw in a single
game (and consequently what is the w/d/l distribution)? A simple model assumes
this should not depend on their average strength, yet in practice it makes a big
difference (more draws as the players' strength increases, of course). Also, your
note about the score being biased towards white adds some complexity.

 Since we have no idea what the expected w/d/l distribution is (you assumed 1/3
each), we can't correctly predict win/lose chances. Could you some day possibly
rerun your simulation with a different w/d/l distribution (but yielding the same
rating difference)? I am curious how stable the numbers in that table are...
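
One rough way to see how the error column reacts to the draw rate, without rerunning a full simulation, is a normal approximation. The sketch below is only that: it assumes every game is an independent draw from one fixed w/d/l distribution, ignores colour, and converts score to Elo with the usual logistic curve. The function names are made up for illustration. With 1/3 each, 1000 games and 90% confidence it gives roughly 2.1% and about 15 Elo, in line with the table quoted above.

```python
import math
from statistics import NormalDist


def score_margin(games, win, draw, confidence):
    """Two-sided margin on the match score fraction (normal approximation).

    Assumption: games are independent with fixed win/draw/loss chances.
    """
    loss = 1.0 - win - draw
    if min(win, draw, loss) < 0.0:
        raise ValueError("win and draw must be probabilities summing to <= 1")
    mean = win + 0.5 * draw                   # expected score per game
    variance = win + 0.25 * draw - mean ** 2  # E[x^2] - mean^2 per game
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * math.sqrt(variance / games)


def elo_from_score(score):
    """Elo difference for an expected score, logistic model (an assumption)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


if __name__ == "__main__":
    # Same 50% expected score, different draw rates.
    for win, draw in [(1 / 3, 1 / 3), (0.2, 0.6), (0.45, 0.1)]:
        m = score_margin(1000, win, draw, 0.90)
        print(f"w={win:.2f} d={draw:.2f}  "
              f"+/-{100 * m:.2f}%  +/-{elo_from_score(0.5 + m):.0f} elo")
```

Under these assumptions a higher draw rate shrinks the margin for the same number of games, and a lower one widens it.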

My very simple simulation:
For program A better than B by 100 elo, the expected score is 0.69. Let's play a
10-game match (100 000 times):

a) assuming win chance of 0.59 and draw chance of 0.2:
A wins 89.5% of matches, draws 5.0% and loses 5.4%

b) assuming win chance of 0.49 and draw chance of 0.4 (so same expected score):
A wins 93.4% of matches, draws 3.9% and loses 2.5%

While I still have no idea what the real chance of a draw between those
programs would be, I can say it greatly influences our expected score table
(even the error column)...
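
For anyone who wants to repeat this, a simulation of this kind could look like the sketch below. It is not the original program, just a minimal reconstruction under the stated assumptions: independent games, fixed win and draw chances, colour ignored. The function name and the seed are made up for illustration. The 0.69 expected score is taken from the post as given; under the standard logistic Elo formula a 100-point gap corresponds to about 0.64.

```python
import random


def simulate_matches(win, draw, games=10, matches=100_000, seed=0):
    """Return the fractions of matches that A wins, draws and loses."""
    rng = random.Random(seed)
    a_wins = drawn = a_loses = 0
    for _ in range(matches):
        double_points = 0  # twice A's score, so everything stays an integer
        for _ in range(games):
            r = rng.random()
            if r < win:
                double_points += 2      # A wins the game
            elif r < win + draw:
                double_points += 1      # drawn game
        if double_points > games:       # A scored more than half the points
            a_wins += 1
        elif double_points == games:    # match tied
            drawn += 1
        else:
            a_loses += 1
    return a_wins / matches, drawn / matches, a_loses / matches


if __name__ == "__main__":
    for label, win, draw in [("a", 0.59, 0.2), ("b", 0.49, 0.4)]:
        w, d, l = simulate_matches(win, draw)
        print(f"{label}) A wins {100 * w:.1f}% of matches, "
              f"draws {100 * d:.1f}% and loses {100 * l:.1f}%")
```

Both settings have the same expected score per game (0.59 + 0.2/2 = 0.49 + 0.4/2 = 0.69), so any difference in the match outcome comes from the draw rate alone.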

-Andrew-