Computer Chess Club Archives


Subject: Reliability of chess matches

Author: Christophe Theron

Date: 16:00:24 12/15/03


On December 15, 2003 at 15:59:22, Andrew Dados wrote:

>On December 15, 2003 at 15:26:11, Andrew Dados wrote:
>
>>On December 15, 2003 at 12:39:26, Christophe Theron wrote:
>>
>>>On December 15, 2003 at 10:24:45, Andrew Dados wrote:
>>>
>>>>On December 15, 2003 at 01:25:40, Christophe Theron wrote:
>>>>
>>>>>On December 14, 2003 at 19:26:30, J F wrote:
>>>>>
>>>>>>Christophe, how many games do you recommend playing before you can draw a
>>>>>>conclusion?
>>>>>
>>>>>
>>>>>
>>>>>I think you are not going to like the answer. :)
>>>>>
>>>>>It depends on:
>>>>>* the reliability you want (do you want a 70% reliability? 80%? 90%? 95%?)
>>>>>* the elo difference between the programs
>>>>>
>>>>>If you want a very good reliability in the result (for example 95%) and the two
>>>>>programs are very close in elo, then you might need several thousand games.
>>>>>
>>>>>There is no simple answer to your question. However, I know that there exists a
>>>>>program called "whoisbetter" that can, given a match result, tell you if one
>>>>>program can be considered better than its opponent.
>>>>>
>>>>>The very important thing to remember is that in order to know which of the top
>>>>>PC chess programs is better, you will definitely need several thousand
>>>>>games, believe it or not. So it's always funny to see somebody giving an opinion
>>>>>after 5 games.
>>>>>
>>>>>
>>>>>Below is a table that can be used to get an idea of the number of games to play
>>>>>to get a given error margin (in winning percentage and in elo difference) for a
>>>>>given reliability (percentage of confidence).
>>>>>
>>>>>The tables say that, for example, if you want to know with 90% reliability which
>>>>>opponent is better you will have to play 1000 games if their elo difference is
>>>>>15 points. If their elo difference is below 10 points, you will have to play
>>>>>more than 2000 games...
>>>>>
>>>>>Reliability of chess matches
>>>>>
>>>>>90% confidence
>>>>>Games    %err+/-    elo+/-
>>>>>    10     20        140pts
>>>>>    20     15        105pts
>>>>>    25     14         98pts
>>>>>    30     12         84pts
>>>>>    40     10         70pts
>>>>>    50      9         56pts
>>>>>   100      6.5       46pts
>>>>>   200      4.72      33pts
>>>>>   400      3.34      23pts
>>>>>   600      2.66      19pts
>>>>>   800      2.39      17pts
>>>>>  1000      2.12      15pts
>>>>>  1200      2.00      14pts
>>>>>  1400      1.81      13pts
>>>>>  1600      1.66      12pts
>>>>>  2000     ~1.50      11pts
>>>>>
>>>>>80% confidence
>>>>>Games    %err+/-    elo+/-
>>>>>    10     15        105pts
>>>>>    20     11         77pts
>>>>>    25     10         70pts
>>>>>    30      9         63pts
>>>>>    40      8         56pts
>>>>>    50      7         49pts
>>>>>   100      5.0       35pts
>>>>>   200      3.75      26pts
>>>>>   400      2.60      18pts
>>>>>   600      2.15      15pts
>>>>>   800      1.86      13pts
>>>>>  1000      1.66      12pts
>>>>>  1200      1.46      10pts
>>>>>  1400      1.40      10pts
>>>>>  1600      1.34       9pts
>>>>>
>>>>>70% confidence
>>>>>Games    %err+/-    elo+/-
>>>>>    10     15         105pts
>>>>>    20     10          70pts
>>>>>    25      8          56pts
>>>>>    30      8          56pts
>>>>>    40      6.3        44pts
>>>>>    50      6.0        42pts
>>>>>   100      4.0        28pts
>>>>>   200      3.0        21pts
>>>>>   400      2.2        15pts
>>>>>   600      1.7        12pts
>>>>>   800      1.5        11pts
>>>>>  1000      1.3         9pts
>>>>>  1200      1.24        9pts
>>>>>  1400      1.14        8pts
>>>>>  1600      1.04        7pts
>>>>>
>>>>>
>>>>>
>>>>>    Christophe
>>>>
>>>>I always wondered how those tables are calculated. Since we have no model which
>>>>includes draw scores and draw possibilities in any satisfactory way, all those
>>>>tables are just guessed (or, most likely, draw score possibilities are just
>>>>ignored).
>>>>
>>>>If draws and their chances are ignored, dividing the Games column by 2 is the
>>>>best guess: each chess game has 3 outcomes, not 2, so every game is worth 2 coin
>>>>tosses, not one (roughly; the draw percentage depends on the opponents' strength,
>>>>and this is the problem here: we don't know the expected percentage of drawn games).
>>>>
>>>>whoisbetter is one example of a statistic ignoring one of the 3 possible scores
>>>>(it goes to the extreme), and thus produces incorrect probabilities.
>>>>
>>>>-Andrew-
>>>
>>>
>>>
>>>I'm not good enough at statistics to have produced these tables from a formula.
>>>
>>>I have built these tables empirically, with a program producing random outcomes
>>>of chess games with the chances to win, draw or lose being equal. This is where
>>>my logic is biased: the chances to win for white seem to be higher than just
>>>1/3.
>>>
>>>The tables have been produced by generating a very high number of simulated
>>>matches and then crunching the numbers.
>>>
>>>I expect my results to be close to theoretical results. I have published these
>>>tables several times and I have always asked for somebody to give me better
>>>estimates. I'm still waiting.
>>>
>>>
>>>
>>>    Christophe
>>
>>Ok, now it makes more sense to me. Still, the same question remains:
>> If one program is better by 100 elo, what is the chance of a draw in a single
>>game (and consequently what is the w/d/l distribution)? A simple model assumes this
>>should not depend on their average strength, yet in practice it makes a big
>>difference (of course, more draws as the players' strength increases). Also, your
>>note about the score being biased towards white adds some complexity.
>>
>> Since we have no idea what the expected w/d/l distribution is (you assumed 1/3
>>each), we can't correctly predict win/lose chances. Could you some day possibly
>>rerun your simulation with a different w/d/l distribution (but yielding the same
>>rating difference)? I am curious how stable the numbers in that table are...
>>
>>My very simple simulation:
>>For program A better than B by 100 elo, the expected score is 0.69. Let's play a
>>10-game match (100 000 times):
>>
>>a) assuming win chance of 0.59 and draw chance of 0.2:
>>A wins 89.5% matches, draws 5.0% and loses 5.4%
>>
>>b) assuming win chance of 0.49 and draw chance of 0.4 (so same expected score):
>>A wins 93.4% matches, draws 3.9% and loses 2.5%
>>
>>While I still have no idea what the real chance of a draw between those
>>programs would be, I can say it influences our expected score table (even the
>>error column) greatly...
>
>(obvious decimal error corrected with % scores :)
>
>Note the somewhat paradoxical result: the higher the chance of a draw in a
>single game, the lower the chance that the better player will lose (or draw) the match.
>
>...And since more draws happen between stronger players, the confidence of a 10-game
>match is higher towards the top. Maybe much higher.
>
>
>>
>>-Andrew-



I have changed the subject line, I hope you don't mind, because it sucked! :)

I'm not sure I can answer your question but I would really LOVE to see somebody
finally trying to answer it with a better approach than mine.
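
One way to look at the draw question quoted above without simulation noise is to compute the match-outcome probabilities exactly. What follows is only a sketch in Python (not the language used elsewhere in this thread). The function name match_outcome_probs is made up for illustration, the per-game win and draw chances are taken as given from the quoted post, and every game is assumed to be independent with the same probabilities.

```python
def match_outcome_probs(p_win, p_draw, n_games):
    """Exact (P(A wins match), P(match drawn), P(A loses match)).

    Assumption: games are independent and share the same win/draw/loss chances.
    """
    p_loss = 1.0 - p_win - p_draw
    if min(p_win, p_draw) < 0.0 or p_loss < -1e-12:
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    p_loss = max(p_loss, 0.0)

    # dist[i] = probability that (wins - losses) equals i - n_games so far.
    size = 2 * n_games + 1
    dist = [0.0] * size
    dist[n_games] = 1.0  # before any game the difference is 0

    for _ in range(n_games):
        new = [0.0] * size
        for i, p in enumerate(dist):
            if p == 0.0:
                continue
            new[i] += p * p_draw          # draw: difference unchanged
            if i + 1 < size:
                new[i + 1] += p * p_win   # win: difference +1
            if i - 1 >= 0:
                new[i - 1] += p * p_loss  # loss: difference -1
        dist = new

    win = sum(dist[n_games + 1:])
    draw = dist[n_games]
    loss = sum(dist[:n_games])
    return win, draw, loss


if __name__ == "__main__":
    # The two cases from the quoted 10-game experiment.
    for p_win, p_draw in ((0.59, 0.2), (0.49, 0.4)):
        w, d, l = match_outcome_probs(p_win, p_draw, 10)
        print(f"win={p_win} draw={p_draw}: "
              f"A wins {100 * w:.1f}%, draws {100 * d:.1f}%, loses {100 * l:.1f}%")
```

The printed figures can be compared with the simulated 10-game results quoted above. The sketch says nothing about which draw rate is realistic for a given pair of programs.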

I have assumed that the probabilities and error bars would not change much if
w/d/l probability was taken as 40/30/30 for example (seems to be a little more
realistic) but maybe I'm dead wrong.
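
A rough way to test this assumption is a normal approximation of the match score. The sketch below is in Python and is not how the table was built. The name score_margin is hypothetical, games are assumed independent, and the 7 elo per percentage point conversion is simply the factor used in the QBasic program. For 40/30/30 the margin is measured around the expected score of 55%, not around 50%.

```python
from math import sqrt
from statistics import NormalDist

ELO_PER_PERCENT = 7.0  # rule of thumb taken from the QBasic program in this post


def score_margin(p_win, p_draw, n_games, confidence):
    """Two-sided error margin on the match score, in percentage points.

    Normal approximation; assumes independent games with fixed probabilities.
    """
    mean = p_win + 0.5 * p_draw                      # expected score per game
    variance = p_win + 0.25 * p_draw - mean ** 2     # E[x^2] - E[x]^2 per game
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return 100.0 * z * sqrt(variance / n_games)


if __name__ == "__main__":
    for label, p_win, p_draw in (("1/3-1/3-1/3", 1 / 3, 1 / 3),
                                 ("40/30/30", 0.4, 0.3)):
        m = score_margin(p_win, p_draw, 1000, 0.90)
        print(f"{label}: +/-{m:.2f}% (about {m * ELO_PER_PERCENT:.0f} elo) "
              f"for 1000 games at 90% confidence")
```

Under this approximation the per-game variance is 1/6 for 1/3-1/3-1/3 and 0.1725 for 40/30/30, so the margins differ by a little under 2%. Larger draw rates shrink the variance more noticeably.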

As usual, I had to be pragmatic. I needed to be able to evaluate the results of
my matches in order to say that a change was an improvement or not. So I had to
move forward and decided to use the above table with a possibly slightly wrong
assumption about 1/3-1/3-1/3.

As I told you I have posted the above table several times in the hope that
somebody would take the time to correct it.

Here is the ugly QBasic program I have used to build it; maybe it will help (I
think I have already posted it because it is in English, but it did not help to
find a volunteer last time):


CLS
PRINT "*** Simulation of chess matches ***"
PRINT
PRINT "We assume that opponents are exactly of equal strength."
PRINT "And that (win;draw;loss) probabilities are (1/3; 1/3; 1/3)."
PRINT

RANDOMIZE TIMER

' LONG instead of INTEGER: a 16-bit INTEGER overflows above 32767 matches
DIM nbmatch AS LONG
DIM nbgames AS LONG
DIM limit AS SINGLE

INPUT "Number of matches to play "; nbmatch
INPUT "Number of games in each match "; nbgames
INPUT "Compute probability of error greater than "; limit

DIM totdiff AS SINGLE
totdiff = 0
DIM maxdiff AS SINGLE
maxdiff = 0
' Counters are LONG so they cannot overflow when many matches are simulated
DIM nbmax AS LONG
nbmax = 0
DIM nboverlimit AS LONG
nboverlimit = 0

DIM total AS SINGLE
DIM percent AS SINGLE
DIM diff AS SINGLE
DIM result AS INTEGER
DIM game AS INTEGER


FOR match = 1 TO nbmatch

  total = 0       ' total is the score of player A

  FOR game = 1 TO nbgames

    result = INT(RND * 300) ' result between 0 and 299 included

    IF result < 100 THEN
      ' DRAW
      'PRINT "1/2 - 1/2"
      total = total + .5
    ELSE
      IF result < 200 THEN
        ' PLAYER A LOSES: total stays unchanged
        'PRINT " 0  -  1"
      ELSE
        ' PLAYER A WINS
        'PRINT " 1  -  0"
        total = total + 1
      END IF
    END IF

  NEXT game

  percent = total / nbgames * 100!
  diff = ABS(percent - 50)

  PRINT match;
  LOCATE CSRLIN, 1
  'PRINT "Outcome of the match: "; total; "-"; nbgames - total; "  (";
  'PRINT USING "###.##"; percent;
  'PRINT "%, error=";
  'PRINT USING "##.##"; diff;
  'PRINT " )"

  IF diff > maxdiff THEN
    maxdiff = diff
    nbmax = 0
  END IF
  IF diff = maxdiff THEN nbmax = nbmax + 1
  totdiff = totdiff + diff
  IF diff > limit THEN nboverlimit = nboverlimit + 1

NEXT match

PRINT "      "
PRINT "Maximum error: ";
PRINT USING "##.##"; maxdiff;
PRINT "   (elo diff = ";
PRINT USING "###"; maxdiff * 7!;
PRINT ")   occurred in ";
PRINT USING "###.###"; nbmax / nbmatch * 100!;   ' scale the fraction to a percentage
PRINT "% of the matches"
PRINT "Average error: ";
PRINT USING "##.##"; totdiff / nbmatch;
PRINT "   (elo diff = ";
PRINT USING "###"; totdiff / nbmatch * 7!;
PRINT ")"
PRINT "Prob( error > ";
PRINT USING "##.##"; limit;
PRINT "% ) = ";
PRINT "Prob( elo diff > ";
PRINT USING "###"; limit * 7!;
PRINT " ) = ";
PRINT USING "##.##"; nboverlimit / nbmatch * 100!;
PRINT "%"
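
For anyone who wants to rerun the experiment with other w/d/l probabilities, here is a loose Python port of the program above. It is a sketch, not a line-for-line translation. The function name simulate and its parameters are made up for illustration, and the error is measured against the expected score for the chosen probabilities, which is 50% with the 1/3-1/3-1/3 defaults.

```python
import random


def simulate(n_matches, n_games, limit, p_win=1 / 3, p_draw=1 / 3, seed=None):
    """Simulate matches and summarise how far the score strays from its expectation.

    limit is in percentage points of score, as in the QBasic program.
    Assumption: games are independent with fixed win/draw/loss chances.
    """
    rng = random.Random(seed)
    expected = 100.0 * (p_win + 0.5 * p_draw)  # expected score in percent
    total_error = 0.0
    max_error = 0.0
    over_limit = 0

    for _ in range(n_matches):
        score = 0.0  # score of player A
        for _ in range(n_games):
            r = rng.random()
            if r < p_win:
                score += 1.0      # player A wins
            elif r < p_win + p_draw:
                score += 0.5      # draw
            # otherwise player A loses: score unchanged
        error = abs(100.0 * score / n_games - expected)
        total_error += error
        max_error = max(max_error, error)
        if error > limit:
            over_limit += 1

    return {
        "mean_error": total_error / n_matches,
        "max_error": max_error,
        "prob_over_limit": over_limit / n_matches,
    }


if __name__ == "__main__":
    result = simulate(n_matches=20000, n_games=100, limit=6.5, seed=2003)
    print(f"average error: {result['mean_error']:.2f}%")
    print(f"maximum error: {result['max_error']:.2f}%")
    print(f"Prob(error > 6.5%) = {100 * result['prob_over_limit']:.2f}%")
```

Passing other values for p_win and p_draw, for example 0.4 and 0.3, shows how sensitive the numbers are to the 1/3-1/3-1/3 assumption.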



Good luck.



    Christophe



Last modified: Thu, 15 Apr 21 08:11:13 -0700
