Author: Osorio Meirelles
Date: 13:46:48 07/09/03
I agree! One way to make the result unbiased is to play two tournaments:
1) Use the very same engine with the default settings and see what the natural
dispersion is. Let's say the best copy got a performance of 60%, which means
an additional Elo performance of around 400*log(60%/40%) (log base 10).
2) Use different settings, play a tournament, and take the result of the
best engine. Let's say we got 70%, which means an additional Elo performance,
compared to the average Elo of the settings, of
400*log(70%/30%).
Assuming that the average Elo of the different settings is the same as the
default setting, the real improvement would be
400*log(70%/30%) - 400*log(60%/40%)
In case we know that the average performance of the different settings
is 10 rating points above the default setting, then the real improvement would
be:
400*log(70%/30%) - 400*log(60%/40%) + 10
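The correction arithmetic above can be sketched in a few lines. This is a rough sketch using the assumptions stated in the post: 60% and 70% scores, and the common 400*log10(s/(1-s)) performance approximation.

```python
import math

def elo_from_score(score):
    """Elo performance above the field average for a given score fraction,
    using the common approximation 400 * log10(s / (1 - s))."""
    return 400 * math.log10(score / (1 - score))

# Tournament 1: identical default-setting copies; best copy scores 60%.
dispersion = elo_from_score(0.60)   # natural dispersion, about +70 Elo

# Tournament 2: varied settings; best setting scores 70%.
raw_gain = elo_from_score(0.70)     # about +147 Elo above the field

# Corrected improvement over the default, assuming both fields have the
# same average Elo (add an offset such as +10 if they do not).
corrected = raw_gain - dispersion
print(round(corrected, 1))          # → 76.8
```

So even a 70% tournament win translates to well under 100 real Elo points once the 60% "lucky copy" baseline is subtracted.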
This way we can correct for the natural dispersion that happens when we play a
tournament. Even though we get an adjusted value in Elo points above the
default setting, there is still a small probability that there is no difference
between the two, especially if the number of games played is not very large.
Another way to check this is to play a long match between
the best setting and the default setting and then verify the additional Elo
points. I doubt that it will be as good as the Elo found in the tournament with
different settings.
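To see how large that natural dispersion can be, here is a rough simulation of a round robin between identical copies of one engine. The parameters are hypothetical (8 copies, a fixed draw probability, otherwise a coin flip per game); any score spread it produces is pure chance, which is the point.

```python
import random

def simulate_round_robin(n_engines=8, rounds=4, draw_prob=0.6, seed=1):
    """Round robin between identical engines: each game is a draw with
    probability draw_prob, otherwise a 50/50 coin flip. Returns each
    engine's score fraction; any spread is purely statistical noise."""
    rng = random.Random(seed)
    scores = [0.0] * n_engines
    games = [0] * n_engines
    for _ in range(rounds):
        for i in range(n_engines):
            for j in range(i + 1, n_engines):
                r = rng.random()
                if r < draw_prob:
                    scores[i] += 0.5
                    scores[j] += 0.5
                elif r < draw_prob + (1 - draw_prob) / 2:
                    scores[i] += 1.0
                else:
                    scores[j] += 1.0
                games[i] += 1
                games[j] += 1
    return [s / g for s, g in zip(scores, games)]

fractions = sorted(simulate_round_robin(), reverse=True)
print([round(f, 3) for f in fractions])
```

Running this for many seeds typically shows the "best" identical copy scoring well above 50%, exactly the kind of spread the Crafty-vs-Crafty tournament quoted below exhibits.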
On July 09, 2003 at 03:19:09, Russell Reagan wrote:
>In a recent rec.games.chess.computer thread, a question was asked regarding what the
>absolute strongest settings for Chessmaster were. The poster said he discovered
>that the default settings were weaker by at least 97 ELO points and cited
>Surak's rating list (http://www.grailmaster.com/misc/chess/comp/cm.html).
>
>Surak's rating list contains only ratings of different Chessmaster versions and
>settings. Even with different settings, it's all the same engine, so I don't
>imagine the rating differences could realistically be almost 100 ELO points (at
>least not 100 points above what the author believes to be the strongest).
>
>To illustrate this point, I decided to play a little tournament between engines
>that had similar ratings (still in progress). Here are the results, so far:
>
> Score 1 2 3 4 5 6 7 8
>--------------------------------------------------------------------------
> 1: Engine G 19.0 / 30 XXXXX ==10. 1===. ===1. ==11. 01==1 11==. 110==
> 2: Engine D 16.5 / 30 ==01. XXXXX ====. ===== =11== =011. ==1=. 001=.
> 3: Engine F 15.0 / 30 0===. ====. XXXXX 0==1. =1=== 1=00. 1==== ==10.
> 4: Engine E 15.0 / 30 ===0. ===== 1==0. XXXXX =000. ===== ==11. ==11.
> 5: Engine C 15.0 / 30 ==00. =00== =0=== =111. XXXXX 1===. ==10. 0=11.
> 6: Engine B 14.0 / 30 10==0 =100. 0=11. ===== 0===. XXXXX =0=1. 01==.
> 7: Engine H 13.0 / 30 00==. ==0=. 0==== ==00. ==01. =1=0. XXXXX =11==
> 8: Engine A 12.5 / 30 001== 110=. ==01. ==00. 1=00. 10==. =00== XXXXX
>--------------------------------------------------------------------------
>120 games: +30 =69 -21
>
> Program Elo + - Games Score Av.Op. Draws
> 1 Engine G : 2582 114 74 30 63.3 % 2487 53.3 %
> 2 Engine D : 2531 127 61 30 55.0 % 2496 63.3 %
> 3 Engine C : 2501 90 90 30 50.0 % 2501 53.3 %
> 4 Engine E : 2500 76 76 30 50.0 % 2500 66.7 %
> 5 Engine F : 2499 76 76 30 50.0 % 2499 66.7 %
> 6 Engine B : 2482 73 130 30 46.7 % 2505 53.3 %
> 7 Engine H : 2457 65 124 30 43.3 % 2504 60.0 %
> 8 Engine A : 2449 87 122 30 41.7 % 2508 43.3 %
>
>This indicates a rating difference of 133 ELO points. The funny thing is, every
>engine is Crafty, the exact same binary, using the exact same settings. If this
>kind of testing can produce a difference of 133 rating points between the exact
>same engine, what does that say about a mere 97 rating point difference between
>the different Chessmaster settings?
>
>This tells me that when testing an improvement to an engine, you shouldn't rely
>on head-to-head results as a good indicator of whether or not the new version is
>actually an improvement. Thoughts?
>
>Also, what does this say about the controversial issue surrounding whether or
>not the default settings of Chessmaster are indeed the strongest? It would seem
>that some other form of testing would be needed to demonstrate that, aside from
>playing a plethora of Chessmaster versions against one another. Maybe holding a
>tournament between default Chessmaster and a number of other strong engines
>(Fritz, Shredder, etc.), and then holding a second tournament between the
>proposed "better" Chessmaster settings and the same set of strong engines (as
>Kurt Utzinger is doing now).