Author: Osorio Meirelles
Date: 13:46:48 07/09/03
I agree! One way to make the result unbiased is to play two tournaments:
1) Use the very same engine with the default settings and see what the natural
dispersion is. Let's say the best copy got a performance of 60%, which means
an additional Elo performance of around 400*log(60%/40%) (log base 10).
2) Use different settings, play a tournament, and take the result of the
best engine. Let's say we got 70%, which means an additional Elo performance,
compared to the average Elo of the settings, of
400*log(70%/30%).
Assuming that the average Elo of the different settings is the same as the
default setting, the real improvement would be
400*log(70%/30%) - 400*log(60%/40%)
In case we know that the average performance of the different settings
is 10 rating points above the default setting, then the real improvement would
be:
400*log(70%/30%) - 400*log(60%/40%) + 10
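The correction arithmetic above can be sketched in a few lines. This is a rough sketch using the assumptions stated in the post: 60% and 70% scores, and the common 400*log10(s/(1-s)) performance approximation.

```python
import math

def elo_from_score(score):
    """Elo performance above the field average for a given score fraction,
    using the common approximation 400 * log10(s / (1 - s))."""
    return 400 * math.log10(score / (1 - score))

# Tournament 1: identical default-setting copies; best copy scores 60%.
dispersion = elo_from_score(0.60)   # natural dispersion, about +70 Elo

# Tournament 2: varied settings; best setting scores 70%.
raw_gain = elo_from_score(0.70)     # about +147 Elo above the field

# Corrected improvement over the default, assuming both fields have the
# same average Elo (add an offset such as +10 if they do not).
corrected = raw_gain - dispersion
print(round(corrected, 1))          # → 76.8
```

So even a 70% tournament win translates to well under 100 real Elo points once the 60% "lucky copy" baseline is subtracted.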
This way we can correct for the natural dispersion that happens when we play a
tournament. Even though we get an adjusted value in Elo points above the
default setting, there is still a small probability that there is no difference
between the two, especially if the number of games played is not very large.
Another way to check this is to play a long match between
the best setting and the default setting and then verify the additional Elo
points. I doubt that it will be as good as the Elo found in the tournament with
different settings.
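To see how large that natural dispersion can be, here is a rough simulation of a round robin between identical copies of one engine. The parameters are hypothetical (8 copies, a fixed draw probability, otherwise a coin flip per game); any score spread it produces is pure chance, which is the point.

```python
import random

def simulate_round_robin(n_engines=8, rounds=4, draw_prob=0.6, seed=1):
    """Round robin between identical engines: each game is a draw with
    probability draw_prob, otherwise a 50/50 coin flip. Returns each
    engine's score fraction; any spread is purely statistical noise."""
    rng = random.Random(seed)
    scores = [0.0] * n_engines
    games = [0] * n_engines
    for _ in range(rounds):
        for i in range(n_engines):
            for j in range(i + 1, n_engines):
                r = rng.random()
                if r < draw_prob:
                    scores[i] += 0.5
                    scores[j] += 0.5
                elif r < draw_prob + (1 - draw_prob) / 2:
                    scores[i] += 1.0
                else:
                    scores[j] += 1.0
                games[i] += 1
                games[j] += 1
    return [s / g for s, g in zip(scores, games)]

fractions = sorted(simulate_round_robin(), reverse=True)
print([round(f, 3) for f in fractions])
```

Running this for many seeds typically shows the "best" identical copy scoring well above 50%, exactly the kind of spread the Crafty-vs-Crafty tournament quoted below exhibits.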
On July 09, 2003 at 03:19:09, Russell Reagan wrote:
>In a recent rec.games.chess.computer thread, a question was asked regarding what the
>absolute strongest settings for Chessmaster were. The poster said he discovered
>that the default settings were weaker by at least 97 ELO points and cited
>Surak's rating list (http://www.grailmaster.com/misc/chess/comp/cm.html).
>
>Surak's rating list contains only ratings of different Chessmaster versions and
>settings. Even with different settings, it's all the same engine, so I don't
>imagine the rating differences could realistically be almost 100 ELO points (at
>least not 100 points above what the author believes to be the strongest).
>
>To illustrate this point, I decided to play a little tournament between engines
>that had similar ratings (still in progress). Here are the results, so far:
>
> Score 1 2 3 4 5 6 7 8
>--------------------------------------------------------------------------
> 1: Engine G 19.0 / 30 XXXXX ==10. 1===. ===1. ==11. 01==1 11==. 110==
> 2: Engine D 16.5 / 30 ==01. XXXXX ====. ===== =11== =011. ==1=. 001=.
> 3: Engine F 15.0 / 30 0===. ====. XXXXX 0==1. =1=== 1=00. 1==== ==10.
> 4: Engine E 15.0 / 30 ===0. ===== 1==0. XXXXX =000. ===== ==11. ==11.
> 5: Engine C 15.0 / 30 ==00. =00== =0=== =111. XXXXX 1===. ==10. 0=11.
> 6: Engine B 14.0 / 30 10==0 =100. 0=11. ===== 0===. XXXXX =0=1. 01==.
> 7: Engine H 13.0 / 30 00==. ==0=. 0==== ==00. ==01. =1=0. XXXXX =11==
> 8: Engine A 12.5 / 30 001== 110=. ==01. ==00. 1=00. 10==. =00== XXXXX
>--------------------------------------------------------------------------
>120 games: +30 =69 -21
>
> Program Elo + - Games Score Av.Op. Draws
> 1 Engine G : 2582 114 74 30 63.3 % 2487 53.3 %
> 2 Engine D : 2531 127 61 30 55.0 % 2496 63.3 %
> 3 Engine C : 2501 90 90 30 50.0 % 2501 53.3 %
> 4 Engine E : 2500 76 76 30 50.0 % 2500 66.7 %
> 5 Engine F : 2499 76 76 30 50.0 % 2499 66.7 %
> 6 Engine B : 2482 73 130 30 46.7 % 2505 53.3 %
> 7 Engine H : 2457 65 124 30 43.3 % 2504 60.0 %
> 8 Engine A : 2449 87 122 30 41.7 % 2508 43.3 %
>
>This indicates a rating difference of 133 ELO points. The funny thing is, every
>engine is Crafty, the exact same binary, using the exact same settings. If this
>kind of testing can produce a difference of 133 rating points between the exact
>same engine, what does that say about a mere 97 rating point difference between
>the different Chessmaster settings?
>
>This tells me that when testing an improvement to an engine, you shouldn't rely
>on head-to-head results as a good indicator of whether or not the new version is
>actually an improvement. Thoughts?
>
>Also, what does this say about the controversial issue surrounding whether or
>not the default settings of Chessmaster are indeed the strongest? It would seem
>that some other form of testing would be needed to demonstrate that, aside from
>playing a plethora of Chessmaster versions against one another. Maybe holding a
>tournament between default Chessmaster and a number of other strong engines
>(Fritz, Shredder, etc.), and then holding a second tournament between the
>proposed "better" Chessmaster settings and the same set of strong engines (as
>Kurt Utzinger is doing now).