Computer Chess Club Archives


Subject: Re: Testing who is better, rating lists, and Chessmaster

Author: Dann Corbit

Date: 00:30:03 07/09/03



On July 09, 2003 at 03:19:09, Russell Reagan wrote:

>In recent rec.games.chess.computer, a question was asked regarding what the
>absolute strongest settings for Chessmaster were. The poster said he discovered
>that the default settings were weaker by at least 97 ELO points and cited
>Surak's rating list (http://www.grailmaster.com/misc/chess/comp/cm.html).
>
>Surak's rating list contains only ratings of different Chessmaster versions and
>settings. Even with different settings, it's all the same engine, so I don't
>imagine the rating differences could realistically be almost 100 ELO points (at
>least not 100 points above what the author believes to be the strongest).
>
>To illustrate this point, I decided to play a little tournament between engines
>that had similar ratings (still in progress). Here are the results, so far:
>
>                Score         1     2     3     4     5     6     7     8
>--------------------------------------------------------------------------
> 1: Engine G  19.0 / 30   XXXXX ==10. 1===. ===1. ==11. 01==1 11==. 110==
> 2: Engine D  16.5 / 30   ==01. XXXXX ====. ===== =11== =011. ==1=. 001=.
> 3: Engine F  15.0 / 30   0===. ====. XXXXX 0==1. =1=== 1=00. 1==== ==10.
> 4: Engine E  15.0 / 30   ===0. ===== 1==0. XXXXX =000. ===== ==11. ==11.
> 5: Engine C  15.0 / 30   ==00. =00== =0=== =111. XXXXX 1===. ==10. 0=11.
> 6: Engine B  14.0 / 30   10==0 =100. 0=11. ===== 0===. XXXXX =0=1. 01==.
> 7: Engine H  13.0 / 30   00==. ==0=. 0==== ==00. ==01. =1=0. XXXXX =11==
> 8: Engine A  12.5 / 30   001== 110=. ==01. ==00. 1=00. 10==. =00== XXXXX
>--------------------------------------------------------------------------
>120 games: +30 =69 -21
>
>    Program       Elo    +   -   Games   Score   Av.Op.  Draws
>  1 Engine G    : 2582  114  74    30    63.3 %   2487   53.3 %

2582 - 74 = 2508

>  2 Engine D    : 2531  127  61    30    55.0 %   2496   63.3 %
>  3 Engine C    : 2501   90  90    30    50.0 %   2501   53.3 %
>  4 Engine E    : 2500   76  76    30    50.0 %   2500   66.7 %
>  5 Engine F    : 2499   76  76    30    50.0 %   2499   66.7 %
>  6 Engine B    : 2482   73 130    30    46.7 %   2505   53.3 %
>  7 Engine H    : 2457   65 124    30    43.3 %   2504   60.0 %
>  8 Engine A    : 2449   87 122    30    41.7 %   2508   43.3 %

2449 + 87 = 2536
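
The two sums above are the lower end of the top engine's interval and the upper end of the bottom engine's. A minimal sketch in Python of that overlap check, using the Elo, + and - columns from the rating table (the function name is hypothetical, and the error bars are taken as printed by the rating tool):

```python
def intervals_overlap(elo_top, minus_top, elo_bottom, plus_bottom):
    """Return (lower bound of top engine, upper bound of bottom engine, overlap?)."""
    lower_of_top = elo_top - minus_top
    upper_of_bottom = elo_bottom + plus_bottom
    # If the bottom engine's upper bound reaches the top engine's lower bound,
    # the error bars do not separate the two engines.
    return lower_of_top, upper_of_bottom, lower_of_top <= upper_of_bottom


# Engine G (first place) and Engine A (last place) from the table.
low_g, high_a, overlap = intervals_overlap(2582, 74, 2449, 87)
print(low_g, high_a, overlap)
```

When the last value is true, the first-place and last-place intervals overlap, so the list by itself does not show the two to be different in strength.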

>
>This indicates a rating difference of 133 ELO points. The funny thing is, every
>engine is Crafty, the exact same binary, using the exact same settings. If this
>kind of testing can produce a difference of 133 rating points between the exact
>same engine, what does that say about a mere 97 rating point difference between
>the different Chessmaster settings?

That is about what one would expect; how much spread you see depends on the
number of games that have been played.
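
To put a rough number on that, here is a back-of-the-envelope sketch in Python. It assumes the games are independent, that the engine is truly equal to its opponents, that the draw rate is the 69/120 from the tournament summary, and that the usual logistic Elo formula applies. It ignores the white/black asymmetry. The function names are made up for illustration, and this is not how the rating tool computes its own error bars.

```python
import math


def elo_from_score(p):
    """Elo difference implied by a score fraction p (logistic Elo model)."""
    return -400.0 * math.log10(1.0 / p - 1.0)


def elo_margin_equal_engines(games, draw_rate, z=1.96):
    """Approximate Elo margin (z standard errors) for a truly equal engine."""
    # Each game scores 1, 0.5 or 0. With equal strength, a win and a loss each
    # have probability (1 - draw_rate) / 2, so the per-game variance of the
    # score is (1 - draw_rate) / 4.
    std_err = math.sqrt((1.0 - draw_rate) / 4.0 / games)
    return elo_from_score(0.5 + z * std_err)


margin = elo_margin_equal_engines(games=30, draw_rate=69 / 120)
print(f"about +/- {margin:.0f} Elo")
```

Under those assumptions a single engine with 30 games can land roughly 80 Elo either side of its true rating by chance alone, so a gap of that order between the top and bottom of eight identical engines is not surprising.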

>This tells me that when testing an improvement to an engine, you shouldn't use
>head to head results as a good indicator of whether or not the new version is
>actually an improvement. Thoughts?

The worst possible opponent is one that is exactly your strength.  There, the
random walk effect is multiplied.

The best possible opponents are a lot stronger or a lot weaker, but not
overwhelmingly so.

So if you win 10% of the points in a long match or win 90% of the points in a
long match, then you have a good indication.
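
For a sense of scale, a small Python sketch that converts those score percentages into rating gaps. It assumes the standard logistic Elo formula; `elo_gap` is a hypothetical helper name.

```python
import math


def elo_gap(score):
    """Elo difference implied by a score fraction under the logistic Elo model."""
    return 400.0 * math.log10(score / (1.0 - score))


for score in (0.10, 0.50, 0.90):
    print(f"{score:.0%} of the points -> {elo_gap(score):+.0f} Elo")
```

Under that formula, a 90% or 10% score corresponds to a gap of about 382 Elo in one direction or the other.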

>Also, what does this say about the controversial issue surrounding whether or
>not the default settings of Chessmaster are indeed the strongest? It would seem
>that some other form of testing would be needed to demonstrate that, aside from
>playing a plethora of Chessmaster versions against one another. Maybe holding a
>tournament between default Chessmaster and a number of other strong engines
>(Fritz, Shredder, etc.), and then holding a second tournament between the
>proposed "better" Chessmaster settings and the same set of strong engines (as
>Kurt Utzinger is doing now).

I don't see a better way to find out than the contests that people are running.

Once the error bars say that one cluster of settings is better, then we can
believe it.




Last modified: Thu, 15 Apr 21 08:11:13 -0700
