Computer Chess Club Archives

Subject: Re: Testing who is better, rating lists, and Chessmaster

Author: Dann Corbit

Date: 00:30:03 07/09/03

On July 09, 2003 at 03:19:09, Russell Reagan wrote:

>In a recent rec.games.chess.computer thread, a question was asked regarding what
>the absolute strongest settings for Chessmaster were. The poster said he
>discovered that the default settings were weaker by at least 97 Elo points and
>cited Surak's rating list (http://www.grailmaster.com/misc/chess/comp/cm.html).
>
>Surak's rating list contains only ratings of different Chessmaster versions and
>settings. Even with different settings, it's all the same engine, so I don't
>imagine the rating differences could realistically be almost 100 Elo points (at
>least not 100 points above what the author believes to be the strongest).
>
>To illustrate this point, I decided to play a little tournament between engines
>that had similar ratings (still in progress). Here are the results so far:
>
>                Score         1     2     3     4     5     6     7     8
>--------------------------------------------------------------------------
> 1: Engine G  19.0 / 30   XXXXX ==10. 1===. ===1. ==11. 01==1 11==. 110==
> 2: Engine D  16.5 / 30   ==01. XXXXX ====. ===== =11== =011. ==1=. 001=.
> 3: Engine F  15.0 / 30   0===. ====. XXXXX 0==1. =1=== 1=00. 1==== ==10.
> 4: Engine E  15.0 / 30   ===0. ===== 1==0. XXXXX =000. ===== ==11. ==11.
> 5: Engine C  15.0 / 30   ==00. =00== =0=== =111. XXXXX 1===. ==10. 0=11.
> 6: Engine B  14.0 / 30   10==0 =100. 0=11. ===== 0===. XXXXX =0=1. 01==.
> 7: Engine H  13.0 / 30   00==. ==0=. 0==== ==00. ==01. =1=0. XXXXX =11==
> 8: Engine A  12.5 / 30   001== 110=. ==01. ==00. 1=00. 10==. =00== XXXXX
>--------------------------------------------------------------------------
>120 games: +30 =69 -21
>
>    Program       Elo    +   -   Games   Score   Av.Op.  Draws
>  1 Engine G    : 2582  114  74    30    63.3 %   2487   53.3 %

2582 - 74 = 2508

>  2 Engine D    : 2531  127  61    30    55.0 %   2496   63.3 %
>  3 Engine C    : 2501   90  90    30    50.0 %   2501   53.3 %
>  4 Engine E    : 2500   76  76    30    50.0 %   2500   66.7 %
>  5 Engine F    : 2499   76  76    30    50.0 %   2499   66.7 %
>  6 Engine B    : 2482   73 130    30    46.7 %   2505   53.3 %
>  7 Engine H    : 2457   65 124    30    43.3 %   2504   60.0 %
>  8 Engine A    : 2449   87 122    30    41.7 %   2508   43.3 %

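2449 + 87 = 2536

The upper bound for Engine A is above the lower bound for Engine G, so the error
bars of the top and bottom entries overlap.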
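
The two sums above are the lower bound of the top entry and the upper bound of
the bottom entry. A minimal sketch of that comparison in Python follows. The
ratings and margins are copied from the quoted table; the helper names
(interval, overlaps) are made up for illustration and do not come from any
rating tool.

```python
# Elo estimates and asymmetric error margins copied from the quoted rating table.
engine_g = {"elo": 2582, "plus": 114, "minus": 74}
engine_a = {"elo": 2449, "plus": 87, "minus": 122}


def interval(entry):
    """Return the (low, high) bounds of a rating estimate."""
    return entry["elo"] - entry["minus"], entry["elo"] + entry["plus"]


def overlaps(first, second):
    """True when the two closed intervals share at least one point."""
    low_1, high_1 = interval(first)
    low_2, high_2 = interval(second)
    return low_1 <= high_2 and low_2 <= high_1


g_low, g_high = interval(engine_g)
a_low, a_high = interval(engine_a)
print(f"Engine G: {g_low}..{g_high}")
print(f"Engine A: {a_low}..{a_high}")
print("error bars overlap:", overlaps(engine_g, engine_a))
```

When the intervals overlap like this, the table gives no grounds for saying the
first entry is stronger than the last.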

>
>This indicates a rating difference of 133 Elo points. The funny thing is, every
>engine is Crafty, the exact same binary, using the exact same settings. If this
>kind of testing can produce a difference of 133 rating points between copies of
>the exact same engine, what does that say about a mere 97 rating point
>difference between the different Chessmaster settings?

About what one would expect, depending upon the number of games that have been
played.
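
To get a feel for how much spread pure chance produces, here is a rough Monte
Carlo sketch in Python. Everything in it is an assumption made for
illustration: eight identical engines, 4 games per pairing (28 games each,
close to the 30 in the quoted table), a draw rate of 57.5% (69 draws in 120
games), the remaining games split evenly, and ratings taken from the plain
logistic Elo formula rather than from whatever program produced the quoted
list.

```python
import math
import random


def performance_elo_diff(score_fraction):
    """Elo difference implied by a score fraction under the logistic Elo model."""
    # Clamp so a 0% or 100% score does not produce an infinite rating.
    p = min(max(score_fraction, 0.01), 0.99)
    return 400.0 * math.log10(p / (1.0 - p))


def simulate_spread(rng, engines=8, games_per_pair=4, draw_rate=0.575):
    """Play one round robin between identical engines; return max - min Elo."""
    scores = [0.0] * engines
    games = [0] * engines
    win_cutoff = draw_rate + (1.0 - draw_rate) / 2.0
    for i in range(engines):
        for j in range(i + 1, engines):
            for _ in range(games_per_pair):
                roll = rng.random()
                if roll < draw_rate:      # draw: half a point each
                    scores[i] += 0.5
                    scores[j] += 0.5
                elif roll < win_cutoff:   # engine i wins
                    scores[i] += 1.0
                else:                     # engine j wins
                    scores[j] += 1.0
                games[i] += 1
                games[j] += 1
    ratings = [performance_elo_diff(s / g) for s, g in zip(scores, games)]
    return max(ratings) - min(ratings)


rng = random.Random(2003)  # fixed seed so repeated runs give the same numbers
spreads = sorted(simulate_spread(rng) for _ in range(2000))
median = spreads[len(spreads) // 2]
share_100 = sum(s >= 100 for s in spreads) / len(spreads)
print(f"median top-to-bottom spread: {median:.0f} Elo")
print(f"share of tournaments with spread >= 100 Elo: {share_100:.2f}")
```

Raising games_per_pair shows how the spread shrinks as more games are played.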

>This tells me that when testing an improvement to an engine, you shouldn't use
>head to head results as a good indicator of whether or not the new version is
>actually an improvement. Thoughts?

The worst possible opponent is one that is exactly your strength.  There, the
random walk effect is multiplied.

The best possible opponents are a lot stronger or a lot weaker, but not
overwhelmingly so.

So if you win 10% of the points in a long match or win 90% of the points in a
long match, then you have a good indication.
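
One possible reading of this point is that the scatter in the observed score
is largest when the expected score is 50% and smaller near 10% or 90%. The
Python sketch below uses the binomial standard error of the score fraction. It
assumes independent games with no draws, which is a simplification and not
something stated above, and the function name is made up for illustration.

```python
import math


def score_standard_error(expected_score, games):
    """Standard error of the observed score fraction for independent decisive games."""
    return math.sqrt(expected_score * (1.0 - expected_score) / games)


for expected in (0.10, 0.50, 0.90):
    se = score_standard_error(expected, games=100)
    print(f"expected score {expected:.0%}: standard error {se:.3f}")
```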

>Also, what does this say about the controversial issue surrounding whether or
>not the default settings of Chessmaster are indeed the strongest? It would seem
>that some other form of testing would be needed to demonstrate that, aside from
>playing a plethora of Chessmaster versions against one another. Maybe holding a
>tournament between default Chessmaster and a number of other strong engines
>(Fritz, Shredder, etc.), and then holding a second tournament between the
>proposed "better" Chessmaster settings and the same set of strong engines (as
>Kurt Utzinger is doing now).

I don't see a better way to find out than the contests that people are running.

Once the error bars say that one cluster of settings is better, then we can
believe it.

Last modified: Thu, 15 Apr 21 08:11:13 -0700
