Author: Greg Simpson
Date: 12:37:36 09/06/05
On September 05, 2005 at 04:21:08, Fabien Letouzey wrote:
>On September 04, 2005 at 04:29:33, Greg Simpson wrote:
>
>>Likelihood of superiority:
>>                    To  Fr  Fr  Th  Ch  Ju  Sh  Hi  Sp  Ga  Ru  Li  To  Ar
>>Toga II 1.0             69  81  88  89  95  99  99  98  98  99  99  97  99
>>Fritz 8             30      53  68  69  81  91  93  92  92  96  97  93  99
>>Fruit 2.1           18  46      71  71  85  95  96  94  94  98  99  96  99
>>The King 3.33       11  31  28      50  65  80  84  85  85  90  93  87  96
>>Chess Tiger 15.0    10  30  28  49      65  80  84  85  85  90  93  87  96
>>Junior 9             4  18  14  34  34      67  72  75  75  81  85  80  92
>>Shredder 9           0   8   4  19  19  32      56  61  61  68  73  69  83
>>Hiarcs 9             0   6   3  15  15  27  43      56  56  62  67  65  79
>>Spike 1.0 Mainz      1   7   5  14  14  24  38  43      50  55  59  59  71
>>Gandalf 6.0          1   7   5  14  14  24  38  43  49      54  59  59  71
>>Ruffian 2.1.0        0   3   1   9   9  18  31  37  44  45      55  56  69
>>List 512             0   2   0   6   6  14  26  32  40  40  44      52  64
>>Toga II 0.93         2   6   3  12  12  19  30  34  40  40  43  47      58
>>Aristarch 4.50       0   0   0   3   3   7  16  20  28  28  30  35  41
>>
>>I'm not too certain how much to trust the second one.
>
>Hi,
>
>Here's what I do.
>
>You're interested in knowing whether Toga is actually stronger than Fruit,
>right? Look up the corresponding number in the "los" table: that's 81%.
>
>It's up to you which value is big enough for "statistical proof". I use 95. In
>this case it means there is not enough data to conclude that Toga is stronger.
>Maybe it is (in that case more games would tell), maybe it isn't, we still don't
>know.
>
>Important side note: you need to decide which "los" entry you're going to read
>BEFORE looking at the table. Scanning the table and finding a large number
>won't do, as this would violate the statistical assumptions. That's the same
>problem as entering 10 engine versions in a tournament and only looking at the
>one with best result.
>
>Fabien.
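Fabien's side note about deciding which entry to read in advance is easy to
demonstrate with a quick simulation. This is only a sketch under assumptions of
my own: it uses the common normal-approximation LOS formula
Phi((W-L)/sqrt(W+L)), which is probably not exactly what produced the tables
above, and the engine count, game count, and trial count are made up.

```python
import math
import random

def los(wins, losses):
    # Normal-approximation likelihood of superiority; draws are ignored.
    return 0.5 * (1 + math.erf((wins - losses) / math.sqrt(2 * (wins + losses))))

random.seed(1)
trials = 2000
hits = 0
for _ in range(trials):
    # Ten equal-strength "versions", each playing 100 decisive games
    # against a common reference opponent (every game a coin flip).
    best = 0.0
    for _ in range(10):
        wins = sum(random.random() < 0.5 for _ in range(100))
        best = max(best, los(wins, 100 - wins))
    if best >= 0.95:
        hits += 1

# With ten entrants, the best scorer clears the 95% bar far more often
# than 5% of the time, even though all engines are identical in strength.
print(f"best entry reached 95% LOS in {hits / trials:.0%} of tournaments")
```

Scanning the table for the biggest number is exactly this: you are implicitly
taking the maximum over many comparisons, so a single 95% entry proves much
less than it appears to.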
I do understand confidence statistics, and I agree the above results are not
strong enough to establish a difference between Fruit and Toga. It doesn't
even convince me that Toga is better than Shredder: other results make me think
that the Shredder results are a huge statistical outlier.
I was saying I didn't know how much to trust the los table because the results
of some tests just don't feel right to me. For instance, consider the following
small fake tournament:
Ratings
Rank Name           Elo    +    -  games  score  draws
   1 Gerg Nospmis  1141  319  176      4    75%     0%
   2 Greg Simpson   858  176  319      4    25%     0%
Likelihood of superiority
               Ge  Gr
Gerg Nospmis       91
Greg Simpson    8
First of all, 91% for a 3-1 result seems too high to me. But taking it at face
value, I would expect a second tournament with the same result to roughly
multiply the 8% chance, leaving a 0.64% chance that Greg is better. Instead,
if I just duplicate all the games I get:
Ratings
Rank Name           Elo    +    -  games  score  draws
   1 Gerg Nospmis  1141  189  126      8    75%     0%
   2 Greg Simpson   858  126  189      8    25%     0%
Likelihood of superiority
               Ge  Gr
Gerg Nospmis       97
Greg Simpson    2
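For comparison, one common back-of-the-envelope formula puts LOS at
Phi((W-L)/sqrt(W+L)), Phi being the standard normal CDF. A minimal sketch
under that assumption (whatever tool produced the tables above may well use a
different, e.g. Bayesian, model, so its numbers need not match):

```python
import math

def los(wins, losses):
    # Likelihood of superiority via the normal approximation
    # Phi((W - L) / sqrt(W + L)); draws are ignored.
    return 0.5 * (1 + math.erf((wins - losses) / math.sqrt(2 * (wins + losses))))

print(round(los(3, 1) * 100))  # the 3-1 result -> 84
print(round(los(6, 2) * 100))  # the same games duplicated -> 92
```

Under this approximation the z-score grows with the square root of the game
count, so doubling the data moves 84% to 92% rather than squaring the
remaining doubt, which is at least qualitatively the behavior in the tables
above.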
I don't have enough confidence in my own intuition to say this is definitely
wrong, but I'm uncomfortable enough that I didn't want to just post the table
and leave the impression that I had full confidence in it. I obviously wasn't
clear enough about what I was uncertain of.
Actually, my own testing at 2'/40 on an Athlon 1700+ stands like this:
Ratings
Rank Name                Elo   +   -  games  score  draws
   1 Toga II 1.0        2063  25  24    390    60%    38%
   2 Toga II 1.0 Beta2  2032  16  16    931    53%    39%
   3 Fruit 2.1          2019  14  13   1230    48%    39%
   4 Spike 1.0a Mainz   1969  45  46    112    41%    36%
   5 Ruffian 1.0.5      1914  30  31    283    33%    27%
Likelihood of superiority
                    To  To  Fr  Sp  Ru
Toga II 1.0             94  99  99  99
Toga II 1.0 Beta2    5      89  97  99
Fruit 2.1            0  10      94  99
Spike 1.0a Mainz     0   2   5      94
Ruffian 1.0.5        0   0   0   5
Given the similarities of Toga and Fruit I wouldn't expect the standings of the
two programs to change at longer time controls, although the ratings difference
would probably shrink.
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.