Author: Greg Simpson
Date: 12:37:36 09/06/05
On September 05, 2005 at 04:21:08, Fabien Letouzey wrote:
>On September 04, 2005 at 04:29:33, Greg Simpson wrote:
>
>>Likelihood of superiority:
>>                    To  Fr  Fr  Th  Ch  Ju  Sh  Hi  Sp  Ga  Ru  Li  To  Ar
>>Toga II 1.0             69  81  88  89  95  99  99  98  98  99  99  97  99
>>Fritz 8             30      53  68  69  81  91  93  92  92  96  97  93  99
>>Fruit 2.1           18  46      71  71  85  95  96  94  94  98  99  96  99
>>The King 3.33       11  31  28      50  65  80  84  85  85  90  93  87  96
>>Chess Tiger 15.0    10  30  28  49      65  80  84  85  85  90  93  87  96
>>Junior 9             4  18  14  34  34      67  72  75  75  81  85  80  92
>>Shredder 9           0   8   4  19  19  32      56  61  61  68  73  69  83
>>Hiarcs 9             0   6   3  15  15  27  43      56  56  62  67  65  79
>>Spike 1.0 Mainz      1   7   5  14  14  24  38  43      50  55  59  59  71
>>Gandalf 6.0          1   7   5  14  14  24  38  43  49      54  59  59  71
>>Ruffian 2.1.0        0   3   1   9   9  18  31  37  44  45      55  56  69
>>List 512             0   2   0   6   6  14  26  32  40  40  44      52  64
>>Toga II 0.93         2   6   3  12  12  19  30  34  40  40  43  47      58
>>Aristarch 4.50       0   0   0   3   3   7  16  20  28  28  30  35  41
>>
>>I'm not too certain how much to trust the second one.
>
>Hi,
>
>Here's what I do.
>
>You're interested in knowing whether Toga is actually stronger than Fruit,
>right? Look up the corresponding number in the "los" table: that's 81%.
>
>It's up to you which value is big enough for "statistical proof". I use 95. In
>this case it means there is not enough data to conclude that Toga is stronger.
>Maybe it is (in that case more games would tell), maybe it isn't, we still don't
>know.
>
>Important side note: you need to decide which "los" entry you're going to read
>BEFORE looking at the table. Scanning the table and finding a large number
>won't do, as this would violate the statistical assumptions. That's the same
>problem as entering 10 engine versions in a tournament and only looking at the
>one with best result.
>
>Fabien.
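Fabien's side note about deciding which entry to read in advance is easy to
demonstrate with a quick simulation. This is only a sketch under assumptions of
my own: it uses the common normal-approximation LOS formula
Phi((W-L)/sqrt(W+L)), which is probably not exactly what produced the tables
above, and the engine count, game count, and trial count are made up.

```python
import math
import random

def los(wins, losses):
    # Normal-approximation likelihood of superiority; draws are ignored.
    return 0.5 * (1 + math.erf((wins - losses) / math.sqrt(2 * (wins + losses))))

random.seed(1)
trials = 2000
hits = 0
for _ in range(trials):
    # Ten equal-strength "versions", each playing 100 decisive games
    # against a common reference opponent (every game a coin flip).
    best = 0.0
    for _ in range(10):
        wins = sum(random.random() < 0.5 for _ in range(100))
        best = max(best, los(wins, 100 - wins))
    if best >= 0.95:
        hits += 1

# With ten entrants, the best scorer clears the 95% bar far more often
# than 5% of the time, even though all engines are identical in strength.
print(f"best entry reached 95% LOS in {hits / trials:.0%} of tournaments")
```

Scanning the table for the biggest number is exactly this: you are implicitly
taking the maximum over many comparisons, so a single 95% entry proves much
less than it appears to.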
I do understand confidence statistics, and I agree the above results are not
strong enough to establish a difference between Fruit and Toga. It doesn't
even convince me that Toga is better than Shredder: other results make me think
that the Shredder results are a huge statistical outlier.
I was saying I didn't know how much to trust the los table because the results
of some tests just don't feel right to me. For instance, consider the following
small fake tournament:
Ratings
Rank Name           Elo    +    -  games  score  draws
   1 Gerg Nospmis  1141  319  176      4    75%     0%
   2 Greg Simpson   858  176  319      4    25%     0%
Likelihood of superiority
               Ge  Gr
Gerg Nospmis       91
Greg Simpson    8
First of all, 91% for a 3-1 result seems too high to me. But taking it at face
value, I would expect a second tournament with the same result to roughly
multiply the 8% chance, leaving a 0.64% chance that Greg is better. Instead,
if I just duplicate all the games I get:
Ratings
Rank Name           Elo    +    -  games  score  draws
   1 Gerg Nospmis  1141  189  126      8    75%     0%
   2 Greg Simpson   858  126  189      8    25%     0%
Likelihood of superiority
               Ge  Gr
Gerg Nospmis       97
Greg Simpson    2
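For comparison, one common back-of-the-envelope formula puts LOS at
Phi((W-L)/sqrt(W+L)), Phi being the standard normal CDF. A minimal sketch
under that assumption (whatever tool produced the tables above may well use a
different, e.g. Bayesian, model, so its numbers need not match):

```python
import math

def los(wins, losses):
    # Likelihood of superiority via the normal approximation
    # Phi((W - L) / sqrt(W + L)); draws are ignored.
    return 0.5 * (1 + math.erf((wins - losses) / math.sqrt(2 * (wins + losses))))

print(round(los(3, 1) * 100))  # the 3-1 result -> 84
print(round(los(6, 2) * 100))  # the same games duplicated -> 92
```

Under this approximation the z-score grows with the square root of the game
count, so doubling the data moves 84% to 92% rather than squaring the
remaining doubt, which is at least qualitatively the behavior in the tables
above.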
I don't have enough confidence in my own intuition to say this is definitely
wrong, but I'm uncomfortable enough that I didn't want to just post the table
and leave the impression that I had full confidence in it. I obviously wasn't
clear enough about what I was uncertain of.
Actually, my own testing at 2'/40 on an Athlon 1700+ stands like this:
Ratings
Rank Name                Elo   +   -  games  score  draws
   1 Toga II 1.0        2063  25  24    390    60%    38%
   2 Toga II 1.0 Beta2  2032  16  16    931    53%    39%
   3 Fruit 2.1          2019  14  13   1230    48%    39%
   4 Spike 1.0a Mainz   1969  45  46    112    41%    36%
   5 Ruffian 1.0.5      1914  30  31    283    33%    27%
Likelihood of superiority
                    To  To  Fr  Sp  Ru
Toga II 1.0             94  99  99  99
Toga II 1.0 Beta2    5      89  97  99
Fruit 2.1            0  10      94  99
Spike 1.0a Mainz     0   2   5      94
Ruffian 1.0.5        0   0   0   5
Given the similarities of Toga and Fruit I wouldn't expect the standings of the
two programs to change at longer time controls, although the ratings difference
would probably shrink.
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.