Author: Greg Simpson
Date: 12:37:36 09/06/05
On September 05, 2005 at 04:21:08, Fabien Letouzey wrote:

>On September 04, 2005 at 04:29:33, Greg Simpson wrote:
>
>>Likelihood of superiority:
>>                    To  Fr  Fr  Th  Ch  Ju  Sh  Hi  Sp  Ga  Ru  Li  To  Ar
>>Toga II 1.0             69  81  88  89  95  99  99  98  98  99  99  97  99
>>Fritz 8             30      53  68  69  81  91  93  92  92  96  97  93  99
>>Fruit 2.1           18  46      71  71  85  95  96  94  94  98  99  96  99
>>The King 3.33       11  31  28      50  65  80  84  85  85  90  93  87  96
>>Chess Tiger 15.0    10  30  28  49      65  80  84  85  85  90  93  87  96
>>Junior 9             4  18  14  34  34      67  72  75  75  81  85  80  92
>>Shredder 9           0   8   4  19  19  32      56  61  61  68  73  69  83
>>Hiarcs 9             0   6   3  15  15  27  43      56  56  62  67  65  79
>>Spike 1.0 Mainz      1   7   5  14  14  24  38  43      50  55  59  59  71
>>Gandalf 6.0          1   7   5  14  14  24  38  43  49      54  59  59  71
>>Ruffian 2.1.0        0   3   1   9   9  18  31  37  44  45      55  56  69
>>List 512             0   2   0   6   6  14  26  32  40  40  44      52  64
>>Toga II 0.93         2   6   3  12  12  19  30  34  40  40  43  47      58
>>Aristarch 4.50       0   0   0   3   3   7  16  20  28  28  30  35  41
>>
>>I'm not too certain how much to trust the second one.
>
>Hi,
>
>Here's what I do.
>
>You're interested in knowing whether Toga is actually stronger than Fruit,
>right? Look up the corresponding number in the "los" table; that's 81%.
>
>It's up to you which value is big enough for "statistical proof". I use 95. In
>this case it means there is not enough data to conclude that Toga is stronger.
>Maybe it is (in that case more games would tell), maybe it isn't; we still
>don't know.
>
>Important side note: you need to decide which "los" entry you're going to read
>BEFORE looking at the table. Scanning the table and finding a large number
>won't do, as this would violate the statistical assumptions. That's the same
>problem as entering 10 engine versions in a tournament and only looking at the
>one with the best result.
>
>Fabien.

I do understand confidence statistics, and I agree that the above results are not strong enough to establish a difference between Fruit and Toga.
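For anyone who wants a rough "los" number for a single head-to-head match without running BayesElo: a common approximation treats the win/loss difference as Gaussian and ignores draws. This is a minimal sketch of that approximation, not BayesElo's actual computation (which works through its Elo likelihood model), so it will not reproduce the table above exactly.

```python
import math

def los(wins, losses):
    """Likelihood of superiority via the normal approximation:
    the win/loss difference is treated as Gaussian with variance
    wins + losses, and draws are ignored."""
    if wins + losses == 0:
        return 0.5  # no decisive games, no information
    z = (wins - losses) / math.sqrt(wins + losses)
    # Phi(z), the standard normal CDF, expressed through erf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Example: a 3-1 head-to-head record
print(round(los(3, 1), 3))  # 0.841
```

Note that the approximation is symmetric: los(a, b) + los(b, a) = 1, which is why each pair of mirrored entries in the table sums to roughly 100.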
It doesn't even convince me that Toga is better than Shredder: other results make me think that the Shredder results are a huge statistical outlier.

I was saying I didn't know how much to trust the los table because the results of some tests just don't feel right to me. For instance, consider the following small fake tournament:

Ratings

Rank Name             Elo    +    - games score draws
   1 Gerg Nospmis    1141  319  176     4   75%    0%
   2 Greg Simpson     858  176  319     4   25%    0%

Likelihood of superiority
               Ge  Gr
Gerg Nospmis       91
Greg Simpson    8

First of all, 91% for a 3-1 result seems too high to me. If I did believe it, though, I would expect another tournament with the same result to roughly multiply the 8% chance, giving a 0.64% chance that Greg is better. Instead, if I just duplicate all the games, I get:

Ratings

Rank Name             Elo    +    - games score draws
   1 Gerg Nospmis    1141  189  126     8   75%    0%
   2 Greg Simpson     858  126  189     8   25%    0%

Likelihood of superiority
               Ge  Gr
Gerg Nospmis       97
Greg Simpson    2

I don't have enough confidence in my own intuition to say this is definitely wrong, but I'm uncomfortable enough that I didn't want to just post the table and leave the impression that I had full confidence in it. I obviously wasn't clear enough about what I was uncertain of.

Actually, my own testing at 2'/40 on an Athlon 1700+ stands like this:

Ratings

Rank Name                Elo    +    - games score draws
   1 Toga II 1.0         2063   25   24   390   60%   38%
   2 Toga II 1.0 Beta2   2032   16   16   931   53%   39%
   3 Fruit 2.1           2019   14   13  1230   48%   39%
   4 Spike 1.0a Mainz    1969   45   46   112   41%   36%
   5 Ruffian 1.0.5       1914   30   31   283   33%   27%

Likelihood of superiority
                    To  To  Fr  Sp  Ru
Toga II 1.0             94  99  99  99
Toga II 1.0 Beta2    5      89  97  99
Fruit 2.1            0  10      94  99
Spike 1.0a Mainz     0   2   5      94
Ruffian 1.0.5        0   0   0   5

Given the similarities of Toga and Fruit, I wouldn't expect the standings of the two programs to change at longer time controls, although the rating difference would probably shrink.
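The intuition about duplicating games can be checked numerically. Under the normal approximation sketched below (a stand-in for BayesElo's exact method, so the exact values differ from the 91/97 pair above), doubling every game count scales the z-score by only sqrt(2), so the weaker side's tail probability shrinks far more slowly than multiplying independent tails would suggest; qualitatively this matches what BayesElo reports.

```python
import math

def los(wins, losses):
    # Normal-approximation LOS; draws ignored. A hypothetical
    # helper, not BayesElo's exact computation.
    z = (wins - losses) / math.sqrt(wins + losses)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

single = los(3, 1)   # one 3-1 match
doubled = los(6, 2)  # the same games, every one duplicated
print(round(single, 3), round(doubled, 3))  # 0.841 0.921

# The weaker side's tail shrinks from ~0.159 to ~0.079 -- much less
# than squaring it (0.159**2 ~ 0.025) would predict, because the
# duplicated games carry no independent information.
print(round(1 - doubled, 3))  # 0.079
```

So the table's behavior is at least internally consistent: duplicated games are not independent evidence, and no sound method should treat them as such.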
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.