Computer Chess Club Archives


Subject: Re: Solution to experiment!

Author: Jeff Lischer

Date: 09:36:18 08/24/01



On August 24, 2001 at 10:34:01, José de Jesús García Ruvalcaba wrote:

>On August 24, 2001 at 10:16:48, Uri Blass wrote:
>
>>On August 24, 2001 at 10:06:32, Miguel A. Ballicora wrote:
>>
>>>On August 24, 2001 at 07:51:16, Uri Blass wrote:
>>>
>>>>On August 24, 2001 at 07:29:21, Günther Simon wrote:
>>>>
>>>>>On August 24, 2001 at 07:15:30, Uri Blass wrote:
>>>>>
>>>>>>On August 24, 2001 at 07:06:51, Uri Blass wrote:
>>>>>>
>>>>>>>Here are the results from the Elostat program.
>>>>>>>
>>>>>>>You can see that Shredder is only the 3rd-place micro based on performance.
>>>>>>>Shredder is the world micro champion by definition, but Tiger and Rebel had
>>>>>>>better performances.
>>>>>>>
>>>>>>>
>>>>>>>(columns: rank, program, performance rating, +/- error margins, games,
>>>>>>>score %, average opponent rating, draw %)
>>>>>>>
>>>>>>>1 Deep Junior 7                  : 2745  228 281     9    88.9 %   2384   22.2 %
>>>>>>>2 Quest (DeepFritz)              : 2550  266 169     9    66.7 %   2430   44.4 %
>>>>>>>3 Chess Tiger 14.6 Gambit Tiger  : 2499  291 229     9    55.6 %   2461   22.2 %
>>>>>>>4 Crafty 18.10X                  : 2467  291 165     9    55.6 %   2428   44.4 %
>>>>>>>5 Rebel                          : 2466  291 229     9    55.6 %   2428   22.2 %
>>>>>>>6 Shredder                       : 2466  266 249     9    66.7 %   2346   22.2 %
>>>>>>>7 Goliath                        : 2421  291 165     9    55.6 %   2382   44.4 %
>>>>>>>8 Gromit 3.9.5                   : 2364  278 201     9    61.1 %   2285   33.3 %
>>>>>>>9 Ferret                         : 2359  291 229     9    55.6 %   2320   22.2 %
>>>>>>>10 Gandalf 5.0                   : 2310  291 229     9    55.6 %   2271   22.2 %
>>>>>>>11 ParSOS                        : 2256  291 229     9    55.6 %   2217   22.2 %
>>>>>>>12 Diep                          : 2227  165 291     9    44.4 %   2265   44.4 %
>>>>>>>13 IsiChess X                    : 2166  201 278     9    38.9 %   2245   33.3 %
>>>>>>>14 Tao                           : 2165  229 291     9    44.4 %   2203   22.2 %
>>>>>>>15 Ruy Lopez                     : 2118  366 266     9    33.3 %   2238    0.0 %
>>>>>>>16 Pharaon                       : 2082  169 266     9    33.3 %   2202   44.4 %
>>>>>>>17 SpiderGirl                    : 2014  213 255     9    27.8 %   2180   33.3 %
>>>>>>>18 XiNiX                         : 1724  400 108     9     5.6 %   2216   11.1 %
>>>>>>>
>>>>>>>Congratulations also to the Deep Junior team for winning the event
>>>>>>>convincingly: the difference from second place is almost 200 Elo, and the
>>>>>>>hardware explains less than 70 Elo of that difference.
>>>>>>>
>>>>>>>Uri
>>>>>>
>>>>>>I can add that I think it may be a better idea to use Elostat to decide
>>>>>>the world champion in the future.
>>>>>>
>>>>>>I know that a lot of people are going to disagree, but it is my opinion.
>>>>>>I prefer a complicated method that does more justice over a simple method.
>>>>>>
>>>>>>Uri
>>>>>
>>>>>
>>>>>Sorry Uri - but this is really nonsense.
>>>>>You can't use ELO-Stat on a Swiss tournament with 9 rounds: as
>>>>>described by its author, ELO-Stat is designed to calculate
>>>>>ratings from a pool of programs with unknown ratings and a very
>>>>>large number of games.
>>>>>Therefore, if you take a closer look at your table, you will see that
>>>>>the error margin is at least 435 points (Pharaon) and up to 632 (Ruy Lopez)!
>>>>>And would you really believe Parallel SOS to be at 2256? :))
>>>>
>>>>The question is not which program is better;
>>>>competitions of 9 rounds are not supposed to answer this question.
>>>>
>>>>The question is which program achieved the better result,
>>>>and Elostat answers this question better than the ranking does.
>>>
>>>You forget tournament strategy. Many times you can adjust the contempt
>>>because you know that a draw is extremely convenient or will give you the
>>>title right away. Not to mention the selection of more or less aggressive
>>>opening books for a particular round. Sometimes a draw is the same as a loss,
>>>so you take risks. That throws away any significance of a performance Elo in
>>>a 9-round tournament. This also applies to any human tournament.
>>>
>>>You can also have the weird situation where you got 8.5/9 and the one with 8/9
>>>has a better Elo performance. You drew each other, but a couple of opponents
>>>that played the first-place finisher started to crash in many games afterwards
>>>because of last-minute changes in their code, etc. That was totally out of the
>>>winner's control.
>>
>>
>>I think that is not logical.
>>If you get 8.5/9, your result is not worse than that of a player who got 8/9
>>and drew against you.
>>
>>We look for a stable rating.
>>Suppose that you got 8.5/9, and suppose that the rating of the player you drew
>>with is better than your rating.
>>I can prove that your rating is not stable and is going to get bigger after
>>the tournament:
>>
>>You do not lose rating by winning 8 games, regardless of those opponents'
>>ratings, and you win rating by drawing one game against a player with a
>>better Elo rating, so the total result is that you gain rating.
>>
>>If Elostat allows a situation where 8/9 is rated better than 8.5/9, including
>>a draw between the two best players, then something is wrong with the Elostat
>>program.
>>
>>Uri
>
>Hi Uri,
>please try the following experiment with Elostat.
>1. Players A, B, and C play each other, with the following individual results:
>A beats B 99.5 to 0.5
>B beats C 99.5 to 0.5
>A beats C 100 to 0
>Which ratings do you get for A, B and C using Elostat?
>
>2. The same players, but with the following results:
>A beats B 99.5 to 0.5
>B beats C 99.5 to 0.5
>Same question as for part 1.
>
>If the program behaves correctly, the rating of A in part 1 should not be
>lower than the rating of A in part 2.
>José.

Excellent question! Although one can't perform your experiment with ELOStat
directly (because it only reads PGN files), I can run it with code I have
written that simulates ELOStat. If I assume an average rating of 2000:

ELOStat Results:
  Case 1. A = 2920, B = 2000, C = 1080
  Case 2. A = 2694, B = 2000, C = 1306

This is a problem I've known about in ELOStat. The problem comes from ELOStat
using the "average opponent" approach, which isn't strictly accurate because
of the non-linearity of the Elo formula. (Example: If I am rated 2000 and I
play someone rated 2400, I should score about 9%. If instead I play two
people, one rated 2000 and the other rated 2800, I should score about 25% -
roughly 50% against the 2000 and 1% against the 2800 - even though the average
opponent rating is still 2400.)
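
To make the non-linearity concrete, here is a small Python sketch (my own
illustration of the standard Elo expected-score formula, not ELOStat's actual
code) that computes both situations:

    # Standard Elo expected score for a player rated r against an
    # opponent rated opp.
    def expected(r, opp):
        return 1.0 / (1.0 + 10.0 ** ((opp - r) / 400.0))

    r = 2000

    # One opponent rated 2400: about a 9% expected score.
    print(expected(r, 2400))                            # ~0.091

    # Two opponents averaging 2400 (one 2000, one 2800): about 25%,
    # because the formula is non-linear in the rating difference.
    print((expected(r, 2000) + expected(r, 2800)) / 2)  # ~0.255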

I have written modified code that uses a "sum over opponents" approach (the
idea was suggested to me by Walter Koroljow) to take care of this problem.
Rather than using an average opponent rating, this method sums the expected
score over all of a player's opponents and solves for the rating that matches
the player's actual score. With that modified approach I get the following:

Modified Method Results:
  Case 1. A = 2920, B = 2000, C = 1080
  Case 2. A = 2920, B = 2000, C = 1080
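
For anyone who wants to try this, here is a minimal Python sketch of the "sum
over opponents" idea as described above (an illustration under my description,
not the actual code I ran): each player's rating is found by bisection so that
the sum of his expected scores over his actual opponents equals his actual
score, and the whole pool is iterated to self-consistency around the assumed
average rating.

    # Minimal sketch of a "sum over opponents" performance calculation.
    def expected(r, opp):
        """Standard Elo expected score of a player rated r vs. opp."""
        return 1.0 / (1.0 + 10.0 ** ((opp - r) / 400.0))

    def solve_rating(opp_ratings, score, lo=-4000.0, hi=8000.0):
        """Bisect for the rating whose summed expectation equals score."""
        for _ in range(100):
            mid = (lo + hi) / 2.0
            if sum(expected(mid, o) for o in opp_ratings) < score:
                lo = mid      # expectation too low -> rating guess too low
            else:
                hi = mid
        return (lo + hi) / 2.0

    def performance_ratings(games, mean=2000.0, sweeps=200):
        """games maps player -> list of (opponent, points, num_games)."""
        ratings = {p: mean for p in games}
        for _ in range(sweeps):
            new = {}
            for p, rows in games.items():
                opps = [ratings[o] for o, _, n in rows for _ in range(n)]
                score = sum(pts for _, pts, _ in rows)
                new[p] = solve_rating(opps, score)
            # Re-anchor so the pool average stays at the assumed mean.
            shift = mean - sum(new.values()) / len(new)
            ratings = {p: r + shift for p, r in new.items()}
        return ratings

    # José's case 2: A beats B 99.5-0.5, B beats C 99.5-0.5 (100 games each).
    case2 = {
        "A": [("B", 99.5, 100)],
        "B": [("A", 0.5, 100), ("C", 99.5, 100)],
        "C": [("B", 0.5, 100)],
    }
    print(performance_ratings(case2))  # settles near A=2920, B=2000, C=1080

Adding A's 100-0 result against C for case 1 leaves the self-consistent
solution essentially unchanged, which is exactly the behavior shown in the
numbers above.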


Incidentally, here are the WMCCC performance results I found using the
modified method with an average Elo of 2300:

1.  Junior     2829
2.  Fritz      2618
3.  Tiger      2551
4.  Shredder   2545
5.  Crafty     2514
6.  Rebel      2512
7.  Goliath    2461
8.  Ferret     2429
9.  Gromit     2424
10. Gandalf    2323
11. ParSOS     2260
12. Diep       2251
13. IsiChess   2104
14. Tao        2082
15. Ruy Lopez  1994
16. Pharaon    1980
17. SpiderGirl 1915
18. XiNiX      1612

Shredder is still behind Tiger (barely), but this time ahead of Crafty and
Rebel.


