Author: Dann Corbit
Date: 11:10:11 08/24/01
On August 24, 2001 at 12:48:59, José de Jesús García Ruvalcaba wrote:

>On August 24, 2001 at 12:36:18, Jeff Lischer wrote:
>
>>>
>>>Hi Uri,
>>>please try the following experiment with Elostat.
>>>1. Players A, B, and C play each other, with the following individual results:
>>>A beats B 99.5 to 0.5
>>>B beats C 99.5 to 0.5
>>>A beats C 100 to 0
>>>Which ratings do you get for A, B and C using Elostat?
>>>
>>>2. The same players, but with the following results:
>>>A beats B 99.5 to 0.5
>>>B beats C 99.5 to 0.5
>>>Same question as for part 1.
>>>
>>>If the program behaves correctly, the rating of A for part 1 should not be
>>>lower than the rating of A for part 2.
>>>José.
>>
>>Excellent question! Although one can't perform your experiment with ELOStat
>>directly (because it only reads in PGN files), I can run it with code I have
>>written simulating ELOStat. If I assume an average rating of 2000:
>>
>>ELOStat Results:
>> Case 1. A = 2920, B = 2000, C = 1080
>> Case 2. A = 2694, B = 2000, C = 1306
>>
>>This is a problem I've known about with ELOStat. The problem comes from
>>ELOStat using the "average opponent" approach, which isn't strictly accurate
>>because of the non-linearity of the Elo formula. (Example: If I am rated 2000
>>and I play someone rated 2400, I should score about 9%. If I play two people,
>>one rated 2000 and the other 2800, I should score about 25%.)
>>
>>I have written modified code that uses a "sum over opponents" approach (the
>>idea was suggested to me by Walter Koroljow) to take care of this problem.
>>Rather than using an average opponent rating, this method sums over all a
>>player's opponents and calculates the expected rating of the player. With that
>>modified approach I get the following:
>>
>>Modified Method Results:
>> Case 1. A = 2920, B = 2000, C = 1080
>> Case 2. A = 2920, B = 2000, C = 1080
>>
>
>Thanks! I consider this correct. I assume that Elostat is a fine tool which
>works well most of the time, but which fails in some odd cases.
>
>>
>>Incidentally, here are the WMCCC performance results I found using the
>>modified method, using an average Elo of 2300:
>>
>>1. Junior 2829
>>2. Fritz 2618
>>3. Tiger 2551
>>4. Shredder 2545
>>5. Crafty 2514
>>6. Rebel 2512
>>7. Goliath 2461
>>8. Ferret 2429
>>9. Gromit 2424
>>10. Gandalf 2323
>>11. ParSOS 2260
>>12. Diep 2251
>>13. IsiChess 2104
>>14. Tao 2082
>>15. Ruy Lopez 1994
>>16. Pharaon 1980
>>17. SpiderGirl 1915
>>18. XiniX 1612
>>
>>Shredder is still behind Tiger (barely), but this time ahead of Crafty and
>>Rebel.
>
>Well, Uri has a small point then. I do not think six rating points mean a lot,
>but they are there. Still, this is very different from the huge rating
>advantage Elostat gave to Tiger over Shredder.

With the number of games played, the Elo figures are nearly MEANINGLESS. The
error bars mean that a program from among the weakest could really be the
strongest. With +/- 200 Elo for each program (even at one standard deviation),
you can see how the figures could easily be shaken up. You can calculate a TPR
and all that, but the significance of the Elo figures is moot.
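To make Jeff's non-linearity example and the "sum over opponents" idea concrete, here is a minimal Python sketch. This is my own illustration, not Jeff's actual code; the helper names expected_score and performance_rating are made up. It reproduces the ~9% vs ~25% expectations and, for case 2 with B pinned at 2000, recovers A at about 2920 and C at about 1080:

    def expected_score(r_player, r_opponent):
        # Standard Elo expectation for r_player against r_opponent.
        return 1.0 / (1.0 + 10.0 ** ((r_opponent - r_player) / 400.0))

    # The non-linearity example from the post: a 2000 player should score
    # about 9% against a single 2400 opponent ...
    print(expected_score(2000, 2400))                       # ~0.091
    # ... but about 25% on average against a 2000 and a 2800 opponent,
    # even though their average rating is also 2400.
    print((expected_score(2000, 2000) + expected_score(2000, 2800)) / 2)  # ~0.255

    def performance_rating(opponents, points):
        # "Sum over opponents": find the rating R whose summed expected
        # score against the listed opponents equals the points actually
        # scored. Bisection works because the sum is monotone in R.
        lo, hi = -4000.0, 8000.0
        for _ in range(100):
            mid = (lo + hi) / 2.0
            if sum(expected_score(mid, r) for r in opponents) < points:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Case 2 of the A/B/C experiment, with B pinned at 2000 (Jeff's real
    # code presumably iterates all three ratings until self-consistent):
    print(round(performance_rating([2000] * 100, 99.5)))    # A: ~2920
    print(round(performance_rating([2000] * 100, 0.5)))     # C: ~1080

The average-opponent shortcut fails exactly because expected_score is not linear in the rating difference, so replacing a mixed field with its average rating changes the expected total.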
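And to put a rough number on the error-bar point: assuming independent games and a binomial score model (my own back-of-the-envelope sketch, and the game counts below are just examples, not the actual WMCCC schedule), the one-standard-deviation uncertainty of a performance rating looks like this:

    import math

    def elo_error_1sd(p, n):
        # Rough 1-SD error bar, in Elo points, on a performance rating
        # after n independent games at scoring rate p. The slope term is
        # dR/dE of the inverse Elo curve R = 400*log10(p/(1-p)).
        se_score = math.sqrt(p * (1.0 - p) / n)            # SE of the mean score
        slope = 400.0 / (math.log(10.0) * p * (1.0 - p))   # Elo points per unit of score
        return slope * se_score

    print(round(elo_error_1sd(0.5, 7)))    # ~131 Elo after 7 games
    print(round(elo_error_1sd(0.5, 17)))   # ~84 Elo after 17 games

The bar widens sharply for lopsided scores (p near 0 or 1), which is where figures on the order of +/- 200 come from, so reshuffles of the table above are entirely plausible.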