Author: Dann Corbit
Date: 11:10:11 08/24/01
On August 24, 2001 at 12:48:59, José de Jesús García Ruvalcaba wrote:

>On August 24, 2001 at 12:36:18, Jeff Lischer wrote:
>
>>>
>>>Hi Uri,
>>>please try the following experiment with Elostat.
>>>1. Players A, B, and C play each other, with the following individual results:
>>>A beats B 99.5 to 0.5
>>>B beats C 99.5 to 0.5
>>>A beats C 100 to 0
>>>Which ratings do you get for A, B and C using Elostat?
>>>
>>>2. The same players, but with the following results:
>>>A beats B 99.5 to 0.5
>>>B beats C 99.5 to 0.5
>>>Same question as for part 1.
>>>
>>>If the program behaves correctly, the rating of A for part 1 should not be
>>>lower than the rating of A for part 2.
>>>José.
>>
>>Excellent question! Although one can't perform your experiment with ELOStat
>>directly (because it only reads in PGN files), I can run it with code I have
>>written simulating ELOStat. If I assume an average rating of 2000:
>>
>>ELOStat Results:
>> Case 1. A = 2920, B = 2000, C = 1080
>> Case 2. A = 2694, B = 2000, C = 1306
>>
>>This is a problem I've known about with ELOStat. The problem comes from
>>ELOStat using the "average opponent" approach, which isn't strictly accurate
>>because of the non-linearity of the Elo formula. (Example: If I am rated 2000
>>and I play someone rated 2400, I should score about 9%. If I play two people,
>>one rated 2000 and the other 2800, I should score about 25%.)
>>
>>I have written modified code that uses a "sum over opponents" approach (the
>>idea was suggested to me by Walter Koroljow) to take care of this problem.
>>Rather than using an average opponent rating, this method sums over all a
>>player's opponents and calculates the expected rating of the player. With that
>>modified approach I get the following:
>>
>>Modified Method Results:
>> Case 1. A = 2920, B = 2000, C = 1080
>> Case 2. A = 2920, B = 2000, C = 1080
>>
>
>Thanks! I consider this correct. I assume that Elostat is a fine tool which
>works well most of the time, but which fails in some odd cases.
>
>>
>>Incidentally, here are the WMCCC performance results I found using the
>>modified method, using an average Elo of 2300:
>>
>>1. Junior 2829
>>2. Fritz 2618
>>3. Tiger 2551
>>4. Shredder 2545
>>5. Crafty 2514
>>6. Rebel 2512
>>7. Goliath 2461
>>8. Ferret 2429
>>9. Gromit 2424
>>10. Gandalf 2323
>>11. ParSOS 2260
>>12. Diep 2251
>>13. IsiChess 2104
>>14. Tao 2082
>>15. Ruy Lopez 1994
>>16. Pharaon 1980
>>17. SpiderGirl 1915
>>18. XiniX 1612
>>
>>Shredder is still behind Tiger (barely), but this time ahead of Crafty and
>>Rebel.
>
>Well, Uri has a small point then. I do not think six rating points mean a lot,
>but they are there. Still, this is very different from the huge rating
>advantage Elostat gave to Tiger over Shredder.

With the number of games played, the Elo figures are nearly MEANINGLESS. The
error bars mean that a program from among the weakest could really be the
strongest. With +/- 200 Elo for each program (even at one standard deviation),
you can see how the figures could easily be shaken up. You can calculate a TPR
and all that, but the significance of the Elo figures is moot.
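To make Jeff's non-linearity example and the "sum over opponents" idea concrete, here is a minimal Python sketch. This is my own illustration, not Jeff's actual code; the helper names expected_score and performance_rating are made up. It reproduces the ~9% vs ~25% expectations and, for case 2 with B pinned at 2000, recovers A at about 2920 and C at about 1080:

    def expected_score(r_player, r_opponent):
        # Standard Elo expectation for r_player against r_opponent.
        return 1.0 / (1.0 + 10.0 ** ((r_opponent - r_player) / 400.0))

    # The non-linearity example from the post: a 2000 player should score
    # about 9% against a single 2400 opponent ...
    print(expected_score(2000, 2400))                       # ~0.091
    # ... but about 25% on average against a 2000 and a 2800 opponent,
    # even though their average rating is also 2400.
    print((expected_score(2000, 2000) + expected_score(2000, 2800)) / 2)  # ~0.255

    def performance_rating(opponents, points):
        # "Sum over opponents": find the rating R whose summed expected
        # score against the listed opponents equals the points actually
        # scored. Bisection works because the sum is monotone in R.
        lo, hi = -4000.0, 8000.0
        for _ in range(100):
            mid = (lo + hi) / 2.0
            if sum(expected_score(mid, r) for r in opponents) < points:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Case 2 of the A/B/C experiment, with B pinned at 2000 (Jeff's real
    # code presumably iterates all three ratings until self-consistent):
    print(round(performance_rating([2000] * 100, 99.5)))    # A: ~2920
    print(round(performance_rating([2000] * 100, 0.5)))     # C: ~1080

The average-opponent shortcut fails exactly because expected_score is not linear in the rating difference, so replacing a mixed field with its average rating changes the expected total.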
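And to put a rough number on the error-bar point: assuming independent games and a binomial score model (my own back-of-the-envelope sketch, and the game counts below are just examples, not the actual WMCCC schedule), the one-standard-deviation uncertainty of a performance rating looks like this:

    import math

    def elo_error_1sd(p, n):
        # Rough 1-SD error bar, in Elo points, on a performance rating
        # after n independent games at scoring rate p. The slope term is
        # dR/dE of the inverse Elo curve R = 400*log10(p/(1-p)).
        se_score = math.sqrt(p * (1.0 - p) / n)            # SE of the mean score
        slope = 400.0 / (math.log(10.0) * p * (1.0 - p))   # Elo points per unit of score
        return slope * se_score

    print(round(elo_error_1sd(0.5, 7)))    # ~131 Elo after 7 games
    print(round(elo_error_1sd(0.5, 17)))   # ~84 Elo after 17 games

The bar widens sharply for lopsided scores (p near 0 or 1), which is where figures on the order of +/- 200 come from, so reshuffles of the table above are entirely plausible.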