Computer Chess Club Archives


Subject: Re: Not meaningless - just not absolute

Author: Bob Durrett

Date: 05:43:12 02/14/03


On February 14, 2003 at 08:35:29, Albert Silver wrote:

>On February 14, 2003 at 07:10:40, Rolf Tueschen wrote:
>
>>Just to explain some basics for new readers, I will show why the whole list is
>>worthless. The rankings, as presented, are a matter of chance.
>>
>>Since only a few here have basic knowledge of statistics, I will explain the
>>most apparent things.
>>
>>We are told that, for instance, the first two programs are separated by 8
>>points. No matter; Stefan gets all the credit here for his first place. But is
>>it true that Shredder is stronger than Fritz?
>>
>>Here I must tell you that we simply don't know. The SSDF pretends to know,
>>but it is NOT true. How can I say such things? Easy! Look at the deviations,
>>the numbers marked + or -. We see that most programs have an expected Elo
>>varying by plus/minus about 30 points! Note that an Elo 5 points lower is about
>>as probable as the Elo finally given for the ranking!
>>
>>If you then take a look at the Elo of the opponents on the far right, you can
>>see that even for the top programs the SSDF was unable to create equal
>>conditions. This influence of different opponents also makes the 8-point
>>difference at the top meaningless.
>>
>>In sum, we can say that the SSDF failed to show exactly what they pretend to
>>show: the differences between the actual top programs. The SSDF presents a new
>>leader, but that is against its own results! So the conclusion is allowed
>>that the SSDF deliberately makes its own new number 1!
>
>Your comment that being number 1 in the list is not an absolute is completely
>correct. The SSDF doesn't claim it is a statistical absolute either, which is
>why they present the data: rating performance, number of games, AND the error
>margin.
>
>
>     THE SSDF RATING LIST 2003-02-13   90961 games played by  251 computers
>                                           Rating   +     -  Games   Won  Oppo
>                                           ------  ---   --- -----   ---  ----
>   1 Shredder 7.0  256MB Athlon 1200 MHz     2768   33   -31   547   72%  2606
>   2 Deep Fritz 7.0  256MB Athlon 1200 MHz   2760   29   -28   654   70%  2612
>   3 Fritz 7.0 256MB Athlon 1200 MHz         2740   30   -29   574   64%  2635
>   4 Chess Tiger 15.0  256MB Athlon 1200 MHz 2726   27   -26   704   64%  2623
>
>
>If they present the error margin, doesn't this *clearly* mean that the result
>may be off by that much? However, so far the current performance is 2768 SSDF
>points. How many games does a human play to get their rating? I won't even
>mention the ridiculously low requirement by FIDE to play only 9 games to get a
>first rating. Suppose I had no rating and played 100 games against a 2000 Elo
>player and I scored 75/100. My performance is roughly 2200. Is it absolute? No,
>there is a good margin of error, yet no one will question the rating and start
>telling me I'm not rated 2200, I'm just rated anywhere between 2140 and 2260. I
>see no difference. They had Shredder 7 play 547 games against other programs,
>and presented the results PLUS the error margin. It *may* still be a fraction
>weaker than Deep Fritz 7, but already it is clear that it performs better than
>Chess Tiger 15 against other computers. But even if another 200 games changed
>the top ratings to Shredder 7 = 2762 and DF7 = 2763 would anyone be so foolish
>as to claim one program is actually any stronger?? I certainly would never think
>of an opponent rated 10 points more as stronger. The fact that two such
>different playing styles achieve almost identical performances shows how rich
>and flexible chess is.
>
>                                         Albert
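
The 75/100 example in Albert's post can be checked with a short calculation. The following is only a minimal sketch, assuming the standard logistic Elo model and a plain binomial standard error that treats every game as a win or a loss (draws would narrow the interval somewhat). The function names are hypothetical and are not part of any SSDF or FIDE tool.

```python
import math

def elo_diff(score):
    """Rating difference implied by a score fraction (logistic Elo model)."""
    return 400.0 * math.log10(score / (1.0 - score))

def performance_with_margin(opp_rating, points, games, z=1.96):
    """Performance rating plus an approximate 95% interval.

    Assumption: the score fraction is binomial, so its standard error
    is sqrt(p * (1 - p) / n). The interval is computed on the score
    and then mapped to Elo, which makes it asymmetric.
    """
    p = points / games
    se = math.sqrt(p * (1.0 - p) / games)
    low_score, high_score = p - z * se, p + z * se
    return (
        opp_rating + elo_diff(p),
        opp_rating + elo_diff(low_score),
        opp_rating + elo_diff(high_score),
    )

perf, low, high = performance_with_margin(2000, 75, 100)
print(round(perf), round(low), round(high))
```

Under these assumptions the performance comes out at about 2191, close to the 2200 quoted above, with an interval of very roughly 2120 to 2280. That is somewhat wider than the 2140 to 2260 range in the post, which supports the point that 100 games still leave a sizeable margin.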


Excellent points.  The "bottom line" is that SSDF presented their findings
properly, but the problem is in interpretation.  SSDF cannot be held responsible
for errors in interpretation.

Bob D.
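
One way to read the error margins in the quoted list is to ask how likely it is that the higher-rated program is really the stronger one. The sketch below is an illustration only. It assumes the +/- columns are approximately 95% intervals, that the rating estimates are normally distributed, and that the two estimates are independent, none of which is stated in the list itself. The helper name `prob_stronger` is hypothetical.

```python
import math

def margin_to_se(plus, minus, z=1.96):
    """Approximate standard error from an asymmetric +/- margin.

    Assumption: the margin is a 95% interval, so its average
    half-width equals z standard errors.
    """
    return (abs(plus) + abs(minus)) / 2.0 / z

def prob_stronger(rating_a, se_a, rating_b, se_b):
    """Probability that A's true rating exceeds B's (normal, independent)."""
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z = (rating_a - rating_b) / se_diff
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Figures taken from the quoted SSDF list of 2003-02-13.
shredder = (2768, margin_to_se(33, -31))
deep_fritz = (2760, margin_to_se(29, -28))
chess_tiger = (2726, margin_to_se(27, -26))

p_vs_fritz = prob_stronger(*shredder, *deep_fritz)
p_vs_tiger = prob_stronger(*shredder, *chess_tiger)
print(f"Shredder 7 > Deep Fritz 7:   {p_vs_fritz:.2f}")
print(f"Shredder 7 > Chess Tiger 15: {p_vs_tiger:.2f}")
```

With these assumptions the first probability is well short of certainty (somewhere around two in three), while the second is above 0.95. That matches the reading given in this thread: the list does not settle Shredder 7 versus Deep Fritz 7, but it does separate Shredder 7 from Chess Tiger 15.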

>
>>
>>(Please note that this is not a political speech; it is simply what statistics
>>demands. The SSDF has received this criticism so often in the past, but they
>>still haven't changed their experimental setup.)
>>
>>Rolf Tueschen


