Author: Rolf Tueschen
Date: 18:55:21 05/30/02
On May 27, 2002 at 02:08:01, Bertil Eklund wrote:

>On May 26, 2002 at 21:05:21, Rolf Tueschen wrote:
>
>>On May 26, 2002 at 20:24:54, Robin Smith wrote:
>>
>>>On May 26, 2002 at 08:15:53, Rolf Tueschen wrote:
>>>
>>>>Excuse me, but you are mixing up ranking in tournament practice in a sport and
>>>>testing procedures.
>>>>
>>>>Rolf Tueschen
>>>
>>>Yes. I know. But what difference does it make? Play some games. Calculate
>>>ratings. Publish ratings. That one is done as "sport" and the other "testing"
>>>doesn't change the fundamental method for calculating and publishing ratings.
>>>The only real difference I see is that SSDF includes error bars and FIDE does
>>>not. Perhaps you would like it better if SSDF didn't include the error bars?
>>>
>>>:-)
>>>
>>>Robin
>>
>>Yes, it would be much better, but there are still better ways to make me really
>>happy with SSDF. :)
>>
>>Rolf Tueschen
>
>Yes, we all know them but you don't.
>
>Bertil

Since I had promised a few people that I would write a critical summary of the SSDF ranking, I started with a German version. From that article in Rolfs Mosaik (it is number 8 there) I will quote the following questions here. The problem is that the critique turns out rather short, because for most of the aspects I have no exact information; that is why I wrote these nine questions as the beginning of a discussion. My verdict, however, is already that the list has no validity. The whole presentation has a long tradition but no rational meaning. However, SSDF could well make several changes and give the list a better foundation.

[This is the final part of article number 8.] My translation:

# Statistics can only help to establish roughly correct numbers on a valid basis; without validity the Elo numbers resemble the fata morgana that appears to the thirsty in the desert. [Footnote: In my first part I explained that the typical Elo numbers of 2500, 2600 or 2700 are calibrated against human players, a big pool of human players, not just 15 or 20 players! So SSDF simply has no validity at all.]

# What is wrong with the SSDF statistics besides the lack of validity?

# To answer this, we clarify what is characteristic of a chess program:

# Hardware, Engine, Books, Learning tool

# What is necessary for a test experiment? Briefly: control of these four factors/parameters.

# But first we define what we want to measure, or rather what the result should be.

# We want to know how successfully the combination of hardware, engine, books and learning tool plays. Successful play is called strength.

# Here follows a list of simple questions.

# 1) SSDF always equips the new programs with the fastest hardware. Do we find out this way whether the new engine is stronger than the old one? No! Quite simply because the old engines could be as strong or stronger on the new hardware.

# 2) What is a match good for between a (new) program and an old program that is weaker in all four factors above? How could we find out which factor in the new program is responsible for the difference in strength? We couldn't know!

# 3) If as a result one program is 8 "Elo points" stronger, how could we know that this is not caused by the different opponents? We couldn't know.

# 4) How could we know whether a result with a difference of 8 points won't reverse the ranking of two programs after some further 20 games each? We couldn't know that.
# 5) SSDF does not suppress games of a match, but it does feed a match with only 5 games played into the calculation of the Elo numbers and continues the rest of the match for the next publication. How could we know that this effect does not influence the result of the current edition? We couldn't know!

# 6) SSDF often matches the newest programs against ancient ones. Why? Because variability in the choice of opponents is important for the calculation of Elo numbers? Does Kasparov therefore play against a master of about Elo 2350? Of course not! Such nonsense is not part of human chess [as a necessity of Elo numbers!]. Or should the lacking validity of the computer list be compensated by play against the weakest and most helpless opponents? We don't know.

# 7) Why does SSDF present a difference of 8 points between ranks, as in May 2002, or earlier even of 1 point, if the margin of error is +/- 30 points and more? Is it possible to detect such a difference between two programs at all? No! SSDF presents differences that possibly do not exist in reality, because they cannot be established on account of the uncertainty or unreliability of the measurement itself (see the sketch at the end of this post). So, can we believe the SSDF ranking list? No. [Not in its presented form.]

# 8) SSDF publishes only results and hints in short commentaries at what should be tested next, but the details of the test design remain unknown. What are the conditions of the tests? We don't know.

# 9) How many testers does SSDF actually have? 10 or 20? No. I have confidential information that perhaps a handful of testers are doing the main job. Where are all the amateur testers in Sweden? We don't know.

This list of questions could be continued if necessary. So, what is the meaning of the SSDF ranking list? Perhaps mere PR, because the winning program or the trio of winners could increase its sales figures. Perhaps the programmers themselves are interested in the list. We don't know. [Actually this ranking is unable to answer our questions about strength.]

[You can read my whole article (number 8) in German at http://members.aol.com/mclanecxantia/myhomepage/rolfsmosaik.html]

Rolf Tueschen
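As a rough illustration of the error-margin point in question 7, here is a minimal sketch in Python. It assumes the standard logistic Elo expectation and treats the match score as a plain binomial (draws ignored, which if anything slightly overstates the uncertainty); it is not SSDF's own calculation, it only shows how large the 1-sigma band of a measured Elo difference remains after a short match.

import math

def elo_diff(score):
    # Rating difference implied by a score fraction (standard Elo formula).
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_sigma(score, games):
    # Approximate 1-sigma error of that difference
    # (delta method applied to a binomial score estimate).
    return 400.0 / math.log(10.0) / math.sqrt(games * score * (1.0 - score))

score = 0.51   # a 51% score, i.e. roughly a +7 Elo edge
for games in (20, 40, 100, 400):
    print(f"{games:4d} games: implied diff {elo_diff(score):+.1f} Elo, "
          f"1-sigma about +/-{elo_sigma(score, games):.0f} Elo")

With 40 games the implied difference is about +7 Elo while the 1-sigma band is roughly +/-55 Elo; even after 400 games it is still about +/-17 Elo, so an 8-point gap between two list entries is well inside the noise.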