Computer Chess Club Archives


Subject: Re: Comments of latest SSDF list - Nine basic questions

Author: Rolf Tueschen

Date: 18:55:21 05/30/02



On May 27, 2002 at 02:08:01, Bertil Eklund wrote:

>On May 26, 2002 at 21:05:21, Rolf Tueschen wrote:
>
>>On May 26, 2002 at 20:24:54, Robin Smith wrote:
>>
>>>On May 26, 2002 at 08:15:53, Rolf Tueschen wrote:
>>>
>>>>Excuse me, but you are mixing up ranking in tournament practice in a sport and
>>>>testing procedures.
>>>>
>>>>Rolf Tueschen
>>>
>>>Yes. I know.  But what difference does it make?  Play some games.  Calculate
>>>ratings.  Publish ratings.  That one is done as "sport" and the other "testing"
>>>doesn't change the fundamental method for calculating and publishing ratings.
>>>The only real difference I see is that SSDF includes error bars and FIDE does
>>>not.  Perhaps you would like it better if SSDF didn't include the error bars?
>>>
>>>:-)
>>>
>>>Robin
>>
>>Yes, it would be much better, but there are still better ways to make me really
>>happy with SSDF. :)
>>
>>Rolf Tueschen
>
>Yes, we all know them but you don't.
>
>Bertil

Since I had promised a few people that I would write a critical summary of the SSDF ranking, I started with a German version. From that article in Rolfs Mosaik (it is number 8 there) I will quote the following questions here. The problem is that the critique is rather short in effect; for most of the aspects I have no exact information, which is why I wrote these nine questions as the beginning of a conversation. My verdict, however, is already that the list has no validity. The whole presentation has a long tradition but no rational meaning. Still, SSDF could well make several changes and give the list a better foundation.

[This is the final part of article number 8.]

My translation:

# Statistics can only help to establish roughly correct numbers on a valid
basis; without validity, the Elo numbers resemble the fata morgana that appears
to the thirsty in the desert. [Footnote: In my first part I explained that the
typical Elo numbers of 2500, 2600 or 2700 are calibrated against human players,
a big pool of human players, not just 15 or 20 players! So the SSDF list simply
has no validity at all.]

# What is wrong in the SSDF statistics besides the lack of validity?

# To answer this, we first clarify what characterizes a chess program.

# Hardware
  Engine
  Books
  Learning tool

# What is necessary for a test experiment?
Briefly: control of these four factors/parameters.

# But first we define what we want to measure, or rather what the result
should be.

# We want to know how successful the combination of hardware, engine, books
and learning tool plays. Successful play is called strength.
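
As background, the Elo model underlying all such lists converts a rating
difference into an expected score per game. This is the standard formula, not
anything SSDF-specific; the sketch below is my own illustration:

```python
def expected_score(d: float) -> float:
    """Expected score per game for the side rated d Elo points higher."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

# An 8-point edge is almost invisible at the board:
print(round(expected_score(8), 3))    # -> 0.512, barely above an even 0.5
print(round(expected_score(100), 3))  # -> 0.64
```

Note how tiny the practical difference behind an 8-point gap is; this matters
for several of the questions below.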

# Here follows a list of simple questions.

# 1) Each time, SSDF equips the new programs with the fastest hardware. Do we
find out this way whether the new engine is stronger than the old one? No!
Quite simply because the old engines could be just as strong, or stronger, on
the new hardware.


# 2) What is a match good for between a (new) program and an old program that
is weaker in all four factors above? How could we find out which factor in the
new program is responsible for the difference in strength? We couldn't!

# 3) If as a result one program is 8 "Elo points" stronger, how could we know
that this is not caused by the different opponents? We couldn't.

# 4) How could we know whether a result with a difference of 8 points won't
exactly reverse the ranking of a pair of programs after some further 20 games
each? We couldn't.
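
This point can be made concrete with a small simulation. The sketch below is
my own illustration, not SSDF's method; the 40%-win/20%-draw rates and the
40+20 game counts are assumptions. Two programs of identical strength play 40
games, then 20 more, and we count how often the apparent leader changes sides:

```python
import random

def match_score(games, p_win=0.4, p_draw=0.2):
    """Total score of side A over `games` games between equal programs."""
    score = 0.0
    for _ in range(games):
        r = random.random()
        if r < p_win:
            score += 1.0          # win for A
        elif r < p_win + p_draw:
            score += 0.5          # draw
    return score

random.seed(2002)
trials, flips = 5000, 0
for _ in range(trials):
    early = match_score(40) / 40                 # percentage after 40 games
    late = (early * 40 + match_score(20)) / 60   # after 20 further games
    if (early - 0.5) * (late - 0.5) < 0:         # apparent leader switched
        flips += 1
print(f"leader flipped in {flips / trials:.1%} of trials")
```

With equal programs, a nontrivial fraction of 40-game leads reverse after just
20 more games, which is exactly the instability the question describes.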

# 5) SSDF does not suppress games of a match, but it does feed a match with
only 5 games played into the calculation of the Elo numbers and continues the
rest of the match for the next publication. How could we know that this does
not distort the result of the current edition? We couldn't!

# 6) SSDF often matches the newest programs against ancient ones. Why? Because
variability in the choice of opponents is important for the calculation of Elo
numbers? By that logic, Kasparov should play a master of about Elo 2350? Of
course not! Such nonsense is no part of human chess [as a requirement of Elo
numbers]! Or is it that the computers' lack of validity should be compensated
by play against the weakest and most helpless opponents? We don't know.

# 7) Why does SSDF present rank differences of 8 points, as in May 2002, or
earlier even of 1 point, when the margin of error is +/- 30 points and more?
Is it possible to detect a difference between such programs at all? No! SSDF
presents differences that possibly do not exist in reality, because they
cannot be resolved given the uncertainty or unreliability of the measurement
itself. So, can we believe the SSDF ranking list? No. [Not in its presented
form.]

# 8) SSDF publishes only results and hints in short commentaries at what
should be tested next, but the details of the test design remain unknown. What
are the conditions of the tests? We don't know.

# 9) How many testers does SSDF actually have? 10 or 20? No. I have
confidential information that perhaps only a handful of testers do the main
work. Where are all the amateur testers in Sweden? We don't know.

This list of questions could be continued if necessary.

So, what is the meaning of the SSDF ranking list? Perhaps mere PR, because the
winning program, or the trio of winners, could increase its sales figures.
Perhaps the programmers themselves are interested in the list. We don't know.

[Actually this ranking is unable to answer our questions about strength.]

[You could read my whole article (number 8) in German at
http://members.aol.com/mclanecxantia/myhomepage/rolfsmosaik.html]

Rolf Tueschen





Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.