Computer Chess Club Archives


Subject: The Rise and Fall of SSDF

Author: Ratko V Tomic

Date: 07:50:11 09/23/99


>> The fact that it is at the top of SSDF just shows
>> how much they have [deteriorated] in their methodology of testing.
>> (That list used to be quite useful 5-6 years ago.)
>
> Justification for your last statement would be interesting indeed.
>

I have used chess programs and dedicated units since the early 1980s. Since
finding the SSDF list in the ICCA Journal, I consulted it regularly, and it
matched fairly well the perceived strength of the programs in human
games (against myself, my brother and a few friends, ranging in rating
from 1900 to 2300 ELO; my rating was somewhere in the middle). Several
years ago, as fast PCs spurred the PC program market, the SSDF somehow
lost it. Their list no longer matches the strength perceived in
human play.

I can only speculate as to why that is so. One reason may be that they have
overly mechanized their testing; common sense and human intelligence are
no longer in charge, so you get series of repeated games, killer books,
and every imaginable form of cheating by the financially motivated
manufacturers.

I think that automated machine-vs-machine play gives a good estimate of
the relative strengths of two versions of the same program, or of two
very similar programs. But as the programs have become more diverse, the purely
numeric SSDF results have become a measure of the relative cleverness of
the various gimmicks devised by the manufacturers specifically for that type
of mindless testing, all of which has little bearing on the actual
strength delivered to the end user. A simple poll or interviews among
users and beta testers might have more value in estimating the actual
strengths of the programs than such a mindless, loophole-prone procedure.
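
For two versions of the same program, the implied rating gap follows directly
from the match score via the usual Elo logistic model. A minimal sketch in
Python (my own illustration with made-up numbers, not the SSDF's actual
computation):

  # Standard Elo relation between match score and rating difference.
  # Illustrative only; the example figures below are invented.
  import math

  def rating_difference(points, games):
      """Rating gap implied by a match result (draws count as 0.5 points)."""
      p = points / games
      if p <= 0.0 or p >= 1.0:
          raise ValueError("a 0% or 100% score gives an unbounded estimate")
      return 400.0 * math.log10(p / (1.0 - p))

  # Version B scores 30/50 against version A of the same program:
  # roughly +70 ELO for B, with a wide error margin at only 50 games.
  print(round(rating_difference(30, 50)))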

The SSDF folks need to rethink their procedure thoroughly to become
more relevant to the users of the chess programs (if that is their
purpose at all). To make their figures user oriented, a single-dimensional
rating may not be the way to do it in any case, even if the testing
procedure were made loophole-free. Programs should be evaluated
on various aspects of play, possibly on well crafted sets of tactical
and strategic positions in various phases of the game (new ones done
from scratch in each test cycle, of course, just as tests in schools
are not repetitions of the same questions year after year). The so-called
standard test positions are really meant for the research community, where
cheating (at least cheating via cheap tricks) is not an issue,
since the published algorithms are available to everyone and the results
must be reproducible. Such fixed, publicly known tests are obviously
not suitable for measuring commercially motivated, proprietary software
(with its inner workings available only to the manufacturer). If there
is a buck to be made, and there are loopholes, they will be used.
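
To give an idea of what one such aspect test might look like in practice,
here is a hypothetical Python sketch that tallies a program's answers against
an EPD-style suite using the standard "bm" (best move) opcode; the file name,
the answer format, and the lack of move-notation normalization are all
simplifications of my own:

  # Hypothetical sketch: count how many suite positions a program solved.
  # Assumes lines of the form "<position> bm <move>; ..." and a dict of the
  # program's replies keyed by the same position string.

  def load_suite(path):
      suite = {}
      for line in open(path):
          if " bm " in line:
              position, rest = line.split(" bm ", 1)
              suite[position.strip()] = rest.split(";")[0].strip()
      return suite

  def score(suite, answers):
      solved = sum(1 for pos, best in suite.items() if answers.get(pos) == best)
      return solved, len(suite)

  # suite = load_suite("tactics_cycle_1999.epd")   # fresh positions each cycle
  # print(score(suite, engine_answers))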

The results of such tests would be multiple lists, each ranking the
programs on a different aspect of play. An empirically developed weighting
could then be used to correlate such arrays of numbers with the human
and computer play ratings. Only one of those lists would be the rating from
computer-computer play (with some loopholes plugged and more even
hardware, e.g. they at least ought to swap hardware and testers between
the contestants, so that each plays an equal number of games on each
hardware and with each tester; also, a proprietary autoplayer from one of the
contestants should be banned: if someone won't supply a generic/public
autoplaying capability, they should be put at the bottom of the comp-comp
list with a note that the manufacturer refused to have the program tested in
direct play against other programs, which is telling enough). With such
multiple lists of figures available, users can weight the aspects that
matter to them as they wish.
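
As a rough picture of how such an array of figures could be folded into
whatever single number a given user cares about, here is a hypothetical
Python sketch; the aspect names, scores, weights and ELO anchor points are
all made up for illustration:

  # Hypothetical per-aspect scores for two programs (0-100 scales, plus a
  # comp-comp ELO figure), combined with user-chosen or empirically fit weights.
  aspect_scores = {
      "ProgramA": {"tactics": 82, "strategy": 74, "endgame": 69, "comp_comp": 2610},
      "ProgramB": {"tactics": 77, "strategy": 81, "endgame": 75, "comp_comp": 2585},
  }
  weights = {"tactics": 0.3, "strategy": 0.3, "endgame": 0.2, "comp_comp": 0.2}

  def combined(scores, weights):
      # Normalize the comp-comp ELO onto a 0-100 scale before weighting
      # (2400 and 2700 are arbitrary anchors for this illustration).
      norm = dict(scores)
      norm["comp_comp"] = 100.0 * (scores["comp_comp"] - 2400) / (2700 - 2400)
      return sum(weights[k] * norm[k] for k in weights)

  for name, scores in aspect_scores.items():
      print(name, round(combined(scores, weights), 1))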

Instead of taking a defensive posture when someone points out
potential loopholes and flaws, the SSDF folks ought to welcome
such critique and fix the problems, rather than spend their efforts
weaving rationalizations for the flaws based on the most benevolent
and naive assumptions about the contestants (they need to look at the
contestants the way an old teacher looks at students during
an important test, knowing that they'll cheat given the slightest opening,
and that even without one they'll make a new one and cheat anyway). SSDF seems
to be more afraid of offending the CM or Fritz manufacturer (by putting
them at the bottom of the list with a note about their refusal to be fairly
tested against the other programs) than of duping the hundreds of thousands
of list readers around the world, as if their fundamental purpose
had shifted from serving the users of the programs to being a promotional
tool for the largest chess program manufacturers.



