Computer Chess Club Archives



Subject: Re: The Rise and Fall of SSDF

Author: will smtih

Date: 14:04:32 09/23/99



On September 23, 1999 at 10:50:11, Ratko V Tomic wrote:

>>> The fact that it is at the top of SSDF just shows
>>> how much they have [deteriorated] in their methodology of testing.
>>> (That list used to be quite useful 5-6 years ago.)
>>
>> Justification for your last statement would be interesting indeed.
>>
>
>I have used chess programs and dedicated units since the early 1980s. Since
>finding the SSDF list in the ICCA Journal, I consulted it regularly, and it
>matched fairly well the perceived strength of the programs in human
>games (against myself, my brother, and a few friends, ranging in rating
>from 1900 to 2300 Elo; my rating was somewhere in the middle). Several
>years ago, as fast PCs spurred the PC program market, the SSDF somehow
>lost it. Their list no longer matches the perceived strength seen in
>human play.
>
>I can only speculate about why that is so. One reason may be that they have
>overly mechanized their testing: common sense and human intelligence are
>no longer in charge, and you get series of repeated games, killer books,
>and every imaginable form of cheating by the financially motivated
>manufacturers.
>
>I think that automated machine-vs-machine play gives a good estimate of
>the relative strengths of two versions of the same program, or of two
>very similar programs. But as the programs became more diverse, the purely
>numeric SSDF results have become a measure of the relative cleverness of
>various gimmicks devised by the manufacturers specifically for that type
>of mindless testing, all of which has little bearing on the actual
>strength delivered to the end user. A simple poll or interviews among
>users and beta testers might have more value in estimating the actual
>strengths of the programs than such a mindless, loophole-prone procedure.
>
>The SSDF folks need to rethink their procedure thoroughly to become
>more relevant to the users of the chess programs (if that is their
>purpose at all). To make their figures user oriented, a one-dimensional
>rating, even if the testing procedure were made loophole-free,
>may not be the way to do it in any case. Programs should be evaluated
>on various aspects of play, possibly on well-crafted sets of tactical
>and strategic positions in various phases of the game (new ones created
>from scratch in each test cycle, of course, just as tests in schools
>are not repetitions of the same questions year after year). The so-called
>standard test positions are really meant for the research community, where
>cheating (at least cheating using cheap tricks) is not an issue,
>since the published algorithms are available to everyone and the results
>must be reproducible. Such fixed, publicly known tests are obviously
>not suitable for measuring commercially motivated, proprietary software
>(with its inner workings available only to the manufacturer). If there
>is a buck to be made, and there are loopholes, they will be used.
>
>The results of such tests would be multiple lists, each ranking the
>programs on a different aspect of play. An empirically developed weighting
>could then be used to correlate such arrays of numbers with the human
>and computer play ratings. Only one of those lists would be the rating from
>computer-computer play (with some loopholes plugged and more even
>hardware; e.g., they at least ought to swap hardware and testers between
>the contestants, so that each plays an equal number of games on each
>machine and with each tester; also, the proprietary autoplayer from one of the
>contestants should be banished -- if someone won't supply a generic/public
>autoplaying capability, they should be put at the bottom of the comp-comp
>list with a note that the manufacturer refused to have the program tested in
>direct play against other programs; that is telling enough). Having
>such multiple lists of figures available, users can weight the aspects
>that matter to them as they wish.
>
>Instead of taking a defensive posture when someone points out
>potential loopholes and flaws, the SSDF folks ought to welcome
>such critique and fix the problems, instead of spending their efforts
>weaving rationalizations for the flaws based on the most benevolent
>and naive assumptions about the contestants (they need to look at the
>contestants the way an old teacher looks at students during
>an important test, knowing that they'll cheat given the slightest opening,
>and even without one, they'll make a new one and cheat anyway). SSDF seems
>to be more afraid of offending the CM or Fritz manufacturer (by putting
>them at the bottom of the list with a note about their refusal to be fairly
>tested against the other programs) than of duping the hundreds of thousands
>of list readers around the world, as if their fundamental purpose
>had shifted from serving the users of the programs to being a promotional
>tool for the largest chess program manufacturers.
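
As a rough illustration of the multi-list weighting idea in the quoted post, here is a
minimal sketch (Python; the aspect scores and reference ratings are entirely made up,
purely hypothetical) of how such an empirical weighting could be calibrated by least
squares against ratings from human play:

    import numpy as np

    # Hypothetical per-aspect scores (tactics, strategy, endgame) for four
    # programs, e.g. fraction of freshly composed test positions handled well.
    # All numbers are invented for illustration only.
    aspect_scores = np.array([
        [0.82, 0.61, 0.55],
        [0.74, 0.70, 0.63],
        [0.88, 0.52, 0.49],
        [0.69, 0.75, 0.71],
    ])

    # Reference ratings estimated from play against humans (also invented).
    human_ratings = np.array([2380.0, 2410.0, 2330.0, 2420.0])

    # Append a constant column and fit weights by least squares, so the
    # composite rating is a weighted sum of aspect scores plus an offset.
    X = np.hstack([aspect_scores, np.ones((len(aspect_scores), 1))])
    weights, *_ = np.linalg.lstsq(X, human_ratings, rcond=None)

    composite = X @ weights
    print(weights)    # fitted per-aspect weights and offset
    print(composite)  # composite ratings implied by the weighting

With more programs than aspects the fit would only approximate the reference ratings;
the point is simply that the weighting can be calibrated empirically rather than fixed
in advance, and users could substitute their own weights for the aspects they care about.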




  So basically you are saying that the SSDF list is inflated? What is your
estimate of the extent of the inflation, and could you give the reasons why
you believe so, based on your own personal testing? Do your master friends, for
instance, beat Fritz5 repeatedly, making you believe it is not really 2500?
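
On the 2500 question: the rating difference implied by a long-run score can be read off
the standard logistic Elo formula. A minimal sketch (Python; the 70% score and 2300
rating are made-up examples, not measured results):

    import math

    def implied_elo_diff(score_fraction):
        """Elo difference implied by a long-run score fraction 0 < s < 1,
        from the standard expectation E = 1 / (1 + 10**(-diff/400))."""
        return -400.0 * math.log10(1.0 / score_fraction - 1.0)

    # If, say, a 2300-rated human scored 70% against a program over many games,
    # the program would be performing near 2300 - 147 = ~2150, well below 2500.
    print(round(implied_elo_diff(0.70)))  # about 147

A handful of wins would not establish this, of course; such an estimate only becomes
meaningful over a reasonably large number of games.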


