Author: Ratko V Tomic
Date: 07:50:11 09/23/99
>> The fact that it is at the top of SSDF just shows
>> how much they have [deteriorated] in their methodology of testing.
>> (That list used to be quite useful 5-6 years ago.)
>
> Justification for your last statement would be interesting indeed.

I have used chess programs and dedicated units since the early 1980s. Since finding the SSDF list in the ICCA Journal, I consulted it regularly, and it matched fairly well the perceived strength of the programs in human games (against myself, my brother and a few friends, ranging in rating from 1900 to 2300 ELO; my rating was somewhere in the middle). Several years ago, as fast PCs spurred the PC program market, the SSDF somehow lost it. Their list no longer matches the perceived strength from human play.

I can only speculate why that is so. One reason may be that they have overly mechanized their testing; common sense and human intelligence are no longer in charge, and you get series of repeated games, killer books, and every imaginable form of cheating by the financially motivated manufacturers. I think that automated machine vs. machine play gives a good estimate of the relative strengths of two versions of the same program, or of two very similar programs. But as the programs became more diverse, the purely numeric SSDF results have become a measure of the relative cleverness of various gimmicks devised by the manufacturers specifically for that type of mindless testing, all of which has little bearing on the actual strength delivered to the end user. A simple poll or interviews among users and beta testers might have more value in estimating the actual strengths of the programs than such a mindless, loophole-prone procedure.

The SSDF folks need to rethink their procedure thoroughly to become more relevant to the users of chess programs (if that is their purpose at all). To make their figures user oriented, a single-dimensional rating may not be the way to do it in any case, even if the testing procedure were made loophole free. Programs should be evaluated for various aspects of play, possibly on well crafted sets of tactical and strategic positions in various phases of the game (new ones done from scratch in each test cycle, of course, just as tests in schools are not repetitions of the same questions year after year). The so-called standard test positions are really meant for the research community, where cheating (at least cheating using cheap tricks) is not an issue, since the published algorithms are available to everyone and the results must be reproducible. Such fixed, publicly known tests are obviously not suitable for measuring commercially motivated, proprietary software (with its inner workings available only to the manufacturer). If there is a buck to be made, and there are loopholes, they will be used.

The results of such tests would be multiple lists, each ranking the programs on a different aspect of play. An empirically developed weighting could then be used to correlate such arrays of numbers with the human and computer play ratings (a rough sketch of such a fit is at the end of this post). Only one of those lists might be the rating from computer-computer play, with some loopholes plugged and more even hardware. E.g., they at least ought to swap hardware and testers between the contestants, so that each plays an equal number of games on each hardware and with each tester (a balanced schedule of that kind is also sketched at the end of this post). Also, the proprietary autoplayer supplied by one of the contestants should be banished -- if someone won't supply a generic/public autoplaying capability, they should be put at the bottom of the comp-comp list with a note that the manufacturer refused to have the program tested in direct play against other programs; that is telling enough. Having such multiple lists of figures available, users can weigh as they wish the aspects that matter to them.

Instead of taking a defensive posture when someone points out potential loopholes and flaws, the SSDF folks ought to welcome such critique and fix the problems, instead of spending their efforts weaving rationalizations for the flaws based on the most benevolent and naive assumptions about the contestants (they need to look at the contestants the same way an old teacher looks at the students during an important test, knowing that they'll cheat given the slightest opening, and even without any, they'll make a new one and cheat anyway). The SSDF seems to be more afraid of offending the CM or Fritz manufacturer (by putting them at the bottom of the list with a note about the refusal to be fairly tested against the other programs) than of duping the hundreds of thousands of list readers around the world, as if their fundamental purpose has shifted from serving the users of the programs to being a promotional tool for the largest chess program manufacturers.
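To make the weighting idea concrete, here is a rough sketch in Python of how per-aspect scores could be fitted to reference ratings from human play by ordinary least squares. Every program name, score and rating below is invented purely for illustration; the point is only the shape of the procedure, not the numbers.

# Rough sketch: fit per-aspect weights so that a weighted combination of
# aspect scores best matches reference ratings from human play.
# All scores and ratings are invented for illustration only.
import numpy as np

aspects = ["tactics", "strategy", "endgame", "comp_comp_elo"]

# Rows: programs; columns: per-aspect scores (hypothetical values).
scores = np.array([
    [78.0, 62.0, 55.0, 2450.0],
    [65.0, 71.0, 63.0, 2405.0],
    [80.0, 58.0, 50.0, 2500.0],
    [60.0, 66.0, 68.0, 2350.0],
    [70.0, 64.0, 60.0, 2430.0],
    [55.0, 75.0, 72.0, 2300.0],
])

# Reference ratings observed in human play (also hypothetical).
human_ratings = np.array([2390.0, 2415.0, 2370.0, 2400.0, 2410.0, 2385.0])

# Add a constant column so the fit can include an offset term.
design = np.column_stack([scores, np.ones(len(scores))])

# Solve for weights minimizing ||design @ w - human_ratings||^2.
weights, residuals, rank, _ = np.linalg.lstsq(design, human_ratings, rcond=None)

predicted = design @ weights
for name, w in zip(aspects + ["offset"], weights):
    print(f"{name}: {w:+.4f}")
print("predicted:", np.round(predicted, 1))
print("actual:   ", human_ratings)

With real data one would of course use many more programs than weights and refit the weights as the test sets change from cycle to cycle.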
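And here is a similarly rough sketch of the hardware/tester swapping mentioned above: a balanced comp-comp schedule in which every pairing is repeated with the boxes, the colors and the testers swapped, so no program benefits from a faster machine or a friendlier operator. The program, box and tester names are made up, and real testing would repeat each cell many times.

# Rough sketch of a balanced comp-comp schedule (hypothetical names).
# For every pairing and every tester, both the hardware assignment and the
# colors are swapped, so each program plays the same number of games on
# each box, with each color, under each tester.
from itertools import combinations, product

programs = ["ProgA", "ProgB", "ProgC"]
boxes = ("Box1", "Box2")
testers = ["Tester1", "Tester2"]

schedule = []
for (p, q), tester in product(combinations(programs, 2), testers):
    for hw_swap, color_swap in product((False, True), repeat=2):
        a, b = (q, p) if hw_swap else (p, q)            # a runs on Box1, b on Box2
        white, black = (b, a) if color_swap else (a, b)
        schedule.append((white, black, f"{a} on {boxes[0]}", f"{b} on {boxes[1]}", tester))

for game in schedule:
    print(game)
print(len(schedule), "games in one balanced cycle")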