Computer Chess Club Archives


Subject: Re: The Rise and Fall of SSDF

Author: Ratko V Tomic

Date: 18:10:46 09/24/99



>>
>>The SSDF folks need to rethink thoroughly their procedure to become
>>more relevant to the users of the chess programs (if that is their
>>purpose at all). To make their figures user oriented, a single
>>dimensional rating, even if the testing procedure were made loophole free,
>>may not be the way to do it in any case. Programs should be evaluated
>>for various aspects of play, possibly on well crafted sets of tactical
>>and strategic positions in various phases of the game
>>...
>>The results of such tests would be multiple lists, each ranking the
>>programs on a different aspect of play. An empirically developed weighting
>>could then be used to correlate such arrays of numbers with the human
>>and computer play ratings.  Only one of such lists may be the rating from
>>the computer-computer play
>
>  So basically you are saying that the SSDF list is inflated?

Not exactly.

> What is your estimation of the extent of the inflation, and could
> you give the reasons  why you believe so, based on your own
> personal testing? Do your master friends for instance beat Fritz5
> repeatedly to make you believe it is not really 2500?

My point is somewhat different from your interpretation. In some ways the
computer ratings are inflated, in others deflated. My basic point is that, even
if you plug all the loopholes and flaws in the SSDF comp-comp testing, the
relative and absolute rankings will still be overly sensitive to the conditions
of evaluation, and thus somewhat arbitrary. The programs are stronger in some
aspects and weaker in others than their official ratings suggest. If you take a
strong player (say a national master) and set him against a new program, he may
lose initially. But let him play for a week or two against the program, in the
peace of his home, and then let them play the match again. The master will look
much better. Then repeat, and the computer's rating will keep dropping. Now, if
instead of a computer you had another human player, his rating would generally
not drop (it may rise or fall, but it certainly won't consistently drop as it
will for the computer).

Programs are still very naive in some ways, and human cunning will always find
the loopholes and beat any given program. It may be a bad opening line in the
program's book, a weakness in some type of position (typically a blocked
position), or general greed and lack of common sense. For example, a program
may have a better position which it could certainly win with almost no risk.
Now, if you offer it a pawn or more in order to muddy the position, giving
yourself some slight chance of a win, the program will invariably go for it
(provided it can't see any concrete loss within its depth). If the evaluation
gives the greedy move a higher score within the horizon, the program will pick
it. The so-called "value" at the leaf node is actually not a true minimax
value; it is an estimate with an error spread, but programs (or programmers)
treat it as a true value, failing to account for the error margin. So, to a
program, a move yielding a +2 pawn score with an error margin of 0.5 pawns is
inferior to a move yielding +5 pawns, plus or minus a checkmate-equivalent
score. A human player, playing for the score, will naturally pick the sure win
and bypass choices which might give the opponent a counter-chance in exchange
for some immediate material gain he doesn't really need in order to win. You
couldn't trick a human player who has a superior position and a sure win in
the endgame with a free pawn and a turbulent position, but you can trick a
program any time you spot such a chance.
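
To make the point about leaf values concrete, here is a tiny C sketch (my own
illustrative numbers and names, not taken from any actual engine) contrasting
the naive rule, "pick the highest raw score", with a rule that penalizes the
error spread of the estimate:

/* Illustrative sketch: leaf evaluations treated as estimates with an
   error spread, rather than as true minimax values.  The 1-spread
   penalty and the sample numbers are assumptions for demonstration. */
#include <stdio.h>

typedef struct {
    const char *name;
    double score;   /* evaluation in pawns */
    double spread;  /* estimated error margin in pawns */
} MoveEval;

/* Naive rule: pick the highest raw score, ignoring the spread. */
static int pick_naive(const MoveEval *m, int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (m[i].score > m[best].score) best = i;
    return best;
}

/* Risk-aware rule: compare a conservative bound, score - spread. */
static int pick_risk_aware(const MoveEval *m, int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (m[i].score - m[i].spread > m[best].score - m[best].spread)
            best = i;
    return best;
}

int main(void) {
    MoveEval moves[] = {
        { "keep the sure win",   2.0, 0.5 },  /* +2 pawns, small spread */
        { "grab the extra pawn", 5.0, 9.0 },  /* +5 pawns, huge spread  */
    };
    printf("naive pick:      %s\n", moves[pick_naive(moves, 2)].name);
    printf("risk-aware pick: %s\n", moves[pick_risk_aware(moves, 2)].name);
    return 0;
}

The naive rule grabs the extra pawn; the risk-aware rule keeps the sure win,
which is what the human with the superior position would do.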

So what I am saying is that trying to rank computers and humans on the same
one-dimensional numeric scale is no more accurate than trying to measure the
length of a jellyfish: depending on how you squeeze it, you can get many
different numbers. It is a measuring situation very sensitive to circumstances,
and if you control the circumstances and have an interest in a certain result,
you will likely get it. The reason for this instability is the great disparity
between the ways humans and programs play chess (and, to a lesser degree,
between different programs). If you have a program and play it long enough (the
duration depending on your chess strength), you'll be able to beat it and then
keep improving your score against it. Which point in such an experiment do you
pick as the "true" rating of the program? If you're selling the program, you
pick the early point; if you're selling a competitor's program, you pick the
later point. But if you're trying to help chess program users decide which
program to buy (maybe SSDF isn't interested in informing potential chess
program users), you ought to do neither.
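
As a rough illustration of how that drift shows up in the numbers, here is a
small C sketch using the standard Elo update R' = R + K*(S - E), with
E = 1/(1 + 10^((Ropp - R)/400)). The starting ratings, K factor, and the
assumed learning curve of the human are made-up figures:

/* Sketch: a program's measured rating keeps sliding as one opponent
   learns its holes.  All numbers below are illustrative assumptions. */
#include <stdio.h>
#include <math.h>

static double expected(double r, double r_opp) {
    return 1.0 / (1.0 + pow(10.0, (r_opp - r) / 400.0));
}

int main(void) {
    double prog = 2500.0, human = 2300.0;  /* assumed starting ratings */
    const double K = 16.0;
    /* Assume the human's real scoring rate against this one program
       climbs from 35% to 80% as he finds repeatable recipes. */
    for (int week = 1; week <= 10; week++) {
        double human_score = 0.30 + 0.05 * week;   /* average score per game */
        for (int game = 0; game < 10; game++) {
            double e = expected(prog, human);
            prog  += K * ((1.0 - human_score) - e);
            human += K * (human_score - (1.0 - e));
        }
        printf("after week %2d: program %.0f, human %.0f\n",
               week, prog, human);
    }
    return 0;
}

Against this one opponent the program's rating never settles; it just keeps
falling as long as the human keeps learning, which is exactly the "which point
do you pick?" problem.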

My suggestion is to examine which aspects are stable, where the measurement has
some meaning to the program users. Obviously, one would need to narrow down the
circumstances in various ways and produce many numbers, each measuring and
ranking one dimension of the program's play. The users of the programs would
then have something meaningful and could decide how important the different
strengths and weaknesses are for their needs. A postal chess player would see a
different "strength" ranking from someone preparing for a blitz tournament, and
both would see a different one from someone trying to play enjoyable chess on
vacation, or someone trying to help their kid improve their game.
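
For illustration only, here is a toy C sketch of how such per-aspect lists
might be combined with user-chosen weights. Every program name, aspect, score,
and weight in it is invented; the point is just the mechanism:

/* Toy sketch of the multi-list idea: per-aspect scores combined with
   user-chosen weights.  All data below is made up for illustration. */
#include <stdio.h>

#define N_ASPECTS 4

static const char *aspects[N_ASPECTS] =
    { "tactics", "blocked positions", "endgame", "blitz" };

typedef struct {
    const char *name;
    double score[N_ASPECTS];   /* one ranking number per aspect */
} Program;

static double composite(const Program *p, const double w[N_ASPECTS]) {
    double sum = 0.0;
    for (int i = 0; i < N_ASPECTS; i++)
        sum += w[i] * p->score[i];
    return sum;
}

int main(void) {
    Program progs[] = {
        { "Engine A", { 2650, 2350, 2450, 2600 } },
        { "Engine B", { 2550, 2500, 2550, 2450 } },
    };
    /* Different users weigh the same per-aspect lists differently. */
    double postal[N_ASPECTS] = { 0.2, 0.3, 0.4, 0.1 };  /* slow, deep games */
    double blitz[N_ASPECTS]  = { 0.4, 0.1, 0.1, 0.4 };  /* fast games */

    printf("aspects:");
    for (int i = 0; i < N_ASPECTS; i++)
        printf(" %s", aspects[i]);
    printf("\n");

    for (int i = 0; i < 2; i++)
        printf("%s: postal-weighted %.0f, blitz-weighted %.0f\n",
               progs[i].name,
               composite(&progs[i], postal),
               composite(&progs[i], blitz));
    return 0;
}

The same two programs can easily swap places in the two composite rankings,
which is the whole reason to publish the per-aspect lists rather than a single
number.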



