Author: Ratko V Tomic
Date: 18:10:46 09/24/99
>>The SSDF folks need to rethink thoroughly their procedure to become
>>more relevant to the users of the chess programs (if that is their
>>purpose at all). To make their figures user oriented, a single
>>dimensional rating, even if the testing procedure were made loophole
>>free, may not be the way to do it in any case. Programs should be
>>evaluated for various aspects of play, possibly on well crafted sets
>>of tactical and strategic positions in various phases of the game
>>...
>>The results of such tests would be multiple lists, each ranking the
>>programs on a different aspect of play. An empirically developed
>>weighting could then be used to correlate such arrays of numbers with
>>the human and computer play ratings. Only one of such lists may be the
>>rating from the computer-computer play.

> So basically you are saying that the SSDF list is inflated?

Not exactly.

> What is your estimation of the extent of the inflation, and could
> you give the reasons why you believe so, based on your own
> personal testing? Do your master friends for instance beat Fritz5
> repeatedly to make you believe it is not really 2500?

My point is somewhat different from your interpretation. In some ways the computer ratings are inflated, in some ways deflated. My basic point is that, even if you plug all the loopholes and flaws in the SSDF comp-comp testing, the relative and absolute rankings will still be overly sensitive to the conditions of evaluation, and thus somewhat arbitrary. The programs are stronger in some aspects and weaker in others than their official ratings suggest.

If you take a strong player (say a national master) and set him against a new program, he may lose initially. But let him play for a week or two against the program, in the peace of his home, and then let them play the match again. The master will look much better. Then repeat, and the computer's rating will keep dropping. Now, if instead of the computer you had another human player, his rating would generally not drop (it may rise or fall, but it certainly won't consistently drop as it will for the computer).

Programs are still very naive in some ways, and human cunning will always find the loopholes and beat any given program. It may be a bad opening line in the program's book, or a weakness in some type of position (typically a blocked position), or general greed and lack of common sense. For example, a program may have a better position which it could certainly win with almost no risk. Now, if you offer it a pawn or more in order to muddy the position, giving you some slight chance of a win, the program will invariably go for it (provided it can't see any concrete loss within its depth). If the evaluation gives it a higher score within the horizon, it will pick it.

The so-called "value" at the leaf node is actually not a true minimax value; it is an estimate with an error spread, but programs (or programmers) treat it as a true value (failing to estimate the error margin). So, to a program, a move yielding a +2 pawn score with an error margin of 0.5 pawns is an inferior move to a move yielding a +5 pawn score, plus or minus a checkmate-equivalent score. A human player, playing for the score, will naturally pick the sure win and bypass choices which might give the opponent a counter-chance in exchange for some immediate material gain he doesn't really need in order to win. You couldn't trick a human player who has a superior position and a sure win in the endgame with a free pawn and a turbulent position, yet you can trick a program any time you spot such a chance.
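To make the point concrete, here is a minimal Python sketch, not taken from any actual engine; the move names, scores, error margins, and functions are invented purely for illustration of the difference between picking the highest raw score and discounting that score by its uncertainty:

# Minimal sketch of the point about leaf-node scores: the evaluation at
# the horizon is an estimate with an error spread, but a plain score
# comparison treats it as exact.

from dataclasses import dataclass

@dataclass
class Candidate:
    move: str
    score: float         # evaluation in pawns at the horizon (an estimate)
    error_margin: float   # assumed spread of that estimate, in pawns

def naive_choice(candidates):
    """What the text says programs do: pick the highest raw score,
    ignoring how uncertain that score is."""
    return max(candidates, key=lambda c: c.score)

def risk_aware_choice(candidates, risk_weight=1.0):
    """Roughly what a human playing for the sure win does: discount each
    score by its uncertainty before comparing."""
    return max(candidates, key=lambda c: c.score - risk_weight * c.error_margin)

# The example from the text: a quiet +2 with a small spread versus a
# messy +5 whose real outcome could swing by a mate-sized amount.
quiet_win   = Candidate("keep the bind",      score=2.0, error_margin=0.5)
grab_a_pawn = Candidate("grab the free pawn", score=5.0, error_margin=100.0)

print(naive_choice([quiet_win, grab_a_pawn]).move)       # -> grab the free pawn
print(risk_aware_choice([quiet_win, grab_a_pawn]).move)  # -> keep the bind

The numbers are arbitrary; the only point is that the two selection rules disagree exactly in the muddy-the-position situation described above.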
So, what I am saying is that trying to rank computers and humans using the same one-dimensional numeric scale is no more accurate than trying to measure the length of a jellyfish: depending on how you squeeze it, you can get many different numbers. It is a measuring situation very sensitive to circumstances, and if you have control of the circumstances and an interest in a certain result, you will likely get it. The reason for this instability is the great disparity in the ways humans and programs play chess (as well as, to a lesser degree, between different programs).

If you have a program and play it long enough (the duration depending on your chess strength), you'll be able to beat it and then keep improving your score against it. Which point in such an experiment do you pick as the "true" rating of the program? If you're selling the program, you pick the early point; if you're selling the program of a competitor, you pick the later point. But if you're trying to help chess program users decide which program to buy (maybe SSDF isn't interested in informing potential chess program users), you ought to do neither.

My suggestion is to examine which aspects are stable, where the measurement has some meaning to the program users. Obviously, one would need to narrow down the circumstances in various ways and produce many numbers, each measuring and ranking one dimension of the program's play. The users of the programs would then have something meaningful and could decide how important different strengths and weaknesses are for their needs. A postal chess player would see a different "strength" ranking from someone preparing for a blitz tournament, and both of them a different one from someone trying to play enjoyable chess during vacations, or someone trying to help their kid improve their game.
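As a toy illustration of how such multiple lists could serve different users, here is a small Python sketch; the aspect names, scores, and weights are entirely made up and only show how different weightings over the same per-aspect numbers would produce different rankings:

# Each program gets a score per aspect of play; each user supplies weights
# reflecting what matters to them. All values here are invented examples.

aspects = ["tactics", "blocked_positions", "endgames", "blitz", "long_time_control"]

# Hypothetical per-aspect scores (e.g. percentage results on test sets).
programs = {
    "Program A": [85, 40, 60, 90, 70],
    "Program B": [70, 65, 75, 60, 80],
}

def rank(programs, weights):
    """Rank programs by a weighted sum of their per-aspect scores."""
    totals = {
        name: sum(w * s for w, s in zip(weights, scores))
        for name, scores in programs.items()
    }
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# A blitz player weights fast play heavily...
blitz_weights  = [0.3, 0.1, 0.1, 0.4, 0.1]
# ...while a correspondence (postal) player cares about long reflection.
postal_weights = [0.2, 0.2, 0.2, 0.0, 0.4]

print(rank(programs, blitz_weights))   # Program A comes out on top
print(rank(programs, postal_weights))  # Program B comes out on top

The same array of per-aspect numbers yields opposite "strength" orderings for the two users, which is the whole point of publishing the lists separately rather than collapsing them into one rating.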