Author: Stephen A. Boak
Date: 01:21:56 12/11/99
>On December 10, 1999 at 11:25:58, Charles Unruh wrote:

>I wonder is it possible that once we get 20 games of GMs against Rebel at 40/2
>(there also seems to be a good number of master and GM games vs Junior) that we
>can then recalculate/readjust the ratings of the SSDF by first calculating the
>ratings of other comps vs Rebel (with the calculated rating from its 20 games).

I suggested the same possibility in a post a few weeks ago. There are multiple conceptual problems with this approach, as I pointed out then. Recalibrating the SSDF ratings based on a single, recently FIDE-rated program, such as Rebel, would have to overcome the following objections:

1. The individual rating of each SSDF-rated computer program is not known with respect to play against FIDE-rated humans.

2. There is no simple math formula to predict how well a computer program will play against humans based on its SSDF rating. One could devise a formula, based in part on the SSDF ratings, but one could not test such a formula for Standard Error of Estimate (SEE), or Standard Error of Forecast, without playing many, many 40/2 games between many of those programs and strong FIDE-rated players. And that formula might not work for programs not tested during the development and checkout of the formula.

3. It is not obvious that the relative ratings of computer programs in comp-comp play will hold for the same programs when they play humans at 40/2 time controls. A computer program that can outthink (let's say play tactically better than) other computer programs will have a higher comp-comp rating. That same program *may* have strategic and positional weaknesses worse than the strategic and positional flaws of the programs it beats using its tactical advantages--weaknesses that are not taken advantage of by those weaker programs. By contrast, reasonably strong human players might exploit those strategic and positional weaknesses more readily against the strongest comp-comp program than against its lower-rated (comp-comp) competitors.

A program is a combination of strategic, positional and tactical abilities. Since most programs are weak in the strategic planning aspect, they establish their comp-comp ratings based more on their shorter-term positional and tactical abilities. Their strategic weaknesses are not strongly reflected in their relative comp-comp ratings, since those ratings emphasize their tactical skills. When those programs play relatively strong FIDE-rated humans, they will likely receive FIDE-type ratings that more closely reflect their relative strategic skills (and weaknesses), because the humans will seek to beat the computers by attacking their known weaknesses.

4. Therefore, even if the FIDE rating of one program (for example, REBEL) is known, the relative rating spread among the many SSDF-rated computer programs is not known with respect to their performance against FIDE-rated humans. The relative SSDF ratings of the various programs might actually be significantly different after those programs played lots of 40/2 games with strong human players.

I enjoy speculating about computer program ratings and thinking of tests that would accurately measure their strengths and weaknesses. The measure we chess fans want most, however, is how well the programs will do against strong human players. For this there is no test as good as playing under serious tournament conditions, with money at stake, against strong human competition (FIDE rated).
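For readers curious about the arithmetic behind the quoted 20-game proposal, here is a rough Python sketch of how a performance rating is commonly estimated from a small batch of games, using the standard Elo expected-score formula. The opponent ratings and the 9.5/20 score below are made-up numbers for illustration only, not results from the actual Rebel matches:

# Illustrative sketch only (not an official FIDE or SSDF method):
# estimate a performance rating from a set of games by finding the
# rating R at which the total Elo expected score equals the actual
# score. Opponent ratings and results here are invented examples.

def expected_score(r_own, r_opp):
    """Standard Elo expected score for one game."""
    return 1.0 / (1.0 + 10.0 ** ((r_opp - r_own) / 400.0))

def performance_rating(opp_ratings, total_score):
    """Bisect for the rating whose expected total equals the actual score."""
    lo, hi = 1000.0, 3500.0
    while hi - lo > 0.01:
        mid = (lo + hi) / 2.0
        if sum(expected_score(mid, r) for r in opp_ratings) < total_score:
            lo = mid    # expected total too low -> true rating is higher
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical example: 20 games vs. GMs averaging 2550, scoring 9.5/20.
opponents = [2550.0] * 20
print(round(performance_rating(opponents, 9.5)))   # about 2533

Note that even with a clean number like this in hand, objections 1 through 4 above still apply: the calculation says nothing about how the *other* SSDF-rated programs would fare against humans.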
The REBEL GM challenge matches are pretty good measures of how well REBEL does against strong FIDE-rated humans. We need similar tests, with many games, by the other programs under equally serious conditions.

If computer programs as a group had a bell-shaped curve with a known standard deviation from the 'mean' (or 'center' of the bell curve) when quantities were plotted versus corresponding FIDE ratings, then there are math techniques (well known in statistics) that could peg the center of that bell curve to the center of another bell-shaped curve for a group of strong, FIDE-rated players that played those computers. But there is no logical reason why continuously developed, released and tested computer programs would have FIDE ratings that fell in such a bell-shaped curve pattern. That pattern normally comes from things found in nature that have random variations when measured.

I vote for a large, double-double round robin, with the top 12 programs versus the best 12 human players that can be assembled. Each human would play 4 40/2 games against each program (2 as White, 2 as Black). With the FIDE-rating results of those 48 games per program, the relative FIDE ratings of *the participating* programs could be established to a degree (still with some +/- statistical uncertainty; see the rough sketch at the end of this post).

--Steve Boak

>In fact we should certainly be able to find a few masters who are willing to
>play Rebel or any comp at 40/2 over a few weeks' period to go ahead and get a new
>base rating for some progs to bring the SSDF back into line with FIDE ratings as
>an attempt to put this GM issue, or should I say (2500 FIDE rating issue), to bed,
>and show I was totally right once and for all :)!
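Appendix on the round-robin uncertainty mentioned above: a back-of-the-envelope Python sketch (my own rough approximation, nothing official) of the one-standard-error rating uncertainty left after 48 games against roughly even opposition. It treats each game as a 0/1 result with p near 0.5 (draws would shrink the error somewhat) and converts score error to Elo points via the slope of the expected-score curve near an even match:

# Rough approximation only. Assumes 0/1 game results with p = 0.5;
# the expected-score curve E(d) = 1/(1 + 10^(-d/400)) has slope
# ln(10)/1600 per rating point at d = 0, which converts score error
# into Elo points.

import math

def rating_std_error(n_games, p=0.5):
    se_score = math.sqrt(p * (1.0 - p) / n_games)  # SE of the mean score
    slope = math.log(10.0) / 1600.0                # d(expected)/d(Elo) at even match
    return se_score / slope

print(round(rating_std_error(48)))   # ~50 Elo points (one standard error)

So even 48 serious games per program would still leave each program's rating uncertain by roughly +/- 50 points at one standard error, which is why I say the relative ratings could only be established "to a degree."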