Computer Chess Club Archives



Subject: Re: New rating list based upon Human games /SSDF brought back into line

Author: Stephen A. Boak

Date: 01:21:56 12/11/99



On December 10, 1999 at 11:25:58, Charles Unruh wrote:

>I wonder: is it possible that once we get 20 games of GMs against Rebel at 40/2
>(there also seems to be a good number of master and GM games vs. Junior) we
>can then recalculate/readjust the SSDF ratings by first calculating the
>ratings of other comps vs. Rebel (with the rating calculated from its 20 games)?

I suggested the same possibility in a post a few weeks ago.

There are multiple conceptual problems with this approach, as I pointed out
then.  Recalibrating the SSDF ratings based on a single, recently FIDE-rated
program, such as Rebel, would have to overcome the following objections:

1. The individual rating of each SSDF-rated computer program is not known with
respect to play against FIDE-rated humans.

2. There is no simple math formula for predicting how well a computer program
will play against humans based on its SSDF rating.

One could devise a formula, based in part on the SSDF ratings, but one could not
test such a formula for Standard Error of Estimate (SEE), or Standard Error of
Forecast, without playing many, many 40/2 games between many of those programs
and strong FIDE-rated players.  And such a formula might not hold for programs
that were not tested during its development and checkout (see the sketch after
this list for how such a fit and its SEE might be computed).

3. It is not obvious that the relative ratings of computer programs in comp-comp
play will hold for the same programs when they play humans at 40/2 time
controls.

A computer program that can outthink (let's say, play tactically better than)
other computer programs will have a higher comp-comp rating.  That same program
*may* have strategic and positional weaknesses worse than those of the programs
it beats using its tactical advantages--weaknesses that those weaker programs
fail to exploit.

By contrast, reasonably strong human players might exploit those strategic and
positional weaknesses more readily against the strongest comp-comp program than
against its lower-rated (comp-comp) competitors.  A program is a combination of
strategic, positional, and tactical abilities.  Since most programs are weak in
strategic planning, they establish their comp-comp ratings based more on their
shorter-term positional and tactical abilities.  Their strategic weaknesses are
not strongly reflected in their relative comp-comp ratings, since those ratings
emphasize their tactical skills.  When those programs play relatively strong
FIDE-rated humans, they will likely receive FIDE-type ratings that more closely
reflect their relative strategic skills (and weaknesses), because the humans
will seek to beat the computers by attacking those known weaknesses.

4. Therefore, even if the FIDE rating of one program (for example, REBEL) is
known, the relative rating spread among the many SSDF-rated computer programs
is not known with respect to their performance against FIDE-rated humans.  The
relative SSDF ratings of the various programs might turn out to be
significantly different after those programs played lots of 40/2 games with
strong human players.
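
To make point 2 concrete, here is a rough Python sketch of how one might fit
FIDE performance as a straight-line function of SSDF rating and compute the
Standard Error of Estimate of the fit.  The rating pairs are invented purely
for illustration--no such paired data exists yet, which is the whole problem:

  import math

  # Hypothetical (SSDF, FIDE-performance) pairs -- invented for illustration.
  ssdf = [2350.0, 2400.0, 2450.0, 2500.0, 2550.0]
  fide = [2280.0, 2310.0, 2370.0, 2400.0, 2460.0]

  n = len(ssdf)
  mx, my = sum(ssdf) / n, sum(fide) / n
  sxx = sum((x - mx) ** 2 for x in ssdf)
  sxy = sum((x - mx) * (y - my) for x, y in zip(ssdf, fide))
  slope = sxy / sxx
  intercept = my - slope * mx

  # Standard Error of Estimate: rms residual with n - 2 degrees of freedom.
  resid = [y - (intercept + slope * x) for x, y in zip(ssdf, fide)]
  see = math.sqrt(sum(r * r for r in resid) / (n - 2))

  print("FIDE ~= %.1f + %.3f * SSDF, SEE = %.1f Elo" % (intercept, slope, see))

The SEE only tells you how well the formula fits the programs you actually
tested; it says nothing about programs outside that sample.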

I enjoy speculating about computer program ratings and thinking of good tests to
measure their strengths and weaknesses.  The measure we chess fans want most,
however, is how well the programs will do against strong human players.  For
this there is no test as good as playing under serious tournament conditions,
with money at stake, against strong human competition (FIDE-rated).

The REBEL GM challenge matches are pretty good measures of how well REBEL does
against strong FIDE-rated humans.  We need similar tests, with many games, for
the other programs under equally serious conditions.

If computer programs as a group formed a bell-shaped curve with a known standard
deviation about the 'mean' (the 'center' of the bell curve) when the number of
programs was plotted against their corresponding FIDE ratings, then there are
math techniques (well known in statistics) that could peg the center of that
bell curve to the center of another bell-shaped curve for a group of strong,
FIDE-rated players that played those computers.  But there is no logical reason
why continuously developed, released, and tested computer programs would have
FIDE ratings that fell in such a bell-shaped pattern.  That pattern normally
comes from things found in nature that have random variations when measured.
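
For what it is worth, the simplest version of such a pegging technique does not
even need the bell curves: it just inverts the Elo expectancy formula on the
pooled score between the two groups.  A Python sketch, with an invented pooled
result purely for illustration:

  import math

  # Rating difference implied by a pooled score fraction s, from the Elo
  # expectancy formula E = 1 / (1 + 10 ** (-d / 400)), inverted:
  #   d = 400 * log10(s / (1 - s))
  def elo_offset(s):
      return 400.0 * math.log10(s / (1.0 - s))

  # Invented result: the program group scores 116 / 200 vs the human group.
  games, points = 200, 116.0
  print("Implied offset between group means: %+.0f Elo" % elo_offset(points / games))

Even this shortcut assumes the Elo model holds across both pools, which is
exactly what is in question here.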

I vote for a large, double-double round robin, with the top 12 programs versus
the best 12 human players that can be assembled.  Each human would play four
40/2 games against each program (2 as White, 2 as Black).  With the FIDE-rating
results of those 48 games per program, the relative FIDE ratings of *the
participating* programs could be established to a degree (still with some +/-
statistical uncertainty).
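
To give a feel for the size of that uncertainty, here is a rough Python sketch
(the pooled score and opponent ratings are invented for illustration) that
turns 48 game results into a performance rating and a one-sigma error bar,
assuming independent games and the Elo model:

  import math

  n = 48                 # games per program in the proposed round robin
  score = 0.5            # hypothetical pooled score fraction vs the humans
  avg_opp = 2550.0       # hypothetical average FIDE rating of the 12 humans

  # Performance: opponents' average plus the Elo offset implied by the score.
  perf = avg_opp + 400.0 * math.log10(score / (1.0 - score))

  # Delta-method error bar: d(offset)/d(score) = (400 / ln 10) / (s * (1 - s)),
  # and the score fraction itself has standard error sqrt(s * (1 - s) / n).
  sigma = (400.0 / math.log(10)) / math.sqrt(n * score * (1.0 - score))

  print("Performance: %.0f FIDE, +/- %.0f Elo (one sigma)" % (perf, sigma))

With 48 games the error bar comes out near 50 Elo points (a bit less once draws
reduce the per-game variance), which is why 'to a degree' is the right caveat.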

--Steve Boak

> In fact, we should certainly be able to find a few masters who are willing to
>play Rebel or any comp at 40/2 over a few weeks' period to go ahead and get a
>new base rating for some progs, to bring the SSDF back into line with FIDE
>ratings, as an attempt to put this GM issue (or should I say 2500 FIDE rating
>issue) to bed, and show I was totally right once and for all :)!


