Computer Chess Club Archives



Subject: Re: Odds of ratings drift being 196 points off of FIDE ratings?

Author: Stephen A. Boak

Date: 19:21:58 11/29/99



On November 29, 1999 at 20:07:51, Robert Hyatt wrote:

>On November 29, 1999 at 18:30:26, Charles Unruh wrote:
>
>>On November 29, 1999 at 18:08:38, Tim Mirabile wrote:
>>
>>>On November 29, 1999 at 14:37:31, Charles Unruh wrote:
>>>
>>>>On November 29, 1999 at 14:08:46, Enrique Irazoqui wrote:
>>>
>>>>>Why don't we forget about absolute ratings and make them program-relative
>>>>>instead?
>>>>
>>>>Because most of us as well as most consumers are not as concerned with the
>>>>relative strength of comps vs comps as we are with comps vs humans.
>>>
>>>Then we should be playing hundreds of games with the programs against humans of
>>>various strengths above and below what we think is the approximate strength of
>>>the program.
>>
>>Well of course!  That's what everyone wants but it's not going to happen, we
>>don't have that so we have to use what's available.
>>
>>>Anything less will not give you what you want to an accuracy
>>>of better than a few hundred points.
>>
>>The SSDF was originally based on games vs humans; we can't calibrate the current
>>ratings back exactly, but the drift should be able to be calculated within at
>>least +/- 50 or so.
>
>
>based on what statistical theory?  The rating pools have changed dramatically.
>The ratings have nothing to do with the original SSDF ratings now, as none of
>the original 'pool' remains active...

I have the same question as Hyatt.

Here's the scenario I envision:

Assume programs are a combination of abilities in strategical, positional and
tactical move selection.

I use 'strategical' to mean a capability of long term assessment that can help
steer the program into positions that are likely to be more favorable (or less
unfavorable) than the current position, and avoid steering the program into
positions that are likely to be less favorable (or more unfavorable).

Assume programs are very weak in strategical capability, only fair to middling
in the positional capability, and strong in the tactical capability.

Rate these programs initially by competing them in many, many games against a
pool of chess players rated in some rating system, FIDE for example.
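
(For reference, the basic Elo bookkeeping behind such a calibration looks
roughly like this.  This is purely my own sketch; the K-factor and the
ratings in the example are made-up numbers for illustration.)

def expected_score(r_program, r_opponent):
    # Standard Elo expected score for the program against one opponent.
    return 1.0 / (1.0 + 10 ** ((r_opponent - r_program) / 400.0))

def update_rating(r_program, r_opponent, score, k=20):
    # One-game update: score is 1 for a win, 0.5 for a draw, 0 for a loss.
    return r_program + k * (score - expected_score(r_program, r_opponent))

# Example: a program provisionally rated 2300 beats a 2400-rated human
# and gains roughly 13 points.
new_rating = update_rating(2300, 2400, 1.0)

Play enough such games against enough rated humans and a program's rating
settles near its true strength relative to that human pool.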

Now after obtaining initial ratings based on games against FIDE-rated human
players, begin to improve the capabilities of the programs.

Since programming strategical and positional ability in software is apparently
more difficult than programming tactical calculations, assume all the programs
are much improved tactically, step by step, over a long time, while the
strategical capabilities are hardly improved at all, and the positional
capabilities are very slowly improved.

As various programs are improved, assume the new versions compete only
against the older program versions with established FIDE ratings (with
essentially no more games against FIDE-rated humans).

Since the newer versions will be about the same strength strategically (still
very weak compared to better human players), only slightly better positionally
(perhaps still only moderate, compared to better human players), but much better
tactically than the older rated versions (and much improved versus better human
players), the newer versions will outplay the older versions (largely due to
tactical improvement) and gain rating points.

Over time, assume the newer programs are competed only against relatively
recent older programs (never against the very oldest programs).

As the program generations are released and tested against recent other program
generations, the newer programs continue to climb in rating--being always
better, largely for tactical reasons, than their predecessors.

However, since the programs are not improving in the strategical area, and only
very slowly in the positional area, they may *not* be gaining as much relative
strength in real life, versus humans, as they are gaining in ELO growth from
continued play versus computer programs only (no humans).
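
A toy calculation of this drift (again my own sketch; every number below is
invented purely for illustration): suppose each new version scores about 60%
against its immediate predecessor, so the rating list credits it with roughly
70 more Elo points, while its real strength against humans improves only a
little.

import math

def elo_gap_from_score(score):
    # Rating difference implied by an expected score (inverse of the Elo curve).
    return -400.0 * math.log10(1.0 / score - 1.0)

pool_rating = 2350.0      # first version, calibrated against humans
vs_humans   = 2350.0      # assumed real strength against humans

for generation in range(1, 9):
    pool_rating += elo_gap_from_score(0.60)   # about +70 per generation
    vs_humans   += 20.0                       # assumed much smaller real gain
    print(generation, round(pool_rating), round(vs_humans))

After eight generations the pool rating has drifted several hundred points
above the assumed strength against humans, even though every individual
result inside the computer pool was perfectly legitimate.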

When the later generations are finally tested again in many games versus
stronger humans, their tactical skills hold up, and may even surpass the
skill of some very strong human players (especially under faster time
controls, or in the complicated positions that sometimes arise); their
positional skills may not be quite as good as those of the strongest human
opponents, since those abilities improved very slowly in the programs--even
over many years.  And the strategic skills of the computers may still be
woefully behind the strategic skills of most of the stronger human
players--since that aspect is very difficult to program, and very difficult
to weight against more short-term positional and tactical factors.

The stronger human players will pick the programs apart at the seams,
finding the strategic and positional weaknesses of the programs.  They may
even find occasional 'tactical situations' to exploit, where the programs'
null-move approach occasionally discards, without adequate analysis, quiet
moves that lead to winning or losing positions (moves a strong human would
not overlook).

This scenario points out how programs can continually raise their ratings in a
largely self-contained program pool (that was once rated in a reasonable
fashion), gaining rating points at the expense of older program versions (not at
the expense of rated humans!) by improving their 'program-type' skills (largely
tactical) versus their prior peer programs, while at the same time gaining much
less relative strength against strong humans who continue to use (and excel in)
strategical and positional understanding to combat the increased tactical skills
of software programs.

The ratings of the computer programs will be found to be inflated.  Determining
the amount of the inflation a priori (in advance), before the latest computer
programs again play many games against rated players, is highly problematic and
extremely speculative at best.

The difficulty, as pointed out by Hyatt, is that there is no exact mathematics
or statistics to show how much inflation has occurred in program ratings (versus
human ratings) when the two competing pools (computer and human) have diverged
and remained non-overlapping for lengthy periods.

There is some hope to carry out this calculation of inflation, in my opinion, in
lieu of playing huge numbers of games with many programs versus many strong,
rated human players.

That is based on having some current programs play enough rated games against
strong humans (at suitable time controls, incentives and controlled playing
conditions, as in the Rebel GM challenge matches, for example).

As an example, suppose Rebel 10 is assumed to be approximately 2500 rating
strength based on enough GM challenge matches, and a few other programs are
tested in somewhat similar fashion (perhaps even in strong human open or
closed tournaments, swiss or round robin).  The relative program ratings
within the computer pool could then be adjusted up or down, closer together
or farther apart, to tie them as a group to the 'pegged values' of the
programs tested against humans.  The adjustment would continue until the
relative ratings of humans and programs are perceived as better
estimated--i.e. inflation seems reduced with respect to computer program
ratings and human ratings (this is difficult to 'prove' without lots of
actual program-human testing, with many programs and many strong
humans)--and in accord with the human-rating-indexed programs such as
Rebel 10.
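
A rough sketch of what such a pegging adjustment might look like (all the
program names and numbers below are hypothetical, except that Rebel 10 is
pegged near 2500 as in the GM challenge assumption above):

pool_ratings = {          # ratings from all-computer play (illustrative)
    "Rebel 10": 2680,
    "Program A": 2720,
    "Program B": 2610,
    "Program C": 2550,
}

pegged = {                # ratings estimated from games against rated humans
    "Rebel 10": 2500,     # e.g. from the GM challenge matches
    "Program B": 2460,    # another hypothetical human-tested program
}

# Average gap between the two scales, measured on the pegged programs only.
offset = sum(pool_ratings[p] - pegged[p] for p in pegged) / len(pegged)

adjusted = {p: round(r - offset) for p, r in pool_ratings.items()}

Note that this only shifts the pool as a whole; it cannot correct for
programs whose relative order would change once humans are in the pool,
which is exactly the difficulty described next.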

Even this is very, very speculative, since the best tactical program, with
the highest computer rating, may possibly have worse strategic and
positional skills than a somewhat lower rated computer program, which might
render its rating against humans relatively lower than that of the lesser
computer program (which may have better strategical and positional skills).

There is no guarantee that real testing against humans will produce the same
relative rankings among computer programs as the previous
all-computer competition has produced.  This is what Ed Schroder and others have
indicated many times in many ways.  This is because we have no defined or
measured and tested scale against which to determine the relative effects of
strategic, positional and tactical skills of a program versus a strong human,
which are interrelated factors for the strength of both programs and humans.

The only true test is to pit many top programs many times against many top
humans, under reasonably controlled conditions.

--Steve Boak




