Computer Chess Club Archives


Subject: Re: Odds of ratings drift being 196 points off of FIDE ratings?

Author: Len Eisner

Date: 19:48:38 11/29/99


On November 29, 1999 at 22:21:58, Stephen A. Boak wrote:

>On November 29, 1999 at 20:07:51, Robert Hyatt wrote:
>
>>On November 29, 1999 at 18:30:26, Charles Unruh wrote:
>>
>>>On November 29, 1999 at 18:08:38, Tim Mirabile wrote:
>>>
>>>>On November 29, 1999 at 14:37:31, Charles Unruh wrote:
>>>>
>>>>>On November 29, 1999 at 14:08:46, Enrique Irazoqui wrote:
>>>>
>>>>>>Why don't we forget about absolute ratings and make them program-relative
>>>>>>instead?
>>>>>
>>>>>Because most of us as well as most consumers are not as concerned with the
>>>>>relative strength of comps vs comps as we are with comps vs humans.
>>>>
>>>>Then we should be playing hundreds of games with the programs against humans of
>>>>various strengths above and below what we think is the approximate strength of
>>>>the program.
>>>
>>>Well of course!  That's what everyone wants but it's not going to happen, we
>>>don't have that so we have to use what's available.
>>>
>>>>Anything less will not give you what you want to an accuracy
>>>>of better than a few hundred points.
>>>
>>>The SSDF was originally based on games vs humans.  We can't calibrate the current
>>>ratings back exactly, but the drift should be calculable to within +/- 50 points
>>>or so.
>>
>>
>>based on what statistical theory?  The rating pools have changed dramatically.
>>The ratings have nothing to do with the original SSDF ratings now, as none of
>>the original 'pool' remains active...
>
>I have the same question as Hyatt.
>
>Here's the scenario I envision:
>
>Assume programs are a combination of abilities in strategical, positional and
>tactical move selection.
>
>I use 'strategical' to mean a capability of long term assessment that can help
>steer the program into positions that are likely to be more favorable (or less
>unfavorable) than the current position, and avoid steering the program into
>positions that are likely to be less favorable (or more unfavorable).
>
>Assume programs are very weak in strategical capability, only fair to middling
>in the positional capability, and strong in the tactical capability.

I think this is largely a myth.  The current crop of programs are much better
positionally and strategically than CCC people give them credit for.  Otherwise,
they could not beat strong masters at any time control.  Combinations are only
possible if you have a positional advantage.  So these programs must be getting
better positions against masters to make use of their tactical abilities.  Keep
in mind that I am not talking about GMs and IMs.  I'm saying that today's
programs hold their own in all aspects of the game against everyone below IM
strength, and that's pretty impressive.  If I'm wrong, show me games where the
programs play much worse than master level positionally and/or strategically,
and then come up with a shot out of the blue in an inferior position to win.
I'm not saying that programs don't make positional and strategic mistakes, they
do.  But so do masters.  Otherwise they would be IMs or GMs.  I'm also not
saying that programs have a lot of strategic and positional knowledge built in.
I don't know or care about that.  I only care about the moves they play over the
board, not whether they understand why they played them.


>Rate these programs initially by competing them in many, many games against a
>pool of chess players rated in some rating system, FIDE for example.
>
>Now after obtaining initial ratings based on games against FIDE-rated human
>players, begin to improve the capabilities of the programs.
>
>Since programming strategical and positional ability in software is apparently
>more difficult than programming tactical calculations, assume all the programs
>are much improved tactically, step by step, over a long time, while the
>strategical capabilities are hardly improved at all, and the positional
>capabilities are very slowly improved.
>
>As various programs are improved, assume the new versions compete only against
>the older program versions with established FIDE ratings (with essentially no
>more games against FIDE rated humans).
>
>Since the newer versions will be about the same strength strategically (still
>very weak compared to better human players), only slightly better positionally
>(perhaps still only moderate, compared to better human players), but much better
>tactically than the older rated versions (and much improved versus better human
>players), the newer versions will outplay the older versions (largely due to
>tactical improvement) and gain rating points.
>
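
The rating-point gain described here is just the standard Elo update.  A rough
Python sketch of that update (the K-factor and ratings below are made-up
numbers for illustration, not SSDF parameters):

  # Standard Elo expected score and one-game rating update.
  def expected(r_a, r_b):
      return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

  def update(r_a, r_b, score_a, k=16):
      return r_a + k * (score_a - expected(r_a, r_b))

  # A new version scoring 75% against its equally rated predecessor gains
  # points with every such game, whatever the source of the improvement.
  print(update(2400, 2400, 0.75))   # 2404.0 after one game
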
>Over time, assume the newer programs compete only against relatively recent older
>programs (never against the very oldest ones).
>
>As the program generations are released and tested against recent other program
>generations, the newer programs continue to climb in rating--being always
>better, largely for tactical reasons, than their predecessors.
>
>However, since the programs are not improving in the strategical area, and only
>very slowly in the positional area, they may *not* be gaining as much relative
>strength in real life, versus humans, as they are gaining in ELO growth from
>continued play versus computer programs only (no humans).
>
>When the later generations are finally tested again in many games versus
>stronger humans, their tactical skills hold up, and may even surpass the skill
>of some very strong human players (especially under faster time controls
>or in complicated positions that sometimes arise); their positional skills may
>not be quite as good as the strongest human opponents, since those abilities
>improved very slowly in the programs--even over many years.  And the strategic
>skills of the computers may be still woefully behind strategic skills of most of
>the stronger human players--since that aspect is very difficult to program, and
>is very difficult to weight versus more short term positional and tactical
>factors.
>
>The stronger human players will pick the programs apart at the seams, finding
>the strategic and positional weaknesses of the programs.  They may even find
>occasional 'tactical situations' to exploit, where a program's null-move search
>on occasion discards, without adequate analysis, quiet moves that lead to winning
>or losing positions (moves not overlooked by strong humans).
>
>This scenario points out how programs can continually raise their ratings in a
>largely self-contained program pool (that was once rated in a reasonable
>fashion), gaining rating points at the expense of older program versions (not at
>the expense of rated humans!) by improving their 'program-type' skills (largely
>tactical) versus their prior peer programs, while at the same time gaining much
>less relative strength against strong humans who continue to use (and excel in)
>strategical and positional understanding to combat the increased tactical skills
>of software programs.
>
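
A toy calculation makes the inflation mechanism concrete.  All numbers below
are invented for illustration (not measured SSDF or FIDE figures): suppose
each new version scores 70% against its predecessor, an Elo gap of about 147
points, while its real gain against humans is only 50 points per generation.

  import math

  implied_gap = 400 * math.log10(0.70 / 0.30)   # ~147 Elo per generation
  true_gap = 50                                  # assumed human-relative gain

  pool_rating = true_rating = 2300.0             # generation 0, human-calibrated
  for gen in range(1, 6):
      pool_rating += implied_gap
      true_rating += true_gap
      print(gen, round(pool_rating), round(true_rating),
            "inflation:", round(pool_rating - true_rating))

After five such generations the pool rating would be inflated by nearly 500
points relative to the assumed true human-relative strength.
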
>The ratings of the computer programs will be found to be inflated.  Determining
>the amount of the inflation a priori (in advance), before the latest computer
>programs again play many games against rated players, is highly problematic and
>extremely speculative at best.
>
>The difficulty, as pointed out by Hyatt, is that there is no exact mathematics
>or statistics to show how much inflation has occurred in program ratings (versus
>human ratings) when the two competing pools (computer and human) have diverged
>and remained non-overlapping for lengthy periods.
>
>There is some hope to carry out this calculation of inflation, in my opinion, in
>lieu of playing huge numbers of games with many programs versus many strong,
>rated human players.
>
>That is based on having some current programs play enough rated games against
>strong humans (at suitable time controls, incentives and controlled playing
>conditions, as in the Rebel GM challenge matches, for example).
>
>As an example of this, if Rebel 10 is assumed to be approx 2500 rating strength
>based on enough GM challenge matches, and a few other programs are tested in
>somewhat similar fashion (perhaps even in strong human open or closed
>tournaments, swiss or round robin), then the relative program ratings (among the
>computer pool) could be adjusted up or down, closer together or farther apart,
>to tie them as a group to the 'pegged values' of the programs tested against
>humans, until the relative ratings of humans and programs are perceived as
>better estimated (i.e. inflation seems reduced, with respect to computer program
>ratings and human ratings--this is difficult to 'prove' without lots of actual
>program-human testing, with many programs and many strong humans) and in accord
>with the human-rating-indexed programs such as Rebel 10.
>
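
A minimal sketch of this re-pegging idea, assuming a handful of programs have
both a computer-pool rating and a rating established against humans (the
names and numbers below are hypothetical placeholders, not actual SSDF or
GM-match results):

  # Shift the whole computer pool by the average gap between pool ratings
  # and human-calibrated anchor ratings.
  pool = {"ProgA": 2650, "ProgB": 2610, "ProgC": 2550, "ProgD": 2480}
  anchors = {"ProgA": 2500, "ProgC": 2430}       # from games against humans

  offset = sum(pool[p] - anchors[p] for p in anchors) / len(anchors)
  adjusted = {p: round(r - offset) for p, r in pool.items()}
  print(offset, adjusted)                        # offset = 135.0 here

Of course this only corrects a uniform offset; it cannot fix errors in the
relative ordering of the programs, which is exactly the objection raised next.
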
>Even this is very, very speculative since the best tactical program, with the
>highest computer rating, may possibly have worse strategic and positional skills
>than a somewhat lower rated computer program, which might render its rating
>against humans relatively lower than that of the lesser computer program (which may have
>better strategical and positional skills).
>
>There is no guarantee that real testing against humans will produce the same
>relative rankings among computer programs as the previous
>all-computer competition has produced.  This is what Ed Schroder and others have
>indicated many times in many ways.  This is because we have no defined or
>measured and tested scale against which to determine the relative effects of
>strategic, positional and tactical skills of a program versus a strong human,
>which are interrelated factors for the strength of both programs and humans.
>
>The only true test is to pit many top programs many times against many top
>humans, under reasonably controlled conditions.
>
>--Steve Boak


