Computer Chess Club Archives


Subject: Re: A question of rating schemes

Author: Robert Hyatt

Date: 21:55:53 06/19/02



On June 19, 2002 at 07:18:25, GuyHaworth wrote:

>
>There are ELO rating lists for:
>
>  people    (on the basis of human-human games ... FIDE-managed), and
>  computers (on the basis of computer-computer games)
>
>There are apparently some intrinsic problems with rating schemes, maybe
>particularly ELO which was the first, and I am looking for more information on
>this.
>
>Each list would be equally valid if N ELO points were subtracted from all
>participants ... so the absolute numbers mean nothing.  Ok, that would be easy
>to fix if there were rated people-computer games.  So ....
>
>... is there an ELO list purely on the basis of computer-human games.


There are many problems to overcome.  The SSDF has tried, on a couple of
occasions, to normalize their ratings to FIDE based on the results of a few
programs vs FIDE players.  Unfortunately, this is statistically invalid: it
shifts the SSDF numbers, but they still say nothing about how the machines
would do vs humans.  And after a few comp-vs-comp testing cycles, the human
effect is washed out and the ratings are back to their old inflated status.
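The quoted point that subtracting N points from every participant changes nothing observable follows directly from the standard Elo formulas, which depend only on rating *differences*.  A minimal sketch (the K-factor of 16 is illustrative, not anything SSDF or FIDE specific):

```python
def expected_score(r_a, r_b):
    """Expected score for A vs B; depends only on the rating difference."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a, r_b, score_a, k=16):
    """New ratings after one game (score_a: 1 = win, 0.5 = draw, 0 = loss).
    The updates are zero-sum: points gained by one side are lost by the other."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# Shifting both ratings by the same N leaves the expected score unchanged,
# so absolute numbers from two separate pools are not comparable.
assert abs(expected_score(2800, 2600) - expected_score(1800, 1600)) < 1e-12
```

Since only differences matter, each closed pool (FIDE, SSDF) has its own arbitrary zero point, which is exactly why a one-time anchoring via a few cross-pool games cannot make the two scales equivalent.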

You can't take two pools, "freeze" them, pick a few players from each to play
a few games, then use the ratings from one pool (FIDE) to set the ratings of
the few common opponents from the other pool (SSDF), and then normalize the
rest of the SSDF pool to those computers whose ratings were normalized to a
few FIDE players.

That is an average of averages and is beyond useless.  The ratings will pass
most tests for valid random number sequences...



>
>I have also heard that there is an 'inflation effect' with ELO.  What is this -
>and has anyone an 'ELO game simulator' to demonstrate this?  I would expect that
>there are more games played in SSDF to rate the engines than contribute to the
>FIDE human ELO ratings:  is this correct?  If so, I'd expect the inflation
>effect in the SSDF list to be greater.

It is.  Just look at the top of the SSDF list.  It is unavoidable, because
each year a new and better "Kasparov" is added to the list.  In the real
world, you don't get a group of new players each year that are stronger than
everyone else.  Yet exactly that happens with computers each year...
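The quoted question asks for an "ELO game simulator" to demonstrate the inflation effect.  A toy version of the mechanism described above might look like this (all parameters, such as the 50-point strength jump per year, the 200 games per cycle, and K = 16, are illustrative assumptions):

```python
import random

def simulate_inflation(years=10, games=200, seed=1):
    """Toy pool: each year a new entrant, truly stronger than everyone before,
    joins at the pool's mean rating; then random pairings play Elo-rated games.
    True strengths and ratings are both on the Elo scale."""
    random.seed(seed)
    true, rating, tops = [0.0], [2400.0], []
    for _ in range(years):
        true.append(max(true) + 50.0)              # a new, better "Kasparov"
        rating.append(sum(rating) / len(rating))   # enters at the pool average
        for _ in range(games):
            a, b = random.sample(range(len(rating)), 2)
            # game outcome drawn from TRUE strengths...
            p = 1.0 / (1.0 + 10 ** ((true[b] - true[a]) / 400.0))
            s = 1.0 if random.random() < p else 0.0
            # ...ratings updated from CURRENT ratings (zero-sum)
            e = 1.0 / (1.0 + 10 ** ((rating[b] - rating[a]) / 400.0))
            rating[a] += 16 * (s - e)
            rating[b] += 16 * ((1.0 - s) - (1.0 - e))
        tops.append(max(rating))
    return tops

tops = simulate_inflation()
```

Because the updates are zero-sum and entrants join at the mean, the pool's average rating stays put while the widening true-strength spread pushes the *top* rating steadily upward, with no one getting a rating deducted to compensate.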

>
>Would it be good to get the Kramnik-DeepFritz computer rated in SSDF as well as
>having its match rating against Kramnik?  Presumably ChessBase are able to rate
>it against Fritz engines in SSDF.
>


It will give an estimate of Fritz's rating in a pool of two players, one
being Kramnik, the other being Fritz.  Trying to predict how Kramnik would do
against other programs by comparing his result against Fritz with Fritz's
results against other programs is again not going to work very well at all...

Elo defined a good system.  And so long as a single pool of players is used,
the ratings are amazingly consistent and their predictive power is good.  But
_everybody_ finds clever ways to corrupt the process and then claims "the
ratings can be compared."

They are wrong.



>Finally, are there better rating schemes than ELO - or are they just different.
>


Elo just formalized traditional sampling methods and outcome prediction
methods.  Nothing "new" whatsoever, other than how it is applied to chess in
particular.  It all traces back to the central limit theorem.
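On the central-limit-theorem connection: Elo's original model treated each player's per-game performance as normally distributed around his rating (his book used a standard deviation of 200 points), so the win probability comes from the normal CDF of the rating difference.  A sketch of that model, compared against the logistic curve most implementations use instead:

```python
from math import erf, sqrt

def p_normal(d, sigma=200.0):
    """Elo's normal model: performances ~ Normal(rating, sigma), so the
    performance DIFFERENCE has std sigma*sqrt(2); win probability is the
    normal CDF of d at that scale: Phi(d / (sigma*sqrt(2)))."""
    return 0.5 * (1.0 + erf(d / (sigma * 2.0)))  # erf arg: d/(sigma*sqrt(2))/sqrt(2)

def p_logistic(d):
    """The logistic approximation used by FIDE/USCF-style implementations."""
    return 1.0 / (1.0 + 10 ** (-d / 400.0))

# The two curves nearly coincide over practical rating differences.
assert abs(p_normal(200) - p_logistic(200)) < 0.01
```

Nothing chess-specific in the statistics, as the post says: it is ordinary sampling theory, with the normal assumption justified by the central limit theorem.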



>g




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.