Computer Chess Club Archives



Subject: More philosophy and math discussion regarding different rating methods

Author: Stephen A. Boak

Date: 11:28:08 07/29/00



On July 29, 2000 at 07:27:51, blass uri wrote:

>On July 29, 2000 at 06:57:38, Stephen A. Boak wrote:
>
>>On July 29, 2000 at 06:02:41, blass uri wrote:
>>
>>>I understand the ELO system.
>>
>>No you don't--based on what I read in this posting of yours.  And I am not
>>trying to judge you harshly because your English may not be perfect.
>>
>>>The elo system does not use all the information to get the best estimate for the
>>>elo.
>>
>>Yes it does, emphatically, and scientifically (meaning objectively, without
>>subjectivity; meaning unbiased).
>>
>>>
>>>It is using only results and not the games.
>>
>>Right, since an ELO rating is all about predicting the relative results of a
>>player versus other rated players.
>
>The point is that analyzing the games may help in predicting the future results.
>
>>  I said 'results', meaning % of available
>>points scored by the player--i.e. (Wins + 0.5 * Draws)/(Total Games Rated).
>>What good is a rating system, if it doesn't predict *results*.
>>
>>An ELO rating is designed (that is its mathematical nature, its intended
>>function) to predict results *of a series of games* against opponents of known
>>ELO rating, not how well a player (or computer) will do in a particular position,
>>whether opening, middlegame, or endgame, whether closed or open, whether the
>>kings are castled on the same side, castled on opposite sides, or one or both
>>kings remain uncastled.
>
>I know the purpose of the rating system is to predict results but they assume a
>simple model.

Simplicity, objectivity, and unbiasedness are the cornerstones of good
statistics--the good old scientific method.  A complicated model leads to great
difficulty in modification, troubleshooting, and use.

Whose subjectivity shall we use?  Mogens pointed out that personal subjectivity
may throw all semblance of science out the window.  What statistics will you
rely on then?  "Based on the T. Czub method of touchy-feely review, I *think*
that A is stronger than B by approx 35 rating points."--ouch!

There are indeed other methods.  For example, we already have people who *rate*
chess programs using methods that differ from the SSDF's.  That is my
observation, not taking sides (just yet).

I am not saying the SSDF uses the best method--even if it is basically ELO
based--if you want to know how strong a program is versus *human* players.  In
that event, I think the SSDF method is of limited usefulness and of limited
accuracy.  I don't say it is worthless, only that it has serious drawbacks,
whatever its managers and supporters claim.  For the managers and supporters of
the SSDF to rely (today, years later) on an ancient set of games between ancient
versions of programs on ancient hardware, playing Swedish players using Swedish
ratings (even if ELO-based), and to say that the list (today!) is still
calibrated for accurate prediction of ratings versus humans, is ludicrous.

The lack of solid mathematics behind such a claim makes their publications and
opinions appear biased and self-serving.  In legal arenas, if you can make 10
arguments to support your case, sometimes it is best to *never mention* the
weakest arguments out of the 10.  Why?  If your stronger arguments are so shaky
that you have to rely on the weak arguments to shore them up (bolster them),
then you are in real trouble trying to be persuasive in front of a judge or
jury.  The use of weak arguments undermines the stronger arguments and one
appears to be clutching at straws (grasping at air, at anything, at nothing--for
excuses). [Note--I am not saying you made such claims about SSDF--quite the
contrary you always point out that lack of game scores and settings make the
published ratings suspect (in appearance, at least).]

The goal of an ELO system is to assign a *single* rating number, expressing the
playing strength of each player, which may be generally used to compare each
player with all other players in the rating pool.  This is one-stop shopping:
the strength figure is conveniently packed into a single rating.

That strength (or rating) shows the relative strength of one player versus the
rest of the pool on the average, and against some subset of the pool (subject to
standard deviation, normal variability, etc).

If you examine the 40-50 moves made in a typical game, for several games, will
you assign a rating for the player for each move, for each type of position
encountered, for each phase of the game (opening, middle, end)?

If a program has some strengths in some aspects of its game, and some
weaknesses, and the same is true for a human, and if some of those 'partial
ratings' are higher for the program, and some higher for the human, which is
better overall?  Won't you still have to combine them to create a single rating,
somehow, weighting each 'partial rating' in a manner to create a single
comparison of two ratings--one for the program, one for the human?

If A is better than B, and B is better than C, and C is better than A, then
which is best?  How can you rank them, overall, if you only can rank them by a
head to head comparison of strengths and weaknesses?  Surely the only way is to
play many opponents with each player or program, to arrive at a *results* based
conclusion of relative ratings among all pool members.
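To make that concrete, here is a minimal sketch in Python (the head-to-head
scoring rates are invented purely for illustration, not anyone's measured
data): even when pairwise comparisons form a cycle, averaging each player's
score over the whole pool still produces a single overall ranking.

# Hypothetical head-to-head scoring rates: the fraction of points the
# first player takes from the second.  A beats B, B beats C, C beats A.
head_to_head = {
    ("A", "B"): 0.60, ("B", "A"): 0.40,
    ("B", "C"): 0.60, ("C", "B"): 0.40,
    ("C", "A"): 0.55, ("A", "C"): 0.45,
}

players = ["A", "B", "C"]
for p in players:
    # Average score against every other pool member -- a *results*-based ranking.
    scores = [head_to_head[(p, q)] for q in players if q != p]
    print(p, sum(scores) / len(scores))   # A 0.525, B 0.5, C 0.475

Despite the cycle, A comes out ahead on pooled results--exactly the kind of
single overall ordering a rating is meant to give.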

Is this a perfect method--yes and no.

No, if you want to do something else with your mathematics.  You are free to do
that.  No, if you want to add some strong human review and judgement into the
rating calculations.

Yes, if you want a simple and unbiased method of creating an overall ranking and
relative measure of ability for all members of the rating pool.

Is this the *only* method of creating a rating?  Certainly not.  But it
accomplishes what it purports to accomplish (I'm speaking of the ELO system).
One should not stretch the nature of the ELO system--pointing out that it
doesn't do what it wasn't designed to do--as though that makes the rating
method invalid.  It doesn't.

>
>The model says that the ability of a player in a game has a normal distribution
>with a standard deviation of 200 (there is a linear formula for calculating the
>rating that gives almost the same result, and I know that this formula is used
>in Israel for calculating ratings).
>
>This model is wrong and does not use the results of the games correctly to get
>the best estimate.

The model can't be wrong.  It does what it purports to do.  It does that well.

Read Elo's book (or have you already?).  It is a classic of excellent reasoning
and application of mathematics and statistics.  It explains the philosophy
underlying the ELO system--very clearly in my opinion.
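
For reference, the working heart of the system boils down to two small
formulas: an expected score against a rated opponent, and an adjustment of the
rating after the actual result.  Here is a minimal sketch in Python, using the
logistic form of the curve that federations actually compute (Elo's book
develops the normal-distribution model the quote above describes; the K-factor
of 32 is just an illustrative choice):

def expected_score(rating, opponent_rating):
    # Expected fraction of points against one opponent under the Elo model.
    return 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))

def update_rating(rating, opponent_rating, score, k=32):
    # The rating moves up when the actual score (1, 0.5 or 0 for one game)
    # beats the expectation, and down when it falls short.
    return rating + k * (score - expected_score(rating, opponent_rating))

print(expected_score(2000, 2200))    # about 0.24
print(update_rating(2000, 2200, 1))  # about 2024 after an upset win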

>
>I can give a simple example: suppose player A is an unstable player who plays
>like a player with rating 2500 in 50% of the games and like a player with
>rating 2000 in 50% of the games.
>
>If A plays against strong players with rating 2500, A's score will be at least
>25%, and A's rating will be higher relative to the case where A plays against
>weak players.

Oops--you are relying on *results* again!  Good job!  You understood the
distinction between A playing strong or weak opponents, not by the *analysis*
or *game move evaluation* you proposed, but by looking instead at the *results*,
the simple win-loss-draw and % of points scored that ELO uses.  Where have you
improved upon or avoided the alleged 'flaws' of the ELO system?


What if: B2 plays against strong players with rating 2500 and gets 30% and B2
plays against weak players of 2000 rating and gets 70%?  Which is stronger, your
postulated A or my postulated B2?  And what math did you use to figure out your
answer?
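
To put rough numbers on that comparison (a sketch using the same logistic Elo
curve and the hypothetical percentages above), one can invert the
expected-score formula into a 'performance rating' against each group:

import math

def performance_rating(opponent_rating, score_fraction):
    # Invert the logistic expected-score curve: the rating that would be
    # expected to score this fraction against opponents of the given rating.
    return opponent_rating + 400.0 * math.log10(score_fraction / (1.0 - score_fraction))

print(performance_rating(2500, 0.30))  # about 2353 against the 2500 group
print(performance_rating(2000, 0.70))  # about 2147 against the 2000 group

The two groups suggest very different strengths for B2 on their own; only
pooling all the results into a single rating reconciles them--which is what
the ELO system does.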
>
>If I know it from games and A chooses to play most of the games against strong
>players then A's rating will be higher than the rating that A deserves based on
>playing against all players.

You describe a possibility that *may* occur, sure.  But under your scenario, the
*method* or test example you give is based on *bias*.  It postulates (shops for)
a suitable set of opponents relative to which the rated player is *underrated*
or *overrated*.  Into how many rating groupings will you group the potential
opponents, to do your subjective analysis testing?

Such a situation is always possible (relatively over/under rated pairings), no
matter what your choice of rating method.  This is simply natural variability at
work--in the quality of play versus higher or lower rated subsets of the rating
pool.  You can't avoid this by a miracle new rating method.

Perhaps you propose to generate relative ratings between *each possible pair* of
members in a rating pool.  How many calculations and individual, pairwise, sets
of relative ratings is that for, say, the 80,000 member USCF rating pool?  I
can't imagine the size of that number of *individual relative ratings*.
Actually, for each member of a rating pool of n players, there would be n-1
relative ratings.  The total would be n * (n - 1) / 2.  Quite a list, huh?
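
Just to make that size concrete (plain arithmetic on the 80,000 figure above):

n = 80_000                  # roughly the USCF pool size mentioned above
pairs = n * (n - 1) // 2    # one relative rating per unordered pair of players
print(pairs)                # 3,199,960,000 -- about 3.2 billion pairwise ratings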

And how many games would you have to observe in order to have a reasonably high
level of confidence in each of those relative ratings?

Now the beauty of simplicity is shown in living color.


>
>>
>>This bears repeating--an ELO rating does not even attempt to determine how well
>>a rated player (even if a computer) will do in various phases of the game, or in
>>various types of positions.  All those details are subsumed in the strength
>>assigned to the player based on the bottom line--results: Win, Draw or
>>Loss--against rated opponents.  Against multiple rated opponents, not just one
>>opponent, since an ELO rating is a relative measure of one player's strength
>>versus many other players in the same rating pool, versus the average strength
>>of the pool on the whole.
>>
>>Analyzing in detail the playing styles and skills, contrasting two different
>>players (human or computer), assigning personal rating numbers to programs based
>>on their hodge-podge (mix) of skills and abilities across many positions, is all
>>well and good.  But that is not what the ELO system is designed to do.  That is
>>a personal system one may devise (for perfectly valid reasons)--but not *the ELO
>>system* of Dr. Arpad Elo.
>>
>>>
>>>I am sure that it is possible to write a rating-calculation program that will
>>>give a better estimate for the rating by not only counting the results but also
>>>by analyzing the games and the evaluations of the programs.
>>
>>The ELO system is not designed to predict perfectly a single game, or even the
>>results of playing a single opponent (for example, A versus B only).  It is
>>designed to predict the % score expected when A plays B, C, D, E, F, G, etc.,
>>all with known ELO ratings and a known ELO rating average.
>
>I know and I believe that it is possible to predict better the results of A
>against B,C,D... if you also analyze the games.
>
>>
>>Because the ELO system presupposes natural variability, it doesn't guarantee any
>>particular score, against any particular individual (nor against any particular
>>field of opponents).
>>
>>The ELO system doesn't only predict results.  It handles the adjustment of the
>>player's rating, according to recent *results*.
>>
>>It adjusts an ELO rating up, when the % of points scored is higher than that
>>predicted by the relative ELO ratings of the player and each of his opponents.
>>It adjusts an ELO rating down, when the % of points scored is lower than
>>predicted by the relative ELO ratings of the player and each of his opponents.
>>
>>>
>>>It is not simple to do this program and I am not going to do it but it is
>>>possible.
>>>
>>>
>>>Here is one example when you can learn from analyzing games things that you
>>>cannot learn from watching results without games:
>>
>>I agree you can learn things from watching the details (move choices) of a
>>game--about both players.
>>
>>>
>>>Suppose you see that in one game program A outsearched program B and got
>>>advantage by the evaluation of both programs.
>>>
>>>The evaluation of both programs was wrong and program A lost

Here you are discussing *results* (1, 0.5, 0), just like ELO-based systems.

>>> because the
>>>position that both programs evaluated as clear advantage for A was really a
>>>losing position for A.
>>>
>>>If you analyze the game you can understand it and increase A's rating based on
>>>this game.

Increase A's rating relative to what other player(s)?  B only, or the pool at
large?  Or a subset of the pool at large (say, 2500 rated players or above)?

>>
>>Uri, if the positional skills of program B outweigh (normally) the increased
>>search capability of program A, then it is possible that program B is stronger
>>than program A.  By stronger, I mean that B will achieve better results than A,
>>in a head to head competition.
>>
>>Perhaps A outsearches B only on rare occasions (even in several observed games
>>in a row).  Or A outsearches B (as in your own given example) but A doesn't win
>>the game (as in your example).  How can you conclude A is better (based on
>>deeper search) when the *results* of that search didn't obtain a victory?
>
>I can see in the game that both programs did not understand the position so
>better positional understanding was not relevant in the game.
>
>I am talking about a case that program A went to a lost position because it
>outsearched the opponent.
>
>If program A won a rook by outsearching the opponent and did not see that B has
>a mating attack, and the evaluation of B also shows that B did not understand
>that it has a mating attack, then there was no superior positional understanding
>by B.
>

But also, there was no superior search capability (we say in basketball here--no
harm, no foul!), since neither had adequate search capability to truly assess
the position.  :)

Isn't it possible that A can select a move based on a better deep/fast search
capability, and B can select a move based on better positional understanding,
and both A and B can be fooled?  Neither is perfect, but who wins the clash of
differing strengths and weaknesses?  *Results*, my friend, will again tell the
tale, based on unbiased simplicity, just like the ELO system.

>Uri


