Author: Stephen A. Boak
Date: 20:00:11 07/29/00
On July 29, 2000 at 18:13:32, Ratko V Tomic wrote:

>>>This model is wrong and does not use the results of the games correctly to get
>>>the best estimate.
>>
>> The model can't be wrong. It does what it purports to do.
>> It does that well.
>
>The statistical model (memoryless random process) which ELO computation
>assumes to undelie the variablity in results is certainly not the
>accurate model for that variablity.

I do not claim that the ELO model is the most accurate. But it is not highly inaccurate; in fact, it is the opposite--it is very accurate. Any scientific statistical model is fine, as long as you have some mathematically meaningful error measures to assess its accuracy. ELO predicts results extremely accurately (not in specific, individual games or handfuls of games, but overall, for thousands of rating pool members, tournament after tournament). A-class players finish above B-class players and below Expert-class players, over and over and over. ELO is self-correcting (over time) since ratings will be adjusted periodically, after a series of recent results. I have always said ELO lags performance, but that is nothing astounding--it works both ways, for rising players as well as falling players. The problem of adjusting ELO ratings to match changing results exists for any rating system, no matter how simple or complex.

>The most accurate model would be
>the replica of the player itself. With chess programs one can make such
>exact model, not with humans.

We are not talking about a model for predicting the exact move a program would make (although you can desire one), but a model for predicting the results of a program (win-loss-draw, ratio of points gained) versus opponents. The computer will play its move (per itself)--just let it play. Test it and you will see. That doesn't determine the result of any particular game with any great accuracy. A game is full of many moves, any one of which might be a loser or a winner, or a choice of indeterminate value. Both humans and computers have evaluation problems and horizon problems. We haven't solved 'chess', you know. Therefore knowing the exact moves a computer will play (assuming it is exhaustively possible) will not in any reasonable manner determine the outcome of a comp-comp match, let alone a comp-human one.

You misunderstand the nature of natural variation, applied to these circumstances, i.e. for testing and measurement of chess skill.

1. All testing involves natural variation. No two tests are exactly the same--ever.

2. This is self-evident in the chess world, where humans perform a bit better or worse than their ratings would predict (and sometimes the same, yes). Humans vary: in choice of openings, in plans in the position, in how much risk to take in what kinds of positions. The notion of a single rating is a bold one, but highly reasonable when natural variation is assumed and accounted for.

3. Comp-human games have variation. Partly due to human variability and the inexactness of assigning a human rating. Partly due to programs that may not play the same move in every identical position (due to nondeterministic programming in some cases, due to varying hash table contents when arriving at similar positions via different move orders, due to the number of program clock cycles--use of thinking time--differing from game to game and move to move, including variance in the time taken by the human to reply).

4. Comp-comp games have variation. Many have learners, crude though they may be, so as not to repeat lines lost previously. It would take many games to exhaust comp-comp variations, since most have large and different books, and varying styles will lead to different byways being taken in the large majority of games (I know there is sometimes an occasional duplicate game).

5. If this variability did not exist, we might not be having these discussions about improved rating systems.
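
Since "predicts results" and "self-correcting" keep coming up, here is the machinery under discussion in a few lines, for readers following along. This is only a minimal sketch of the standard Elo expected-score and update formulas; the K-factor and the ratings in the example are illustrative assumptions, not numbers from this thread.

    # Minimal sketch of the standard Elo expected-score and update rule.
    # The K-factor and the ratings below are illustrative assumptions only.
    def expected_score(r_a, r_b):
        """Expected score of player A against player B."""
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    def update(r_a, r_b, score_a, k=32):
        """Pull A's rating toward A's actual result (1, 0.5 or 0)."""
        return r_a + k * (score_a - expected_score(r_a, r_b))

    # An A-class player (1900) against a B-class player (1700):
    print(expected_score(1900, 1700))   # ~0.76, i.e. roughly 3 points out of 4
    # If the stronger player only draws, the update narrows the gap a little,
    # which is the "self-correcting" behavior described above:
    print(update(1900, 1700, 0.5))      # ~1891.7

The point being defended above is that this small amount of machinery, fed nothing but win-loss-draw results, already predicts aggregate outcomes very well.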
>The statistical model for variability
>used in ELO is the simplest nontrivial statistical model.

Simple statistical models are preferred over complex statistical models when the predictive value is adequate. This is textbook theory--never create a model containing excess factors if there is a simpler one available that is adequate to the task. Of course, we have to determine for ourselves what our task is; then we can select from competing models the one we think accomplishes it best.

>Of course, what you mean by "the model can't be wrong" is that one
>can apply correctly the inaccurate model

Models are not right or wrong. If they have any predictive power, then they have some usefulness. Show me the model that predicts results better than ELO. When you do (not simply by hypothesizing about the possibility of creating one), we surely will undergo comparative testing (using statistics, of course--which we will have to agree on!) to validate the claim. It is possible, of course. But that doesn't mean the ELO system doesn't, by and large and with great accuracy, fulfill its function--make the predictions it was designed to make.

>in the sense of recognizing
>its incorrectness. That doesn't mean it reflects the process it models
>accurately or even well or better than any other model.
>
>The Uri's assertion is that the model ELO uses for mimicking variability
>in results is by no means accurate (corresponding to the actual results)

You bandy the word 'accurate' about as though models are 'right' or 'wrong', 'accurate' or 'inaccurate'. All reasonable models of 'complex behavior and results' will be, a priori, more accurate in some circumstances and less accurate in others. You know that.

>and it isn't even the best one in present day and age.

I don't doubt that other models are possible. I have heard of some and read descriptions of some. I haven't yet seen any math or testing to show another model is 'better' than the basic ELO system (although I freely admit it is hypothetically possible--and may even exist today).

>As he suggested,
>one could in principle write a program which could extract much more
>information from the game (e.g. via analysis and scoring of each ply)
>than ELO model does and be more accurate in predicting results than ELO.

In principle, we can improve any model. In practice, we can improve any model (as a generality). What is new or exciting about this observation? I don't deny it. Program A could play the world's worst endgames--where it would lose to any chess beginner of any level. Yet if the computer plays the opening and middlegame with some skill, it may not always reach an endgame. How does that added information allow better results predictions? It is possible, certainly, but the assessment of how often it will get to a 'lost' endgame must be made. How does one assess that? By playing many complete games against many different opponents and extracting the minimum necessary bits of data--the win-loss-draw results of the games. We are full circle, back to the basics that have to be included. Complex models will have to rely on the basic results data to be developed, tested, and modified. Ratings based on complex models will have to conform (be adjusted) to reality--basic results: win-loss-draw and ratios.

>The ELO model uses about 1.58 bits of info for the entire game. The
>strength analyzer program Uri mentioned would use about 5 bits of info
>per ply, or hundreds of times more info per game about the process
>than the ELO does.

More info does not necessarily mean better predictions. If the gathering or management of the increased info makes the modeling impractical, you have to toss it out. Ruthlessly. Or you will bog down. Designing models for simplicity is a background goal for all modelers.
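
For readers who want to check the information-content figures quoted above, a back-of-the-envelope sketch. The 1.58-bit and 5-bit figures come from the quoted post; the roughly-80-ply game length is my own assumption, for illustration only.

    import math

    # A game result is one of three outcomes (win, draw, loss), so the ELO
    # calculation consumes at most log2(3) bits per game -- the 1.58 quoted above.
    bits_per_game_elo = math.log2(3)             # ~1.585

    # The hypothetical per-ply analyzer: 5 bits per ply = 32 distinct grades
    # per move. Assume a typical game of roughly 80 plies (my assumption).
    plies = 80
    bits_per_game_analyzer = 5 * plies           # 400 bits

    print(bits_per_game_analyzer / bits_per_game_elo)   # ~250x: "hundreds of times more"

Whether that extra information actually buys better predictions of results is exactly the point in dispute.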
>Of course, ELO method is from the pre-computer era, devised to do best
>one can assuming:
>
>1) evaluator need not know anything about the chess
>2) evaluator need not use anything beyond the slide rule or
>   log tables and pencil and paper to compute quickly
>   the ratings and the predictions.

Wrong emphasis, to minimize the power of the ELO method by claiming that it is old-fashioned. These are 'restrictions' of science, both old and new (which you support--I've read many of your postings on many topics): repeatable tests with repeatable conclusions, under controlled circumstances; unbiased measurements (results, not subjective quality grades on individual moves). Unbiased means even an unknowledgeable person could conduct the test according to the established criterion--and get the same result. This is one 'test' of science--that it is repeatable, where the tester is not 'special' or 'biased', causing the results to lean one way or another.

Yes, I know some statistical models use fuzzy information, or management opinion, to improve the modeling. I don't count out that possibility. But this is not always repeatable when we are judging chess moves. As pointed out, GM opinions may differ (this is what a chess game between GMs is all about--different ideas as to the move selection that will optimize the bottom-line result). We have a tendency (understatement) in science to trust unbiased rating estimates, not those that might vary with the abilities of the particular expert (a GM?) whose opinion a model relies on for some of its information.

>If you drop either or both of these upfront restrictions (which are an arbitrary
>historical accident of what technology was available at the time) you can do
>much better in terms of predicting the outcomes from the previous games. A
>trivial example of such improved prediction for comp-comp play would be to run a
>simulation of the programs on a much faster computer than what they would play
>in a competition for which you're trying to predict an outcome. Such model is
>obviously better than ELO rating.

Come on, Ratko, this is indeed a trivial example, and beneath mention! Don't grasp at straws with this banal hypothetical. You can do better than that.

A. You just proposed one argument that counters your own philosophical points. In essence you have just said that to predict results better than ELO (which is based on results), you can collect results faster (by simulating on a faster CPU), and then you will have more accurate ratings. Heh heh!

B. You indicate that if games can be played faster, by computer, then you can arrive at more results and therefore an improved confidence level (a smaller standard deviation) for the ratings. (So we need more ELO-type results gathering to improve ELO--what else is new!)

C. You casually skip the step (even though you like to discuss hypothetically improved models) of how you would model human play on a faster CPU--and that is our key point, isn't it, central to our overall CCC debate: Is a program better than a human, and if so, by how much? And what about another program--does it play humans better or worse, relative both to humans and to the other program?
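
On point B, the "smaller standard deviation" claim can be put in rough quantitative terms. This is only a sketch under simplifying assumptions (independent games, draws lumped in with decisive results, illustrative numbers), using a standard error-propagation shortcut rather than anything taken from the posts above.

    import math

    def rating_std_error(p, n):
        """Rough standard error of the rating difference implied by an observed
        score fraction p over n games. Each game is treated as an independent
        trial and draws are not modeled separately (simplifying assumptions).
        The slope is the derivative of the score-to-rating conversion
        d = 400*log10(p/(1-p)) with respect to p."""
        se_score = math.sqrt(p * (1.0 - p) / n)            # SE of the mean score
        slope = 400.0 / math.log(10) / (p * (1.0 - p))     # d(rating)/d(score) at p
        return slope * se_score

    # Quadrupling the number of games roughly halves the uncertainty:
    for n in (25, 100, 400):
        print(n, round(rating_std_error(0.5, n), 1))       # ~69, ~35, ~17 Elo points

More games do shrink the error bars, but only as the square root of the number of games--hence the appetite for faster ways of generating results.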
The beauty of the ELO system is that it was constructed to be a scientific way of establishing meaningful (not perfect!) ratings: unbiased ratings (a key point made by A. Elo) and good predictive ratings (predictive of results). It is that, and very accurate as a generality; and thus far no better alternative has been proven (though one is possible, I grant). The search for a better rating system is a fine one, but it needs to get past the hypothetical stage to a practical examination of how it will work and of the science and math it will use; otherwise it is a dream unfulfilled. And it should be measured against the de facto standard of the times (the ELO system) to show where, when and how it will furnish better ratings and better predictive results.
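
For what that head-to-head measurement could look like in practice, here is one minimal sketch. The scoring rule (a simple log-loss on each game's predicted expected score) and the function names are my own assumptions; any accuracy measure both sides agree on would serve. The point is only that each candidate system predicts every game before it is played and is then graded on the actual results.

    import math

    def log_loss(predicted, actual, eps=1e-9):
        """Penalty for one game: smaller is better. 'predicted' is the expected
        score for the first-listed player, 'actual' is the result (1, 0.5 or 0)."""
        p = min(max(predicted, eps), 1.0 - eps)
        return -(actual * math.log(p) + (1.0 - actual) * math.log(1.0 - p))

    def compare(games, predict_elo, predict_challenger):
        """games: a list of (player_a, player_b, result) tuples. The predict_*
        arguments are hypothetical stand-ins for the two rating systems; each
        maps a pairing to an expected score before the game is played."""
        loss_elo = sum(log_loss(predict_elo(a, b), r) for a, b, r in games)
        loss_new = sum(log_loss(predict_challenger(a, b), r) for a, b, r in games)
        return loss_elo, loss_new

    # Tiny usage example with dummy predictors (purely illustrative):
    games = [("A", "B", 1.0), ("A", "B", 0.5), ("B", "A", 0.0)]
    know_nothing = lambda x, y: 0.5
    print(compare(games, know_nothing, lambda x, y: 0.6))

Whichever system racks up the smaller total loss over a large set of games has earned the claim of better predictive power.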