Computer Chess Club Archives



Subject: Re: More philosophy and math discussion regarding different rating methods

Author: Ratko V Tomic

Date: 00:55:55 07/30/00


Most of the objections you brought up arise from apparently different semantics
of the terms "model" or "modeling" we assumed. Being by education a theoretical
physicist, in my usage a model is a mathematical (or computational) procedure
designed to mimic some phenomenon, or aspects of a phenomenon. The fundamental
natural laws used in physics represent the seed, or generator, of the models
"grown out" of these seeds to mimic a particular phenomenon of interest. The
procedure of "growing out" a model (capable of predicting some measurement
results) from the fundamental laws involves applying initial and boundary
conditions to the laws (which are usually in the form of partial differential
equations or integro-differential equations).

Once the model is constructed, it is run forward (via some computation) to find
out what will happen within the model. If the seed was right and was grown
right, the prediction will match the measurement on the actual phenomenon. In
practice this process is iterative, i.e. should the measurement results differ
significantly from the prediction, one goes back, assuming in the first phase
that there was a mismatch between the initial/boundary conditions (which were
applied to the fundamental laws/seed) and the actual initial & boundary
conditions which existed in the tested physical system.

The instruments and setup may be recalibrated here and tested against some
auxiliary model of this sub-experiment. If that fails, one assumes that the
desired initial/boundary conditions are not practically feasible within the
available setup technology, and one can try working with some more feasible
form of initial/boundary conditions. But once the match between the
initial/boundary conditions is established (and verified) and any computational
errors are excluded, any further mismatch between the model prediction and the
experiment implies that the fundamental laws are either incorrect or, more
often, incomplete (i.e. there are additional phenomena involved in the setup
which the laws deployed to create the model didn't capture).

That's a rough sketch illustrating my semantics of the terms "model" and
"modeling." Or briefly, the model of some system is another, scaled-down or
toy, system which mimics to some degree, or in some aspects, the system being
modelled. The motivation for modeling is that it is much cheaper and safer to
run the toy model forward (in time/space) and see what happens (at least
partially) than to do it with the real system and deal with irreversible
consequences.

Now, going to chess performance modeling -- the model here, in the most general
case, is any computational procedure designed to predict the outcome of one or
more games between some set of opponents.

In an extreme case of the "perfect predictor model" for some set of players,
this procedure would predict exactly the outcome of each game. Computer
programs are deterministic finite state machines: a fairly small amount of
input data (of the order of a few MB) specifying the initial/boundary state
(which includes all the program settings, learned data, books, clocks, seeds
for any random number generators, etc.) uniquely determines every move the
program will make next. Of course, there is no practical reason for anyone to
invest time & money in this type of perfect modeling of chess programs, but at
least in principle a perfect predictor could be made, i.e. for comp-comp games
the perfect predictor exists at least in principle.
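To make the point concrete, here is a minimal sketch, with an entirely
hypothetical `engine_move` function standing in for a real engine: once the
full initial/boundary state (position, settings, RNG seed) is fixed, the next
move is fully determined, so a "perfect predictor" is just a replay of the
engine itself.

```python
import random

def engine_move(position, settings, seed):
    """Toy stand-in for a chess engine (purely illustrative, not any
    real engine's API). With all inputs fixed -- position, settings,
    and the RNG seed -- the chosen move is fully determined."""
    rng = random.Random(seed)           # seeded RNG: no hidden nondeterminism
    legal_moves = sorted(position)      # hypothetical legal-move list
    # score each move with a settings-dependent (but reproducible) value
    scores = [(settings["aggression"] * rng.random(), m) for m in legal_moves]
    return max(scores)[1]

# Two "predictions" from an identical initial state agree exactly.
state = ({"e2e4", "d2d4", "g1f3"}, {"aggression": 1.0}, 42)
assert engine_move(*state) == engine_move(*state)
```

Replaying the state machine is the trivial perfect predictor the text
describes; any real engine differs only in the size of the state, not in kind.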

For human-human or human-comp sets of players, we don't know whether the perfect
model exists even in principle. That depends first on whether the _ultimate_
fundamental laws of physics are intrinsically non-deterministic or
deterministic. Although the present quantum theory is non-deterministic, there
may exist an underlying deterministic theory (Bell's theorem notwithstanding)
replicating the statistical predictions of quantum theory.

But even if the ultimate laws are fundamentally non-deterministic, it still may
be the case that the brain processes involved in chess playing (or general
thinking) are sufficiently deterministic in the sense that some finite set of
measurements will allow prediction of the brain's computation result for some
limited time interval into the future. As an example of this type of
practically deterministic process running on top of a more fundamental
non-deterministic one, take a regular digital computer: although its underlying
processes are quantum, and thus non-deterministic within the present
fundamental postulates, it is designed in such a way that a finite set of
initial "measurements" (such as examining the memory & CPU contents)
determines uniquely, and with practical (although not absolute) certainty,
what similar "measurements" (examination of the memory & CPU contents) will
yield at any future instant.

So far the conclusion is that the perfect performance predictor model exists, at
least in principle, for any comp-comp play. It is an open question whether such
perfect predictors exist, even in principle, for human play.

Of course, in the comp-comp case, this perfect model for performance prediction
would predict much more than what ELO aims to predict -- not only some
long-term average performance with a certain probability, but the outcome of
individual games and of each move in each of these games. In any case, this
certainly closes the issue of whether ELO is the best possible model for
predicting the outcome of comp-comp chess games. It is clearly not. One can say
in general that any finite deterministic system (such as comp-comp chess
playing) can be modelled perfectly on a computer, be it in the trivial way via
an exact state-machine replica or in more sophisticated ways (taking advantage,
for example, of shortcuts bypassing any sub-optimality in the state-transition
computation of the original system), so the statistical models in such cases
are merely sub-optimal practical necessities.

Much of your discussion was about imperfect models, the models capable of
predicting the performance only approximately or statistically. The question of
interest is whether the ELO model is the best one "practically" possible with
present-day technology. Although this is a somewhat fuzzy question (e.g. what
is "practical," and "practical" for whom), I think from Uri's and my earlier
comments it should be clear that the ELO predictor can be improved in many
ways, by including additional information as input to the predictor. The ELO
model takes as its input very little of the information about past games that
is practically (and very easily) available. In particular, the plies of the
game are certainly available and reveal more about the players' strengths (and
weaknesses) than the mere result. In my initial post I offered a rough
comparison of how much more information is easily available this way: ELO
utilizes (at most) 1.58 bits of information from the entire game, while the
plies alone of a typical game contain a couple of hundred times more
information (about 5 bits per ply). At least part of this extra information
will have predictive value.
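The numbers above can be checked directly. As a rough illustration this
assumes about 32 legal moves per position (hence ≈5 bits per ply) and a
typical game of 80 plies; both are ballpark assumptions, not measurements.

```python
from math import log2

# Information available per game under the two kinds of input:
result_bits = log2(3)    # win/draw/loss: ~1.58 bits, all ELO ever sees
bits_per_ply = log2(32)  # = 5 bits, assuming ~32 legal moves per position
plies = 80               # a typical game of ~40 moves per side (assumption)
game_bits = plies * bits_per_ply

print(round(result_bits, 2))    # ~1.58
print(game_bits / result_bits)  # a couple of hundred times more
```

The ratio comes out around 250, consistent with the "couple of hundred times"
estimate; of course only some fraction of those bits carries predictive value.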

An example of a simple improvement would be to attribute a rating not to each
player alone, but to each pair of players. For example, if a pair of players A
and B have played some number of games in the past, one could create a pair
ELO difference, D(A,B). Before A and B have ever played each other, D(A,B)
could simply be Elo(A)-Elo(B). But once they have played each other, the
results of those games could be weighed more heavily (how much more heavily
one would have to find out empirically) than their results against others,
resulting in a pair of numbers Elo(A,+B) and Elo(B,+A), with D(A,B) the
difference of these numbers (specific to this pair of players). For an
N-player pool the resulting predictor model would contain not N ratings, as
ELO does, but N*(N-1)/2 pair ratings.
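A minimal sketch of this pair-rating idea follows. The blending weight `w`
and the clamping constant are purely illustrative assumptions; as noted above,
the right weighting would have to be found empirically.

```python
from math import log10

def pair_diff(elo, head_to_head, a, b, w=3.0):
    """Pair-specific rating difference D(A,B).

    Falls back to Elo(A) - Elo(B) when A and B have never met;
    otherwise blends in their mutual score, weighted w times as
    heavily per game as the generic ratings (w is an illustrative
    guess, not an empirically fitted value)."""
    base = elo[a] - elo[b]
    games = head_to_head.get((a, b))    # list of A's scores vs B: 1, 0.5, 0
    if not games:
        return base
    score = sum(games) / len(games)     # A's average score against B
    eps = min(max(score, 0.01), 0.99)   # clamp to keep the log finite
    # invert the Elo logistic to express the mutual score as a rating gap
    mutual = 400 * log10(eps / (1 - eps))
    n = len(games)
    return (base + w * n * mutual) / (1 + w * n)

elo = {"A": 2600, "B": 2500}
print(pair_diff(elo, {}, "A", "B"))                    # no mutual games: 100
print(pair_diff(elo, {("A", "B"): [1, 1]}, "A", "B"))  # pulled above 100
```

The point is only the structure: the pair rating reduces to the plain Elo
difference with no head-to-head data and drifts toward the mutual result as
such data accumulates.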

If one wishes to order the players into a one-dimensional ranking list, it
could be computed from the pair ratings by first computing the predicted
scores among all pairs, treating them as if a full round-robin tournament had
been played with those predicted scores as the results, and then applying the
usual tournament rules for tie-breakers.
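As a sketch of that ranking step, assuming pair-specific differences D are
already available (the values below are made up for illustration), one can sum
each player's predicted round-robin scores and sort, just as ordinary
tournament standings would:

```python
def expected_score(diff):
    """Standard Elo expectation for the higher-listed player,
    given a rating difference in Elo points."""
    return 1 / (1 + 10 ** (-diff / 400))

def rank_from_pair_diffs(players, D):
    """Rank players by simulating a full round-robin with the
    pair-specific differences D[(a, b)] and summing each player's
    predicted scores (tie-breakers omitted for brevity)."""
    totals = {p: 0.0 for p in players}
    for i, a in enumerate(players):
        for b in players[i + 1:]:
            ea = expected_score(D[(a, b)])
            totals[a] += ea
            totals[b] += 1 - ea
    return sorted(players, key=totals.get, reverse=True)

# Made-up pair differences: A is 100 points "above" B head-to-head,
# but 50 below C; B is 200 below C.
D = {("A", "B"): 100, ("A", "C"): -50, ("B", "C"): -200}
print(rank_from_pair_diffs(["A", "B", "C"], D))  # → ['C', 'A', 'B']
```

Note that pair ratings need not be transitive, which is exactly why the
round-robin simulation (rather than a single scalar) is needed to produce a
linear order.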

A further refinement of the pair-rating method would be to use additional
classifications of all players by playing style and favorite openings (with
some generic classification value for "unknown"). Then for a given player pair
A and B, we would also have style attributes for each, as vectors S(A) and
S(B), as well as all the past results of A or B against all other players Z
with styles S(Z). One could then use all the past results of A, but weigh them
(for the purpose of the pair-specific A-B rating) so that the result of A
versus Z counts with greater weight the more similar the style of Z is to the
style of B (and analogously for B). Of course, the operational meaning of
"more similar" would have to be determined empirically, as some form of
distance function between the "feature vectors" S(Z) and S(B). The weight
would be maximal when the distance is 0 (i.e. when S(Z)==S(B), a special case
of which is Z==B).
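A toy version of such a distance-based weight: the two-component style vectors
and the exponential decay (and its scale) are made-up illustrations, since the
text leaves the distance function to be determined empirically.

```python
import math

def style_weight(s_z, s_b, scale=1.0):
    """Weight for a past result of A vs Z when predicting A vs B:
    maximal (1.0) when Z's style vector equals B's, decaying with
    the Euclidean distance between them. The exponential form and
    the decay scale are illustrative assumptions, not fitted values."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(s_z, s_b)))
    return math.exp(-dist / scale)

# Hypothetical style vectors, e.g. (tactical, positional) scores in [0, 1].
S = {"B": (0.9, 0.1), "Z1": (0.9, 0.1), "Z2": (0.1, 0.9)}
print(style_weight(S["Z1"], S["B"]))  # same style as B: weight 1.0
print(style_weight(S["Z2"], S["B"]))  # opposite style: much smaller
```

With such weights, A's pair-specific rating against B becomes a weighted
average over all of A's past games, dominated by games against B-like
opponents.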

Therefore, this kind of model would be a generalization of the previous one.
Either model would be a better predictor of a player's performance than the
one-dimensional ELO, since it is common knowledge that a player's performance
is sensitive to the type of opponent. While in the pre-computer era this type
of pair modeling was impractical, today it is perfectly feasible. The style
vectors S() could be extracted by a program from the plies of the games via
suitable evaluation functions, and the opening repertoire would be trivially
available.

As to why we still use ELO when predictions could be made more accurate --
there is no incentive, financial or otherwise, for a stand-alone performance
predictor beyond ELO. Even ELO is overkill, since most chess games played in
the world are "non-rated" or "unofficial." If there were substantial money to
be made from predicting player performance better than ELO (say, some
high-stakes betting), there would be any number of sophisticated models, as
there are, for example, for stock-market prediction. Chess is only a game,
with relatively small financial stakes involved for the players, and even
smaller stakes in performance prediction.

Now, the software companies making chess programs do have some incentive to
streamline their testing against the market competitors. The tuning against
the chief competitors, be it via killer books or parameter adjustment,
utilizes ad hoc models (human-enhanced methods) which go beyond ELO in
performance-prediction quality. The companies may hire strong chess players as
advisors, to examine closely the games against the chief opponents and suggest
what stylistic changes (e.g. parameter adjustments), opening variations, or
endgame techniques could improve the program's performance against a given
opponent. Even amateur programmers use whatever chess knowledge they may have
to tune their programs after observing the games ply by ply.

In either case, if all one took into account here were the mere final game
results (the ELO-only predictor), the tuning process would be orders of
magnitude slower, since many games would have to be played with all possible
settings to find the optimal one. A human player examining the games may,
after a mere handful of games, advise an adjustment to king safety or pawn
structure, or the addition of some endgame technique... which will, with high
probability, improve the play much more quickly than blind trial and error
with only ELO used (I'll label this method "ELO-only blundering").

But this whole cycle of playing the program under human supervision and input,
then tuning the program's parameters, does represent a more sophisticated (and
more effective) model for predicting the program's performance than plain ELO.
The plain-ELO-based cycle would operate as a blind parameter adjustment
followed by a large number of games, followed by the next adjustment, followed
by a large number of games,... blundering blindly through the vast
many-dimensional space of program parameters (which also includes opening
variation choices and endgame techniques, in addition to narrow
evaluation-function parameters). The human "tuner" will often take shortcuts
toward better performance through this parameter space by using the ply-by-ply
information and his chess knowledge, all of that input completely invisible to
the ELO model.

By suggesting improvements, the "tuner" is actually making a prediction about
the performance of the adjusted program, saying, for example, that the new
program -- the current program plus these opening lines removed or added, or
this endgame technique or tablebase added -- will perform better than the
previous one. And for many types of tuning suggestions (especially those
dealing with obvious holes in the program's knowledge) this type of prediction
will be correct; the new program will play better. The ELO model can either
give the "new" program the default "unknown" rating (if it is treated as
"new") or treat it as the "old" one and leave the rating unchanged. In either
case, the plain ELO model will mostly perform worse as a predictor of the
"tuned" program than the (computer-assisted) human tuner will.


Revisiting now the human-comp or human-human modeling, the role of performance
"tuner" for a human player is played by a chess coach and by the player
himself. Again, the same reasoning as for program tuning applies -- a coach
(or the player using books & programs) may suggest any number of changes to
his play which would, with good odds, improve it. If the "tuning" were left to
the plain ELO model, the player would have to blindly try some change, then
play many games, then check the ELO, then try other changes, then play many
games,... etc. And all throughout this blundering one would keep changes which
increased the ELO and back off from changes which decreased it. Nobody would
ever move ahead this way. It is precisely the intelligent use of the great
deal of additional information produced in the player's past games (beyond
just the final game result) which makes improvement possible within a human
lifetime. And "intelligent use" which greatly streamlines the ELO-only
blundering means being able to predict performance (in the form of the change
of performance from the present one) better than plain ELO could. The ELO
model in fact doesn't predict a change in performance as correlating with
anything in particular.

Since most humans can be (self-)coached to improve, it means that any
successful coaching method which performs better than the random ELO-only
blundering necessarily includes in its cycle a performance predictor which can
correlate a multitude of factors with performance (in the form of a
performance change correlating with some change in play or reasoning), factors
which are completely invisible to the plain ELO model.


In conclusion, this demonstrates that better-than-ELO models for predicting
chess performance, be it for human or for computer players, do exist and are
actually used all the time as a necessary (albeit not commonly emphasized
under the "predictor" label) cog of tuning/training procedures. If there were
some significant incentive in pure prediction quality, decoupled from the
playing-improvement objectives, the same methods would have been reformulated
or extracted into a stand-alone performance predictor substantially
outperforming ELO (as they do when used as a cog).



