Author: Stephen A. Boak
Date: 20:00:11 07/29/00
On July 29, 2000 at 18:13:32, Ratko V Tomic wrote:

>>>This model is wrong and does not use the results of the games correctly to get
>>>the best estimate.
>>
>> The model can't be wrong. It does what it purports to do.
>> It does that well.
>
>The statistical model (memoryless random process) which ELO computation
>assumes to undelie the variablity in results is certainly not the
>accurate model for that variablity.

I do not claim that the ELO model is the most accurate. But it is not highly inaccurate; in fact, it is the opposite--it is very accurate. Any scientific statistical model is fine, as long as you have some mathematically meaningful error measures to assess its accuracy. ELO predicts results extremely accurately (not in specific, individual games or handfuls of games, but overall, for thousands of rating pool members, tournament after tournament). A-class players finish above B-class players and below Expert-class players, over and over and over. ELO is self-correcting (over time) since ratings will be adjusted periodically, after a series of recent results. I have always said ELO lags performance, but that is nothing astounding--it works both ways, for rising players as well as falling players. The problem of adjusting ELO ratings to match changing results exists for any rating system, no matter how simple or complex.

>The most accurate model would be
>the replica of the player itself. With chess programs one can make such
>exact model, not with humans.

We are not talking about a model for predicting the exact move a program would make (although you can desire one), but a model for predicting the results of a program (win-loss-draw, ratio of points gained) versus opponents. The computer will play its move (per itself)--just let it play. Test it and you will see. That doesn't determine the result of any particular game with any great accuracy. A game is full of many moves, any one of which might be a loser or a winner, or a choice of indeterminate value. Both humans and computers have evaluation problems and horizon problems. We haven't solved 'chess', you know. Therefore knowing the exact moves a computer will play (assuming it is exhaustively possible) will not in any reasonable manner determine the outcome of a comp-comp match, let alone a comp-human one.

You misunderstand the nature of natural variation, applied to these circumstances, i.e. for testing and measurement of chess skill.

1. All testing involves natural variation. No two tests are exactly the same--ever.

2. This is self-evident in the chess world, where humans perform a bit better or worse than their ratings would predict (and sometimes the same, yes). Humans vary: in choice of openings, in plans in the position, in how much risk to take in what kinds of positions. The notion of a single rating is a bold one, but highly reasonable when natural variation is assumed and accounted for.

3. Comp-human games have variation. Partly due to human variability and the inexactness of assigning a human rating. Partly due to programs that may not play the same move in every identical position (due to nondeterministic programming in some cases, due to varying hash table contents when arriving at similar positions via different move orders, due to the number of program clock cycles--use of thinking time--differing from game to game and move to move, including variance in the time taken by the human to reply).

4. Comp-comp games have variation. Many have learners, crude though they may be, so as not to repeat lines lost previously. It would take many games to exhaust comp-comp variations, since most have large and different books, and varying styles will lead to different byways being taken in the large majority of games (I know there is sometimes an occasional duplicate game).

5. If this variability did not exist, we might not be having these discussions about improved rating systems.
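
Since "predicts results" and "self-correcting" keep coming up, here is the machinery under discussion in a few lines, for readers following along. This is only a minimal sketch of the standard Elo expected-score and update formulas; the K-factor and the ratings in the example are illustrative assumptions, not numbers from this thread.

    # Minimal sketch of the standard Elo expected-score and update rule.
    # The K-factor and the ratings below are illustrative assumptions only.
    def expected_score(r_a, r_b):
        """Expected score of player A against player B."""
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    def update(r_a, r_b, score_a, k=32):
        """Pull A's rating toward A's actual result (1, 0.5 or 0)."""
        return r_a + k * (score_a - expected_score(r_a, r_b))

    # An A-class player (1900) against a B-class player (1700):
    print(expected_score(1900, 1700))   # ~0.76, i.e. roughly 3 points out of 4
    # If the stronger player only draws, the update narrows the gap a little,
    # which is the "self-correcting" behavior described above:
    print(update(1900, 1700, 0.5))      # ~1891.7

The point being defended above is that this small amount of machinery, fed nothing but win-loss-draw results, already predicts aggregate outcomes very well.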
>The statistical model for variability
>used in ELO is the simplest nontrivial statistical model.

Simple statistical models are preferred over complex statistical models when the predictive value is adequate. This is textbook theory--never create a model containing excess factors if there is a simpler one available that is adequate to the task. Of course, we have to determine for ourselves what our task is; then we can select from competing models the one we think accomplishes it best.

>Of course, what you mean by "the model can't be wrong" is that one
>can apply correctly the inaccurate model

Models are not right or wrong. If they have any predictive power, then they have some usefulness. Show me the model that predicts results better than ELO. When you do (not simply by hypothesizing about the possibility of creating one), we surely will undergo comparative testing (using statistics, of course--which we will have to agree on!) to validate the claim. It is possible, of course. But that doesn't mean the ELO system doesn't, by and large and with great accuracy, fulfill its function--make the predictions it was designed to make.

>in the sense of recognizing
>its incorrectness. That doesn't mean it reflects the process it models
>accurately or even well or better than any other model.
>
>The Uri's assertion is that the model ELO uses for mimicking variability
>in results is by no means accurate (corresponding to the actual results)

You bandy the word 'accurate' about as though models are 'right' or 'wrong', 'accurate' or 'inaccurate'. All reasonable models of 'complex behavior and results' will be, a priori, more accurate in some circumstances and less accurate in others. You know that.

>and it isn't even the best one in present day and age.

I don't doubt that other models are possible. I have heard of some and read descriptions of some. I haven't yet seen any math or testing to show another model is 'better' than the basic ELO system (although I freely admit it is hypothetically possible--and may even exist today).

>As he suggested,
>one could in principle write a program which could extract much more
>information from the game (e.g. via analysis and scoring of each ply)
>than ELO model does and be more accurate in predicting results than ELO.

In principle, we can improve any model. In practice, we can improve any model (as a generality). What is new or exciting about this observation? I don't deny it. Program A could play the world's worst endgames--where it would lose to any chess beginner of any level. Yet if the computer plays the opening and middlegame with some skill, it may not always reach an endgame. How does that added information allow better results predictions? It is possible, certainly, but the assessment of how often it will get to a 'lost' endgame must be made. How does one assess that? By playing many complete games against many different opponents and extracting the minimum necessary bits of data--the win-loss-draw results of the games. We are full circle, back to the basics that have to be included. Complex models will have to rely on the basic results data to be developed, tested, and modified. Ratings based on complex models will have to conform (be adjusted) to reality--basic results: win-loss-draw and ratios.

>The ELO model uses about 1.58 bits of info for the entire game. The
>strength analyzer program Uri mentioned would use about 5 bits of info
>per ply, or hundreds of times more info per game about the process
>than the ELO does.

More info does not necessarily mean better predictions. If the gathering or management of the increased info makes the modeling impractical, you have to toss it out. Ruthlessly. Or you will bog down. Designing models for simplicity is a background goal for all modelers.
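
For readers who want to check the information-content figures quoted above, a back-of-the-envelope sketch. The 1.58-bit and 5-bit figures come from the quoted post; the roughly-80-ply game length is my own assumption, for illustration only.

    import math

    # A game result is one of three outcomes (win, draw, loss), so the ELO
    # calculation consumes at most log2(3) bits per game -- the 1.58 quoted above.
    bits_per_game_elo = math.log2(3)             # ~1.585

    # The hypothetical per-ply analyzer: 5 bits per ply = 32 distinct grades
    # per move. Assume a typical game of roughly 80 plies (my assumption).
    plies = 80
    bits_per_game_analyzer = 5 * plies           # 400 bits

    print(bits_per_game_analyzer / bits_per_game_elo)   # ~250x: "hundreds of times more"

Whether that extra information actually buys better predictions of results is exactly the point in dispute.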
>Of course, ELO method is from the pre-computer era, devised to do best
>one can assuming:
>
>1) evaluator need not know anything about the chess
>2) evaluator need not use anything beyond the slide rule or
>   log tables and pencil and paper to compute quickly
>   the ratings and the predictions.

Wrong emphasis, to minimize the power of the ELO method by claiming that it is old-fashioned. These are 'restrictions' of science, both old and new (which you support--I've read many of your postings on many topics): repeatable tests with repeatable conclusions, under controlled circumstances; unbiased measurements (results, not subjective quality grades on individual moves). Unbiased means even an unknowledgeable person could conduct the test according to the established criterion--and get the same result. This is one 'test' of science--that it is repeatable, where the tester is not 'special' or 'biased', causing the results to lean one way or another.

Yes, I know some statistical models use fuzzy information, or management opinion, to improve the modeling. I don't count out that possibility. But this is not always repeatable when we are judging chess moves. As pointed out, GM opinions may differ (this is what a chess game between GMs is all about--different ideas as to the move selection that will optimize the bottom-line result). We have a tendency (understatement) in science to trust unbiased rating estimates, not those that might vary with the abilities of the particular expert (a GM?) whose opinion a model relies on for some of its information.

>If you drop either or both of these upfront restrictions (which are an arbitrary
>historical accident of what technology was available at the time) you can do
>much better in terms of predicting the outcomes from the previous games. A
>trivial example of such improved prediction for comp-comp play would be to run a
>simulation of the programs on a much faster computer than what they would play
>in a competition for which you're trying to predict an outcome. Such model is
>obviously better than ELO rating.

Come on, Ratko, this is indeed a trivial example, and beneath mention! Don't grasp at straws with this banal hypothetical. You can do better than that.

A. You just proposed one argument that counters your own philosophical points. In essence you have just said that to predict results better than ELO (which is based on results), you can collect results faster (by simulating on a faster CPU), and then you will have more accurate ratings. Heh heh!

B. You indicate that if games can be played faster, by computer, then you can arrive at more results and therefore an improved confidence level (a smaller standard deviation) for the ratings. (So we need more ELO-type results gathering to improve ELO--what else is new!)

C. You casually skip the step (even though you like to discuss hypothetically improved models) of how you would model human play on a faster CPU--and that is our key point, isn't it, central to our overall CCC debate: Is a program better than a human, and if so, by how much? And what about another program--does it play humans better or worse, relative both to humans and to the other program?
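
On point B, the "smaller standard deviation" claim can be put in rough quantitative terms. This is only a sketch under simplifying assumptions (independent games, draws lumped in with decisive results, illustrative numbers), using a standard error-propagation shortcut rather than anything taken from the posts above.

    import math

    def rating_std_error(p, n):
        """Rough standard error of the rating difference implied by an observed
        score fraction p over n games. Each game is treated as an independent
        trial and draws are not modeled separately (simplifying assumptions).
        The slope is the derivative of the score-to-rating conversion
        d = 400*log10(p/(1-p)) with respect to p."""
        se_score = math.sqrt(p * (1.0 - p) / n)            # SE of the mean score
        slope = 400.0 / math.log(10) / (p * (1.0 - p))     # d(rating)/d(score) at p
        return slope * se_score

    # Quadrupling the number of games roughly halves the uncertainty:
    for n in (25, 100, 400):
        print(n, round(rating_std_error(0.5, n), 1))       # ~69, ~35, ~17 Elo points

More games do shrink the error bars, but only as the square root of the number of games--hence the appetite for faster ways of generating results.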
The beauty of the ELO system is that it was constructed to be a scientific way of establishing meaningful (not perfect!) ratings: unbiased ratings (a key point made by A. Elo) and good predictive ratings (predictive of results). It is that, and very accurate as a generality; and thus far no better alternative has been proven (though one is possible, I grant). The search for a better rating system is a fine one, but it needs to get past the hypothetical stage to a practical examination of how it will work and of the science and math it will use; otherwise it is a dream unfulfilled. And it should be measured against the de facto standard of the times (the ELO system) to show where, when and how it will furnish better ratings and better predictive results.
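
For what that head-to-head measurement could look like in practice, here is one minimal sketch. The scoring rule (a simple log-loss on each game's predicted expected score) and the function names are my own assumptions; any accuracy measure both sides agree on would serve. The point is only that each candidate system predicts every game before it is played and is then graded on the actual results.

    import math

    def log_loss(predicted, actual, eps=1e-9):
        """Penalty for one game: smaller is better. 'predicted' is the expected
        score for the first-listed player, 'actual' is the result (1, 0.5 or 0)."""
        p = min(max(predicted, eps), 1.0 - eps)
        return -(actual * math.log(p) + (1.0 - actual) * math.log(1.0 - p))

    def compare(games, predict_elo, predict_challenger):
        """games: a list of (player_a, player_b, result) tuples. The predict_*
        arguments are hypothetical stand-ins for the two rating systems; each
        maps a pairing to an expected score before the game is played."""
        loss_elo = sum(log_loss(predict_elo(a, b), r) for a, b, r in games)
        loss_new = sum(log_loss(predict_challenger(a, b), r) for a, b, r in games)
        return loss_elo, loss_new

    # Tiny usage example with dummy predictors (purely illustrative):
    games = [("A", "B", 1.0), ("A", "B", 0.5), ("B", "A", 0.0)]
    know_nothing = lambda x, y: 0.5
    print(compare(games, know_nothing, lambda x, y: 0.6))

Whichever system racks up the smaller total loss over a large set of games has earned the claim of better predictive power.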