Author: Stephen A. Boak
Date: 22:45:47 04/07/05
On April 07, 2005 at 11:29:03, Daniel Pineo wrote:

>On April 06, 2005 at 19:58:17, Uri Blass wrote:
>
>>take a chess program.
>>
>>Your target is to find the difference in rating between the program and a
>>program that plays random moves.
>
>That's actually a good way to define an elo system that isn't strictly relative
>like the one we have now.

A good way to define an elo system?? Could there be any _worse_ way?!!

[I guess there could. The baseline program might be one that intentionally makes as many losing moves as possible--plays the 'give-away' or 'Loser's' chess variant, or enables helpmates wherever possible.]

Time to study some important books (Elo's description of his rating system, and general statistical theory). Good luck to you and Uri! And a whole lot of time! I think both of you will need the 'help'.

Here are my 'random' thoughts :) Please excuse any confusion they may bring. The suggestion has several _BIG_ problems.

1. All Elo systems are 'relative'. This one attempts to measure strength (relatively!) against a random-move 'program'. IMO, it will largely fail at that goal. Any realistic output will occur only in the most trivial situations--where the measured programs themselves produce random moves much of the time (but not all the time). The results will therefore provide no meaningful measure for typical, stronger programs that never move randomly.

2. The real power of an Elo system to measure the relative strengths of programs rests on testing the program to be rated against many other, already rated members of a player pool. Measuring each program individually against a single baseline player is ridiculous. It may produce a numerical score (and hence a rating figure) for each program played against the baseline random-move generator, but it will in no way accurately rank the programs or predict how they will perform against each other (or against a human). Why? Because each rated program has not been rated against multiple rated players in the same pool, but only against a static, single opponent (the random-move generator), which forms a different--and trivially small--pool of its own.

3. As with all statistically based measurement systems, the measured rating is most accurate when the player plays a range of opponents within, say, one or two standard deviations of his own rating. When a non-randomly-moving program (say, a pure bean-counter) plays a random-moving program, I would guess that the random mover would *never* win a game ... because its strength is so far, far beneath that of virtually every non-random program created with the goal of winning chess games. I could be wrong ... but ....

[I expect Uri to quickly contradict this prediction by pointing out that even a non-randomly-moving program can have a bug in it and lose by freezing up, or by falling into a one-move mate against a random mover. Go ahead, say it! I'm ready for it--that's a trivial observation, undeserving of comment, which does not disprove the underlying point.]

4. I predict that any basic program that generally plays non-random moves (even a simple bean-counting program that merely counts material and does no positional evaluation) will earn an astronomical rating against the random-move generator. It wouldn't surprise me if the top 100 programs currently in use could achieve 1,000-, 10,000-, or even 100,000-to-1 win ratios against the random mover.

5. Let A & B each play R (a random-move program) 100,000 games. Say A scores 99,999 to 1. Say B scores 99,998 to 2. What ratings might the suggested process produce for A & B? Would those ratings in any way predict how well A would play against B? Why or why not? [Answer: a resounding 'No!']
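To put rough numbers on point 5: under the standard logistic Elo model, a score fraction p against a fixed opponent implies a performance gap of D = 400 * log10(p / (1 - p)). Here is a quick sketch in Python (the 400-point logistic formula is the standard Elo one; applying it here is my assumption, since the proposal never says how scores become ratings):

import math

def elo_gap(wins, games):
    # Performance rating gap implied by winning `wins` of `games`
    # under the standard logistic Elo model.
    p = wins / games
    return 400 * math.log10(p / (1 - p))

print(round(elo_gap(99_999, 100_000)))  # A vs. R: ~ +2000
print(round(elo_gap(99_998, 100_000)))  # B vs. R: ~ +1880

So a single extra loss in 100,000 games shifts the 'rating' by some 120 points--pure noise--and neither figure tells you anything about how A would actually score against B.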
>IE define elo=0 to mean totally random play. Then
>define elo=1 to be the strength of a program that loses to elo=0 1 out of 10
>times. Define elo=2 to lose to elo=1 1/10 of the time, etc. So the probability
>of player A losing to player B is 1 in 10^(eloA - eloB)
>
>Then no one would argue about whether a 2600 player of today could beat a 2600
>player of the 1920's because playing strength would always be measured relative
>to the fundamental standard of random play.

6. Wrong. When the fundamental standard is so utterly low, 2600 players will *never* obtain ratings that accurately predict their long-term chances against each other.

7. Let's say 1,000 GMs each play a thousand games against beginning scholastic [human] students who can only make random (legal) moves. Will the students ever score a game? [NOTE--per stipulation, all the random-moving beginning students have their 'learn' functions turned off!] Probably not. :) Why? Because of the gulf between the strength of the thing to be measured (the GM, very high) and the random-mover baseline (very low strength, if any!). To put a figure on the students' chances (individually or collectively): 0.00000000000000001 or worse. [Note what that figure does to Daniel's scale: a loss probability of 10^-17 against elo=0 works out to elo=17, and a measured score of 1,000-0 cannot even distinguish elo=17 from elo=1,700.] So probably each GM wins 1,000 and loses 0 against the random mover. Which GM is the best? Which GM is the worst? Now substitute a random-moving program for the random-moving human beginners. Will the results of the experiment provide any better relative ratings among the programs to be rated?

8. The thought experiment may lead to the idea that measuring new programs against a fixed-strength (but non-random) program *could* potentially help rank the new programs. That is exactly what the SSDF does when it takes the latest program releases and plays them, very carefully, on the same old, slow PCs against the same older (but fixed-strength) programs with already established ratings! If the rating deltas are not too large, the older programs can help establish relative ratings for the new releases. This (usually) avoids the objection of playing opponents of far higher or lower relative strength when generating reasonably accurate relative Elos for new programs.

9. Even better, the SSDF plays new programs against many rated programs in the SSDF pool, including _several_ older programs with established ratings! This avoids the objection of playing only a single (rated) opponent.

10. Let me know when you have enough games played. I'd be happy to compute the Elo ratings for you. What you do with the figures after that is beyond me. I would have no use for such 'ratings'. :)

Regards,
--Steve

>
>Dan Pineo
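P.S. Anyone who wants to test the prediction in point 4 empirically can do it in an afternoon. Below is a minimal sketch--assuming the python-chess library, with a deliberately dumb one-ply material counter standing in for the 'bean-counter'; none of these names or choices come from Uri's or Daniel's posts--that pits the bean-counter against a pure random mover. Whether it reaches 1,000-to-1 I leave to the reader.

import random
import chess  # pip install python-chess -- illustrative, not part of the proposal

VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
          chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def material(board, color):
    # Material balance from `color`'s point of view.
    score = 0
    for piece in board.piece_map().values():
        value = VALUES[piece.piece_type]
        score += value if piece.color == color else -value
    return score

def bean_counter(board):
    # One-ply greedy: play a move that maximizes material after moving,
    # breaking ties at random.
    color = board.turn
    def after(move):
        board.push(move)
        score = material(board, color)
        board.pop()
        return score
    moves = list(board.legal_moves)
    best = max(after(m) for m in moves)
    return random.choice([m for m in moves if after(m) == best])

def random_mover(board):
    return random.choice(list(board.legal_moves))

def play_game(white, black, max_plies=400):
    board = chess.Board()
    while not board.is_game_over(claim_draw=True) and board.ply() < max_plies:
        mover = white if board.turn == chess.WHITE else black
        board.push(mover(board))
    return board.result(claim_draw=True)

results = [play_game(bean_counter, random_mover) for _ in range(100)]
print(results.count("1-0"), "bean-counter wins out of", len(results), "games")

The max_plies cap and the claim_draw flag just keep pathological games finite; crank the game count up well past 100 before reading anything into the ratio.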