Author: Stephen A. Boak
Date: 22:45:47 04/07/05
On April 07, 2005 at 11:29:03, Daniel Pineo wrote:

>On April 06, 2005 at 19:58:17, Uri Blass wrote:
>
>>take a chess program.
>>
>>Your target is to find the difference in rating between the program and a
>>program that plays random moves.
>
>That's actually a good way to define an elo system that isn't strictly relative
>like the one we have now.

A good way to define an elo system?? Could there be any _worse_ way?!!

[I guess there could. The baseline program might be one that intentionally makes as many losing moves as possible--plays the 'give-away' or 'Loser's' chess variant, or enables helpmates wherever possible.]

Time to study some important books (Elo's description of his rating system, and general statistical theory). Good luck to you and Uri! And a whole lot of time! I think both of you will need the 'help'.

Here are my 'random' thoughts :) Please excuse any confusion they may bring. The suggestion has several _BIG_ problems.

1. All Elo systems are 'relative'. This one attempts to measure strength (relatively!) against a random-move 'program'. IMO, it will largely fail at that goal. Any realistic output will occur only in the most trivial situations--where the measured programs themselves produce random moves much of the time (but not all the time). The results will therefore provide no meaningful measure for typical, stronger programs that never move randomly.

2. The real power of an Elo system to measure the relative strengths of programs rests on testing the program to be rated against many other, already rated members of a player pool. Measuring each program individually against a single baseline player is ridiculous. It may produce a numerical score (and hence a rating figure) for each program played against the baseline random-move generator, but it will in no way accurately rank the programs or predict how they will perform against each other (or against a human). Why? Because each rated program has not been rated against multiple rated players in the same pool, but only against a static, single opponent (the random-move generator), which forms a different--and trivially small--pool of its own.

3. As with all statistically based measurement systems, the measured rating is most accurate when the player plays a range of opponents within, say, one or two standard deviations of his own rating. When a non-randomly-moving program (say, a pure bean-counter) plays a random-moving program, I would guess that the random mover would *never* win a game ... because its strength is so far, far beneath that of virtually every non-random program created with the goal of winning chess games. I could be wrong ... but ....

[I expect Uri to quickly contradict this prediction by pointing out that even a non-randomly-moving program can have a bug in it and lose by freezing up, or by falling into a one-move mate against a random mover. Go ahead, say it! I'm ready for it--that's a trivial observation, undeserving of comment, which does not disprove the underlying point.]

4. I predict that any basic program that generally plays non-random moves (even a simple bean-counting program that merely counts material and does no positional evaluation) will earn an astronomical rating against the random-move generator. It wouldn't surprise me if the top 100 programs currently in use could achieve 1,000-, 10,000-, or even 100,000-to-1 win ratios against the random mover.

5. Let A & B each play R (a random-move program) 100,000 games. Say A scores 99,999 to 1. Say B scores 99,998 to 2. What ratings might the suggested process produce for A & B? Would those ratings in any way predict how well A would play against B? Why or why not? [Answer: a resounding 'No!']
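To put rough numbers on point 5: under the standard logistic Elo model, a score fraction p against a fixed opponent implies a performance gap of D = 400 * log10(p / (1 - p)). Here is a quick sketch in Python (the 400-point logistic formula is the standard Elo one; applying it here is my assumption, since the proposal never says how scores become ratings):

import math

def elo_gap(wins, games):
    # Performance rating gap implied by winning `wins` of `games`
    # under the standard logistic Elo model.
    p = wins / games
    return 400 * math.log10(p / (1 - p))

print(round(elo_gap(99_999, 100_000)))  # A vs. R: ~ +2000
print(round(elo_gap(99_998, 100_000)))  # B vs. R: ~ +1880

So a single extra loss in 100,000 games shifts the 'rating' by some 120 points--pure noise--and neither figure tells you anything about how A would actually score against B.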
>IE define elo=0 to mean totally random play. Then
>define elo=1 to be the strength of a program that loses to elo=0 1 out of 10
>times. Define elo=2 to lose to elo=1 1/10 of the time, etc. So the probability
>of player A losing to player B is 1 in 10^(eloA - eloB)
>
>Then no one would argue about whether a 2600 player of today could beat a 2600
>player of the 1920's because playing strength would always be measured relative
>to the fundamental standard of random play.

6. Wrong. When the fundamental standard is so utterly low, 2600 players will *never* obtain ratings that accurately predict their long-term chances against each other.

7. Let's say 1,000 GMs each play a thousand games against beginning scholastic [human] students who can only make random (legal) moves. Will the students ever score a game? [NOTE--per stipulation, all the random-moving beginning students have their 'learn' functions turned off!] Probably not. :) Why? Because of the gulf between the strength of the thing to be measured (the GM, very high) and the random-mover baseline (very low strength, if any!). To put a figure on the students' chances (individually or collectively): 0.00000000000000001 or worse. [Note what that figure does to Daniel's scale: a loss probability of 10^-17 against elo=0 works out to elo=17, and a measured score of 1,000-0 cannot even distinguish elo=17 from elo=1,700.] So probably each GM wins 1,000 and loses 0 against the random mover. Which GM is the best? Which GM is the worst? Now substitute a random-moving program for the random-moving human beginners. Will the results of the experiment provide any better relative ratings among the programs to be rated?

8. The thought experiment may lead to the idea that measuring new programs against a fixed-strength (but non-random) program *could* potentially help rank the new programs. That is exactly what the SSDF does when it takes the latest program releases and plays them, very carefully, on the same old, slow PCs against the same older (but fixed-strength) programs with already established ratings! If the rating deltas are not too large, the older programs can help establish relative ratings for the new releases. This (usually) avoids the objection of playing opponents of far higher or lower relative strength when generating reasonably accurate relative Elos for new programs.

9. Even better, the SSDF plays new programs against many rated programs in the SSDF pool, including _several_ older programs with established ratings! This avoids the objection of playing only a single (rated) opponent.

10. Let me know when you have enough games played. I'd be happy to compute the Elo ratings for you. What you do with the figures after that is beyond me. I would have no use for such 'ratings'. :)

Regards,
--Steve

>
>Dan Pineo
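P.S. Anyone who wants to test the prediction in point 4 empirically can do it in an afternoon. Below is a minimal sketch--assuming the python-chess library, with a deliberately dumb one-ply material counter standing in for the 'bean-counter'; none of these names or choices come from Uri's or Daniel's posts--that pits the bean-counter against a pure random mover. Whether it reaches 1,000-to-1 I leave to the reader.

import random
import chess  # pip install python-chess -- illustrative, not part of the proposal

VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
          chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def material(board, color):
    # Material balance from `color`'s point of view.
    score = 0
    for piece in board.piece_map().values():
        value = VALUES[piece.piece_type]
        score += value if piece.color == color else -value
    return score

def bean_counter(board):
    # One-ply greedy: play a move that maximizes material after moving,
    # breaking ties at random.
    color = board.turn
    def after(move):
        board.push(move)
        score = material(board, color)
        board.pop()
        return score
    moves = list(board.legal_moves)
    best = max(after(m) for m in moves)
    return random.choice([m for m in moves if after(m) == best])

def random_mover(board):
    return random.choice(list(board.legal_moves))

def play_game(white, black, max_plies=400):
    board = chess.Board()
    while not board.is_game_over(claim_draw=True) and board.ply() < max_plies:
        mover = white if board.turn == chess.WHITE else black
        board.push(mover(board))
    return board.result(claim_draw=True)

results = [play_game(bean_counter, random_mover) for _ in range(100)]
print(results.count("1-0"), "bean-counter wins out of", len(results), "games")

The max_plies cap and the claim_draw flag just keep pathological games finite; crank the game count up well past 100 before reading anything into the ratio.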