Computer Chess Club Archives


Subject: Comparing Rating Lists: Method & ELO curve vs Normal Bell curve Caveat

Author: Stephen A. Boak

Date: 23:51:18 01/05/06



>>I scaled the ratings so that every list had the same average as Kurt's list
>>when considering only engines present in both. I could put up the excel file I
>>used if you want.
>>It may be that for some lists the overlap with Kurt's list is small, so the
>>scaling is skewed. What method would you suggest using ?
>>
>>Best Regards
>>Maurizio
>
>I probably would have tried scaling the ratings like you did. It's strange that
>you used Kurt's list to scale all the results and yet that's the one that looks
>low to me! Maybe minimizing the "sum of squares" would be better than simply
>making the averages the same. I'd be happy to take a look at your excel file and
>let you know if I can think of anything better. In the meanwhile, I'll take a
>closer look at the chart to see if I'm missing something.
>
>Best Regards,
>Jeff

In my opinion, a bell-shaped curve is a bell-shaped curve (no fundamental
difference, mathematically, in the shapes of the curves).  This is similar
(conceptually) to the fact that any parabola is the same as any other parabola,
if you understand that principle: scaling, translation and rotation are
trivial differences which may be normalized easily by applying simple,
appropriate mathematical formulas.

Assuming chess programs in any particular pool have ELO ratings that do fall on
a bell-shaped curve (to the first order), then to compare the bell-shaped ELO
curves of two distinct pools (which have many overlapping members; even if their
books or testing setups are somewhat different), I would do the following:

1. Determine the individual bell-shaped curves within POOL A & POOL B solely for
the subset of overlapping members which exist both in POOL A and POOL B.

2. Normalize the centers of such bell-curves by adding or subtracting the
difference in means.

[NOTE--Pegging a single program (say Shredder 9 UCI) to an arbitrary starting
point (say 2750 rating) *could* be part of a useful normalization methodology,
somewhat equivalent to Step 2 under certain circumstances, but it should *not*
be adopted without first performing Step 3, below.

This is because the widths of the two curves are just as important to normalize
as their centers (means).  Aligning the rating of a single program handles
neither the necessary means (centering) adjustment nor the necessary width
(spreading) adjustment.]

3. Normalize the width (spread, in my vernacular) of the bell-curves by an
appropriate adjustment related to the difference in standard deviations of the
two distinct curves.  In practice, if one curve had SD = 200 ELO points and the
other curve had SD = 100 ELO points, then the ratio of SDs is 2-to-1.  The wider
curve should be appropriately narrowed by re-plotting the ELOs on both sides of
the wide curve to 1/2 the original distances from the mean.  Or, conversely, the
narrow curve should be appropriately widened by re-plotting the ELOs on both
sides of the narrow curve to twice the original distances from the mean.

[I think Maurizio applied Step 2, above, more or less; but I don't believe that
he has tackled Step 3 thus far, which, IMO, would be a big, worthwhile
improvement.]

ADDENDUM--It is possible that a least-squares measure could be applied to the
two normalized curves for overlapping members in POOL A and POOL B, in order to
tweak normalization parameters 2 & 3, above, to achieve a slightly improved
'best fit' (if desired and worth the trouble).  This curve fitting could be
easily done in a simple looping program, written in BASIC or virtually any handy
computer language.

4. For the non-overlapping members of POOL A, re-plot those ELOs using the same
normalization parameters (means adjust & spread adjust) applied to the
overlapping members present in POOL A.  Do the same for non-overlapping members
from POOL B.
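
Steps 1 through 4 can be sketched in a few lines of Python.  This is only a
minimal illustration, assuming each pool is a simple name-to-rating table; the
engine names, the ratings and the function names fit_mean_sd and
to_pool_a_scale are all made up for the example and do not come from any real
list.

```python
from statistics import mean, pstdev


def fit_mean_sd(pool_a, pool_b):
    """Steps 1-3: estimate the spread ratio and both means from engines
    that appear in BOTH pools (hypothetical helper, not from the post)."""
    common = sorted(set(pool_a) & set(pool_b))
    if len(common) < 2:
        raise ValueError("need at least two overlapping engines")
    a = [pool_a[name] for name in common]
    b = [pool_b[name] for name in common]
    sd_a, sd_b = pstdev(a), pstdev(b)
    if sd_b == 0:
        raise ValueError("overlapping engines in POOL B all have the same rating")
    return sd_a / sd_b, mean(a), mean(b)


def to_pool_a_scale(rating_b, scale, mean_a, mean_b):
    """Re-plot a POOL B rating: shrink or stretch its distance from the
    POOL B mean, then re-center it on the POOL A mean."""
    return mean_a + scale * (rating_b - mean_b)


# Made-up ratings, purely for illustration.
pool_a = {"Engine1": 2700, "Engine2": 2600, "Engine3": 2500, "Engine4": 2400}
pool_b = {"Engine1": 2900, "Engine2": 2700, "Engine3": 2500, "Engine5": 2300}

scale, mean_a, mean_b = fit_mean_sd(pool_a, pool_b)

# Step 4: the same parameters are applied to every POOL B member,
# including Engine5, which POOL A never tested.
converted = {
    name: to_pool_a_scale(rating, scale, mean_a, mean_b)
    for name, rating in pool_b.items()
}
```

In this toy data the overlapping engines are twice as spread out in POOL B as
in POOL A, so every POOL B rating is pulled to half its distance from the POOL
B mean before being re-centered.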

Voila!  We now have normalized bell-shaped curves, suitable for inspection and
comparison and use in translating ELO ratings from POOL A to POOL B, and
applicable to both overlapping members (to a high degree of accuracy) and
non-overlapping members (to a fairly high degree of accuracy, though perhaps not
as accurate as for overlapping members).
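
The least-squares idea from the ADDENDUM above does not strictly need a looping
program: for a straight-line map from POOL B ratings to POOL A ratings, the
minimum has a closed form.  The sketch below assumes that straight-line form;
least_squares_map and the rating pairs are hypothetical, chosen only to show
the arithmetic.

```python
def least_squares_map(pairs):
    """Fit rating_a ~ slope * rating_b + intercept over overlapping engines.

    pairs: list of (rating_in_pool_b, rating_in_pool_a) tuples.
    Minimizes the sum of squared differences between converted POOL B
    ratings and the actual POOL A ratings.
    """
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two overlapping engines")
    mean_b = sum(b for b, _ in pairs) / n
    mean_a = sum(a for _, a in pairs) / n
    sxx = sum((b - mean_b) ** 2 for b, _ in pairs)
    sxy = sum((b - mean_b) * (a - mean_a) for b, a in pairs)
    if sxx == 0:
        raise ValueError("POOL B ratings are all identical")
    slope = sxy / sxx
    intercept = mean_a - slope * mean_b
    return slope, intercept


def sum_of_squares(pairs, slope, intercept):
    """Total squared error of a given straight-line conversion."""
    return sum((a - (slope * b + intercept)) ** 2 for b, a in pairs)


# Made-up overlapping engines: (POOL B rating, POOL A rating).
pairs = [(2900, 2710), (2700, 2590), (2500, 2500)]
slope, intercept = least_squares_map(pairs)
best_error = sum_of_squares(pairs, slope, intercept)
```

Note that the least-squares slope equals the SD ratio of Step 3 multiplied by
the correlation between the two lists, so it comes out slightly smaller than
the plain SD ratio whenever the lists do not agree perfectly.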

5. CAVEAT--An ELO curve is a special version of a bell-shaped curve.  It is
designed & applied so that equal ELO deltas, located anywhere on the curve,
represent equal relative strengths, which may be expressed also in equal
relative probabilities of scoring.

Example:  PLAYER E1 and PLAYER E2 are Experts 100 ELO points apart, at 2000 and
2100 spots on the overall scale.  PLAYER M1 AND PLAYER M2 are Masters 100 ELO
points apart, at 2200 and 2300 on the overall ELO scale.  PLAYER G1 and PLAYER
G2 are Grandmasters, also 100 ELO points apart, but at 2550 and 2650 on the
overall ELO scale.

Under the ELO curve designed by Professor Arpad Elo, E2 would score about 64%
versus E1, and M2 would score about 64% versus M1, and G2 would score about 64%
against G1.  The relative strengths of two players, expressed as a scoring
expectancy or probability, would be similar for each pair of players separated
by the exact same number of ELO rating points, no matter where the pair of
players is located on the overall ELO scale.
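
The example can be checked numerically.  The sketch below uses two commonly
quoted forms of the expectancy curve: the logistic form with a 400-point
divisor, and a normal-distribution form that assumes each player's performance
has an SD of 200 points.  Both forms are assumptions made for illustration;
the post itself does not commit to a formula.

```python
import math


def expected_score_logistic(diff):
    """Expected score for the higher-rated player, logistic form."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def expected_score_normal(diff, sd_per_player=200.0):
    """Expected score, normal form.

    The difference of two independent performances, each with SD
    sd_per_player, has SD sd_per_player * sqrt(2).
    """
    z = diff / (sd_per_player * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# The three pairs from the example: Experts, Masters, Grandmasters.
pairs = {"E": (2000, 2100), "M": (2200, 2300), "G": (2550, 2650)}

logistic = {k: expected_score_logistic(hi - lo) for k, (lo, hi) in pairs.items()}
normal = {k: expected_score_normal(hi - lo) for k, (lo, hi) in pairs.items()}
```

Only the rating difference enters either formula, so all three pairs get the
same expectancy, close to 0.64 for a 100-point gap under both forms.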

One would have to think carefully about the mathematics of this distinction
between the normal bell-curve and the ELO curve.  Some additional mathematics,
perhaps related to logarithms, might be required in above Step 2 and Step 3
normalization routines.  Mathematicians (dare I say most?) could easily confirm
the slight adjustment this would require in the normalization algorithms
mentioned above.  I suspect it would relate to a comparison of the ELO scale
(difference in rating) versus relative proportional areas (representing relative
probabilities) under the ELO curve.  I understand this adjustment intuitively,
but would have to work out the precise formulas by playing with the data and
comparing the ELO scale with a normal bell-curve scale to pinpoint the
mathematical differences in curve 'shapes', areas, SDs, and probabilities.
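
One way to make this caveat concrete, assuming the logistic form of the
expectancy (an assumption, not a formula worked out in this post): a Step 3
rescaling keeps equal gaps equal, but it changes the scoring probability that
each gap implies.  The 0.5 scale factor below is hypothetical.

```python
import math


def expected_score(diff):
    """Expected score for the higher-rated player, logistic form."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def rating_diff_from_score(p):
    """Inverse of expected_score: the logarithm that turns a scoring
    probability back into a rating difference."""
    if not 0.0 < p < 1.0:
        raise ValueError("score fraction must be strictly between 0 and 1")
    return 400.0 * math.log10(p / (1.0 - p))


gap_in_pool_b = 100.0
scale = 0.5  # hypothetical Step 3 spread factor
gap_in_pool_a = scale * gap_in_pool_b

p_before = expected_score(gap_in_pool_b)  # what 100 points means in POOL B
p_after = expected_score(gap_in_pool_a)   # what the converted 50 points implies
```

Here a 100-point gap (about 0.64) becomes a 50-point gap (about 0.57).  Under
this reading, the converted ratings place the engines correctly relative to
one another, but the gaps no longer predict the scores actually observed in
POOL B unless the scale factor is close to 1.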

6. Do not misinterpret my comments.  I did not say that ELO ratings are perfect
and non-variable.

Nor did I say that ELO ratings are always measurable with a great deal of
relative accuracy.

7. Per Arpad Elo, rating accuracy is best obtained by having a player play a
moderate number (a huge number is not necessary) of other members of a rating
pool, opponents whose ratings are *varied* [above and below the strength of the
player to be rated].

8. EXAMPLE--to establish initial ratings for unrated B1 and B2 players, and thus
to determine which is stronger, and by how much, relatively in POOL A, when both
are of roughly Category B strength--i.e. well below Expert, say 1600 to 1799 ELO
range--it would not work to have both of them play 1) solely each other; or 2)
even the same group of Grandmaster opponents ... even if a series of 20, 50 or
even 100 or 1000 games was played.

Both such initial rating methodologies have serious flaws.

A. Testing by playing only each other might establish a relative B1 & B2 single
pair ELO (strength) difference, but that figure would be of zero use to predict,
on the average, how either would fare against a random set of other members in
POOL A.

B. Testing B-class players only against an identical group of GMs, even in an
extremely long match, would show that each of them is far weaker than the GMs,
but those results would be of little use (even if not quite zero use) to
predict, on the average, how either would fare against a random set of other
members in POOL A.

Don't forget, the ELO system is designed to measure relative strengths within an
overall POOL OF OTHER PLAYERS, not relative just to a single specific opponent
or a single tournament game (although the underlying statistical rating basis is
normally quite accurate in the long haul).

Instead, it would be far better if B1 and B2 played a moderate number of
opponents (say 25 or so) of varied ratings (across the ELO scale).

Ideally, to initially rate B1 & B2 players fairly accurately, in POOL A, their
moderate number (say 25 or so) of initial opponents should have established ELO
ratings that fall within a suitable rating range, say, of approximately 1200 to
2200 ELO, even if not identical opponents for B1 and B2, in order to establish
their individual but relative initial ELO ratings & strengths ... WITHIN POOL A.
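
A rough numerical illustration of points 7 and 8, assuming the logistic
expectancy and a "performance rating" found by bisection (a common approach,
not necessarily the procedure Elo himself used).  The function names, the
opponent ratings and the scores are all invented for the example.

```python
def expected_score(diff):
    """Expected score for a player rated `diff` points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def performance_rating(opponent_ratings, total_score, lo=0.0, hi=4000.0, tol=0.01):
    """Find the rating whose total expected score against these opponents
    matches the score actually achieved (bisection on a monotone function)."""
    n = len(opponent_ratings)
    if not 0 < total_score < n:
        raise ValueError("score must be strictly between 0 and the number of games")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        expected = sum(expected_score(mid - r) for r in opponent_ratings)
        if expected < total_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Varied opposition from 1200 to 2200, as suggested in the text.
varied = [1200, 1400, 1600, 1700, 1800, 2000, 2200]
rating_varied = performance_rating(varied, 3.5)

# Flaw B: 100 games against 2600-rated opposition only.
gm_only = [2600] * 100
rating_one_draw = performance_rating(gm_only, 0.5)   # a single draw
rating_two_draws = performance_rating(gm_only, 1.0)  # two draws
```

Against the varied field, an even score lands in the middle of the field.
Against GM-only opposition, a single extra draw in 100 games moves the
estimate by more than 100 points, which is why such results say so little
about B1 versus B2.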

9. It is my belief, at the risk of receiving flaming arrow shots from all
quarters, that the above Method of Comparing Rating Lists would work rather well
to compare SSDF or other Comp-Comp rating lists with Human-Comp rating lists or
ratings derived from a number of unrelated Human-Comp match results (even if
those are separate matches, played against many older program versions, across
several years).

More on this idea at another time ... if I survive long enough to further
address the issue.  :)

10. I hope my fire-proof kevlar vests arrive on time.  I may have to go with the
layered look, despite some unseasonably warm weather of late.  Of course the
recent heat is nothing compared to the near term future weather prediction I
have made for myself.  ;)

--Steve



Last modified: Thu, 15 Apr 21 08:11:13 -0700
