Computer Chess Club Archives



Subject: Re: SSDF Computer Rating List

Author: Stephen A. Boak

Date: 22:54:11 11/25/02



On November 25, 2002 at 17:18:38, James T. Walker wrote:

>On November 25, 2002 at 15:51:44, eric guttenberg wrote:
>
>>What you say may make the list incomplete, in that some programs don't
>>get tested on the faster hardware, but that doesn't make it inaccurate.
>>Deep Fritz on 1200 MHz hardware IS a lot stronger than Chessmaster 8000
>>on 450 MHz hardware.
>>
>>eric
>
>Can you prove your statement above that it "doesn't make it inaccurate"?

There is a rule, well known in legal circles: You can't prove a negative.  That
makes your challenge rhetorical only.

I think the statement was 1) an opinion, and 2) based on logic alone.  I believe
it is valid on both bases.

Beyond this (I'm not trying to nitpick or quibble), if you don't read more into
the ratings than the SSDF method provides, you will get more mileage out of the
SSDF ratings with less concern.

>I still believe that computer/computer games exaggerate the difference in chess
>programs' ratings.

From whence comes the 'scale' you internally use to assess that the SSDF scale
is 'exaggerated'?  Which scale is 'right'?  Why is any one scale more 'right'
than another?  I'm just philosophizing--no answer is desired.  These are my own
rhetorical questions.

If the rating given to comp games is based on comp vs comp, then the scale is
simply what it is.  It would still give a relative comparison between
comps--based on the comp-comp scale.

Are you trying to compare the SSDF comp-comp scale with a human-human scale?
Why would you do that, if in fact the scales are different?  Or, more to the
point, why would you want one scale to be exactly the same as the other, when
the pools are entirely different in the first place?

I realize that many chess players throughout the world are familiar with
human-human rating systems, many or most based on the ELO system.  I also think
we typically want to assess the 'true' strength of a comp against ourselves,
i.e. against the human race.  This is how we humans take the measure of
something else--by comparing it against ourselves.

Nothing inherently wrong with this, but it sometimes leads to 'forced fit'
comparison situations that are more ridiculous than the simple observation that
some things are not the same as other things.  Is an Automobile better than a
Human?  By how much?  What is the proper rating scale to compare the two?
[maybe we are talking about simply travelling/running at high speed; maybe we
are talking about how long such a thing lives/lasts; maybe we are talking about
which is more intelligent].

>If that's true then it logically follows that playing one
>computer on 450 MHz vs one on 1200 MHz will also exaggerate the difference even
>more.

I don't see the logic.
I don't see the exaggeration.
You would have to explain your personal scale, first.

It is logical to expect that a faster computer will produce longer & deeper
analysis (more nodes, better evaluation).  If a test is run between a slow
computer & a fast computer, the math used to calculate a rating should take that
into account.  The ELO system does take that into consideration--even if it isn't
the only way, nor a perfect way, of creating relative ratings.

I mean that if one program beats another 1000 to 200 with both running at the
same speed, then the implied relative rating will be different than if the same
program beats the same opponent 1000 to 200 on processors that differ in speed by
a factor of two.
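
To put rough numbers on that (just a sketch, using the standard logistic ELO
expectation and the roughly +50 points per doubling of CPU speed discussed
below--not the SSDF's actual bookkeeping):

import math

def implied_diff(score_pct):
    # rating difference implied by a percentage score (logistic ELO form)
    return 400 * math.log10(score_pct / (1 - score_pct))

p = 1000 / 1200.0                  # 1000 wins to 200 losses -> about 83%
total = implied_diff(p)            # about +280 points overall
software_only = total - 50         # minus ~50 for a 2x faster CPU (rule of thumb below)
print(round(total), round(software_only))   # roughly 280 and 230

In other words, the same 1000-200 score says something different about the
programs themselves once you account for the hardware gap.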

The ELO scale (the SSDF list is based on it, generally speaking) takes into
account the fact that a given result (xxxx vs. yyyy) or score implies an
expectation of results & a relative rating difference that varies depending on
the rating difference of the opponents.

If you beat an opponent 50 rating points below you by a margin of 2 to 1, you
will gain or lose a different number of ELO points than if you beat an equally
rated player by the same margin, or a higher rated player by the same margin.
You see, the 'scale' by which ratings are assigned or moved up & down varies
depending on the initial relative difference of the two opponents.
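
Here is a small sketch of that, using the standard logistic expectation and an
arbitrary K-factor of 16 for illustration (the SSDF's actual calculation details
may differ):

def expected(diff):
    # expected score against an opponent 'diff' points below you
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

K = 16            # illustrative K-factor, not an SSDF value
score = 2 / 3.0   # a 2-to-1 margin, expressed per game

for diff in (50, 0, -50):          # opponent 50 below, equal, 50 above
    print(diff, round(K * (score - expected(diff)), 1))
# roughly: 50 -> +1.5, 0 -> +2.7, -50 -> +3.8 points gained per game

Same margin, different gains--because the expected score was different to begin
with.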

Since a doubling (or whatever) of processor speed is considered to be roughly
equal to +50 points in relative rating, and since most statistical measurement
scales based on the normal bell curve work most accurately when the things being
measured fall close together (toward the center of the scale) rather than far
apart (toward the extremes of the measured population), the ELO method applied
by the SSDF to comp-comp tests is relatively accurate for programs that are
relatively close in strength--even when they are played on CPUs that vary in
speed by a factor of 2, since that is merely an induced delta of approximately
50 points due to the CPU alone.
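
For a sense of scale (again just the standard logistic expectation, not the
SSDF's exact numbers):

def expected(diff):
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

print(expected(50))    # about 0.57: a 2x CPU delta alone stays near the center
print(expected(400))   # about 0.91: big gaps push toward the extreme of the curve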

Did you know that the ELO scale created by Arpad Elo was intentionally designed
around the following principle--that a given point difference between two
opponents, no matter where they fall on the overall rating scale, implies the
same result expectation (probability) of the higher or lower player winning,
drawing or losing?  [perhaps I didn't word this in the best way]
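
In other words (a minimal sketch of the principle, using the common logistic
form of the expectation--Elo's original formulation used the normal
distribution, but the property is the same):

def expected(r_a, r_b):
    # the expectation depends only on the difference r_a - r_b,
    # not on where the two ratings sit on the overall scale
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

print(expected(1500, 1450))   # about 0.57
print(expected(2600, 2550))   # about 0.57 -- same 50-point gap, same expectation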

You would do well to study some statistics (basic texts are fine).  [I'm not
looking down my nose at you.  You don't have to study/know statistics, but if
you did, you might appreciate the math of statistics in a better way, and thus
get more out of certain testing results by understanding the underlying
statistical premises & calculations.]

If you want to compare comp vs comp, then you should compare by doing comp-comp
tests--exactly what the SSDF is doing.  If the resultant scale or relative
differences are not to one's liking, that does not mean they are 'exaggerated'.
They are what they are.  There is no better way to test comp vs comp for
relative ratings than testing by comp-comp games.

I have seen Jeff Sonas's articles pointing out what he says are flaws in
human-human rating systems that are ELO based.  He may be right.  He thinks that
relative human-human ratings are distorted more or less at different places on
the ELO results (ratings) scale.

I grant that he is correct in his opinion--but I don't know if that
automatically makes the ELO system applied to comp-comp games conducted by the
SSDF distorted in the same manner.

In fact, I think the opposite might be true.  The SSDF doesn't *only* play top
programs against top opponents on top-speed CPUs.  This avoids some of the
'elite' bias that Sonas has pointed out in the FIDE application of ELO to
human-human competition.  Thus the distortion of ELO applied to human-human
ratings may be less pronounced when ELO is applied to comp-comp testing (for
example, as done by the SSDF).

>The SSDF is NOT a scientifically validated test.

It is an axiom of proper scientific experimentation that one presents the
description of the test as clearly & thoroughly as possible, then the results,
also as clearly & thoroughly as possible.  Then the test & the results speak
entirely for themselves (no bias added by the tester).

Then the reviewer (other scientist or even a layman) makes up his own mind
whether the results of the test are indicative of something more than the mere
results provide.  The confidence level (opinion level) of the reviewer may vary,
indeed does vary, according to the personal opinions & biases & knowledge of the
reviewer.

A test is *never* scientifically validated to the nth degree.  It may be
repeatable and allow certain inferences to be claimed more or less confidently,
but it is never absolute *proof* nor *proven*.  This is especially true when it
comes to using the results of testing (ratings) to predict the future--no rating
system predicts the future perfectly, nor will one ever be able to do so.

Therefore, back to the question--what scale do you use, or do you want to use,
and how would you pick such a scale to be the 'normative' one against which the
accuracy of another scale (say the SSDF ELO one) is measured?  Picking an
arbitrary scale (not itself *scientifically validated*), i.e. one that isn't
calibrated, can only lead to improper inferences--either wrong inferences or
ones that have the wrong weight (confidence level) attached to them.

If you stretch the inferences, then the confidence level should go down.  If you
remain within the bounds of the test, then you don't read too much into the
results (without dismissing them entirely--after all, data is data!) and you can
have somewhat greater confidence in using the data to make inferences thereafter
(to predict future results, or to assess which program is truly strongest).

>In fact what the other
>poster says may in fact make it more accurate than it is but still not perfect.

>It's not to say that the SSDF is not doing a good job.

SSDF is doing a good job--better than most individuals could ever do--testing
many programs on many platforms against many other combinations of
programs/platforms to achieve relative ratings based on many, many games.

>It's just that maybe it
>could be better with a little organization.

How would you 'organize' SSDF better?

>Jim

Thanks,
--Steve


