Computer Chess Club Archives



Subject: Re: SSDF Computer Rating List

Author: Tony Hedlund

Date: 07:23:54 11/26/02


On November 26, 2002 at 07:40:42, James T. Walker wrote:

>On November 26, 2002 at 01:54:11, Stephen A. Boak wrote:
>
>>On November 25, 2002 at 17:18:38, James T. Walker wrote:
>>
>>>On November 25, 2002 at 15:51:44, eric guttenberg wrote:
>>>
>>>>What you say may make the list incomplete, in that some programs don't
>>>>get tested on the faster hardware, but that doesn't make it inaccurate.
>>>>Deep Fritz on 1200 MHz hardware IS a lot stronger than Chessmaster 8000
>>>>on 450 MHz hardware.
>>>>
>>>>eric
>>>
>>>Can you prove your statement above that it "doesn't make it inaccurate"?
>>
>>There is a rule, well known in legal circles: You can't prove a negative.  That
>>makes your challenge rhetorical only.
>
>Of course it was rhetorical.  What is your point?
>>
>>I think the statement was 1) an opinion, and 2) based on logic alone.  I believe
>>it is valid on both bases.
>>
>I think not.  Where is the logic?
>
>>Beyond this (I'm not trying to nitpick or quibble), if you don't read into the
>>ratings more than the SSDF method provides, you will get more mileage out of the
>>SSDF ratings with less concern.
>>
>Nitpicking is exactly what you are doing by going over any post line by line and
>trying to tear it apart.  Your next sentence is my exact point.

Exactly. "Don't read into the ratings more than the SSDF method provides."
People just don't understand the list.

>Don't read too
>much into the SSDF ratings since they may not be as accurate as many people here
>would like to believe.

Please define "accurate" in this context.

>
>>>I still believe that computer/computer games exaggerate the difference in chess
>>>programs' ratings.
>>
>>From whence comes the 'scale' you internally use to assess that the SSDF scale
>>is 'exaggerated'?  Which scale is 'right'?  Why is any one scale more 'right'
>>than another?  I'm just philosophizing--no answer is desired.  These are my own
>>rhetorical questions.
>
>Rhetorical in your mind maybe but to me it is the crux of the matter.  First of
>all I don't have an internal scale.  Here is a question which is not rhetorical.
> Why has the SSDF made several "adjustments" to their ratings list over the
>years?

To calibrate it to human ratings.

>Second question.  Why has the adjustment always been downward?

Not always. We have done it upwards also.

>Third
>question.  What would the rating of Fritz 7 be today without those adjustments?
>Fourth question.  Why were the adjustments necessary?
>
>>
>>If the rating given to comp games is based on comp vs comp, then the scale is
>>simply what it is.  It would still give a relative comparison between
>>comps--based on the comp-comp scale.
>>
>>Are you trying to compare the SSDF comp-comp scale with a human-human scale?
>>Why would you do that, if in fact the scales are different?  Or, more to the
>>point, why would you want one scale to be exactly the same as the other, when
>>the pools are entirely different in the first place?
>>
>>I realize that many chess players throughout the world are familiar with
>>human-human rating systems, many or most based on the ELO system.  I also think
>>we typically want to assess the 'true' strength of a comp against ourselves,
>>i.e. against the human race.  This is how we humans take the measure of
>>something else--by comparing it against ourselves.
>>
>>Nothing inherently wrong with this, but it sometimes leads to 'forced fit'
>>comparison situations that are more ridiculous than the simple observation that
>>some things are not the same as other things.  Is an Automobile better than a
>>Human?  By how much?  What is the proper rating scale to compare the two?
>>[maybe we are talking about simply travelling/running at high speed; maybe we
>>are talking about how long such a thing lives/lasts; maybe we are talking about
>>which is more intelligent].
>>
>
>>>If that's true then it logically follows that playing one
>>>computer on 450 MHz vs one on 1200 MHz will also exaggerate the difference even
>>>more.
>>
>>I don't see the logic.
>>I don't see the exaggeration.
>>You would have to explain your personal scale, first.
>
>The logic is in the SSDF history.  The adjustments made by SSDF were because of
>the exaggeration.  I don't have a personal scale. (I don't need one.)
>
>>It is logical to expect that a faster computer will produce longer & deeper
>>analysis (more nodes, better evaluation).  If a test is run between a slow
>>computer & a fast computer, the math used to calculate a rating should take that
>>into account.  The ELO system does take that into consideration--even if it isn't
>>the only way, nor a perfect way, of creating relative ratings.
>>
>>I mean that if one computer beats another computer 1000 to 200, using the same
>>computer speeds, then the relative rating will be different than if the same
>>computer beats the same program 1000 to 200 on processors that are different by
>>a factor of two.
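
(A rough illustration of that arithmetic, using the standard logistic Elo
formula -- this is just a sketch in Python, not SSDF's actual calculation:)

    import math

    def implied_elo_diff(score_pct):
        # Invert the Elo expectation E = 1 / (1 + 10**(-d/400))
        # to get the rating difference implied by a score percentage.
        return 400 * math.log10(score_pct / (1 - score_pct))

    # 1000 wins to 200 losses = 1000/1200, about an 83.3% score
    print(round(implied_elo_diff(1000 / 1200)))   # about 280 points

On this model, a 1000-200 score corresponds to roughly a 280-point gap
between the two tested configurations.
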
>>
>>The ELO scale (SSDF is based on this, generally speaking) takes into account
>>the fact that a given result (xxxx vs. yyyy) or score implies an expectation of
>>results & a relative rating difference that varies, depending on the rating
>>difference of the opponents.
>>
>>If you beat an opponent 50 rating points below you by a margin of 2 to 1, you
>>will gain or lose a different amount of ELO points than if you beat an equally
>>rated player by the same margin, or a higher rated player by the same margin.
>>You see, the 'scale' by which ratings are assigned or moved up & down, varies
>>depending on the initial relative difference of the two opponents.
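
(To make that concrete, here is a minimal Python sketch of the textbook Elo
update rule -- the K-factor of 32 and the ratings are made-up illustration
values, not anything SSDF actually uses:)

    def expected_score(my_rating, opp_rating):
        # Standard Elo expectation: depends only on the rating difference
        return 1 / (1 + 10 ** ((opp_rating - my_rating) / 400))

    K = 32            # illustrative K-factor
    score = 2 / 3     # a 2-to-1 winning margin
    for opp in (1950, 2000, 2050):   # opponent 50 below, equal, 50 above
        gain_per_game = K * (score - expected_score(2000, opp))
        print(opp, round(gain_per_game, 2))
    # ~3.05 points per game vs the weaker opponent, 5.33 vs an equal one,
    # ~7.62 vs the stronger one -- same margin, different rating change
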
>>
>>A doubling (or so) of processor speed is considered to be roughly equal to
>>+50 points in relative rating, and statistical measurement scales based on
>>the normal bell curve work most accurately when the things being measured
>>fall close together (toward the center of the scale) rather than far apart
>>(toward the extremes of the measured population).  The ELO method applied by
>>SSDF to comp-comp tests is therefore reasonably accurate for programs that
>>are close in strength, even when played on CPUs that vary in speed by a
>>factor of 2, since that is merely an induced delta of approximately 50
>>points due to CPU alone.
>>
>>Did you know that the ELO scale created by Arpad Elo was designed intentionally
>>by him with the following principle--that a given point difference between two
>>opponents, no matter where they fall on the overall rating scale, means that the
>>result expectation (probability) of the higher or lower player winning, drawing
>>or losing is identical.  [perhaps I didn't word this the best]
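
(That design property is easy to check numerically -- a short sketch, same
caveats as above:)

    def expected_score(r_a, r_b):
        # Elo expectation for A vs B: a function of the difference only
        return 1 / (1 + 10 ** ((r_b - r_a) / 400))

    # The same 50-point gap gives the same expectation anywhere on the scale,
    # which is also why "a doubling of speed ~ +50 points" means roughly a
    # 57% expected score wherever the two programs sit on the list.
    print(expected_score(1600, 1550))   # ~0.571
    print(expected_score(2600, 2550))   # ~0.571, identical by construction
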
>>
>>You would do well to study some statistics (basic texts are fine).  [I'm not
>>looking down my nose at you.  You don't have to study/know statistics, but if
>>you did, you might appreciate the math of statistics in a better way, and thus
>>get more out of certain testing results by understanding the underlying
>>statistical premises & calculations.]
>
>Maybe you should go back to school and do some more studying of your own.
>(Since 1200 MHz vs 450 MHz is not exactly a factor of 2)  See, anybody can be
>nitpicking.  It doesn't take a math major.  I have enough statistics in my
>background to understand the Elo system and some of its weaknesses.  It's not
>necessary to have any math background to form my opinion which is based more on
>the history of the SSDF and their testing procedures.  By the way, you are
>looking down your nose.  I don't think you are looking at me though.  You don't
>know me or anything about me or my background in statistics but since you've
>probably had a couple of classes you assume you know more than anyone else and
>thus your lecture on the Elo system and statistics. (Sorry for the long
>sentence.  I'm not very well educated.  I'm a high school dropout, in fact.)
>
>>
>>If you want to compare comp vs comp, then you should compare by doing comp-comp
>>tests--exactly what the SSDF is doing.  If the resultant scale or relative
>>differences are not to one's liking, that does not mean they are 'exaggerated'.
>>They are what they are.  There is no better way to test comp vs comp for
>>relative ratings than testing by comp-comp games.
>>
>>I have seen Jeff Sonas articles pointing out what he says are the flaws in
>>human-human rating systems that are ELO based.  He may be right.  He thinks that
>>relative human-human ratings are distorted more or less at different places on
>>the ELO results (ratings) scale.
>>
>>I grant that he is correct in his opinion--but I don't know if that
>>automatically makes the ELO system applied to comp-comp games conducted by the
>>SSDF distorted in the same manner.
>>
>Who said it was?
>
>>In fact, I think the opposite might be true.  SSDF doesn't *only* play top
>>programs against top opponents on top speed CPUs.  This avoids some of the
>>'elite' bias that Sonas has pointed out in FIDE ELO application to human-human
>>competition.  Thus the distortion of ELO applied to human-human ratings may be
>>less so when applied to comp-comp testing (for example, as done by the SSDF).
>>
>>>The SSDF is NOT a scientifically validated test.
>>
>>It is an axiom of proper scientific experimentation that one presents the
>>description of the test as clearly & thoroughly as possible, then the results,
>>also as clearly & thoroughly as possible.  Then the test & the results speak
>>entirely for themselves (no bias added by the tester).
>>
>>Then the reviewer (other scientist or even a layman) makes up his own mind
>>whether the results of the test are indicative of something more than the mere
>>results provide.  The confidence level (opinion level) of the reviewer may vary,
>>indeed does vary, according to the personal opinions & biases & knowledge of the
>>reviewer.
>>
>>A test is *never* scientifically validated to the nth degree.  It may be
>>repeatable and allow certain inferences to be more or less confidently claimed,
>>but it is never absolute *proof* nor *proven*.  Especially when it comes to
>>using the results of testing (ratings) to predict the future--no rating system
>>predicts the future perfectly, nor will one ever be able to do so.
>>Therefore, back to the question--what scale do you use, or do you want to use,
>>and how would you pick such a scale to be the 'normative' one against which
>>the accuracy of another scale (say the SSDF ELO one) is measured?
>>Picking an arbitrary scale (not itself *scientifically validated*), i.e. that
>>isn't calibrated, can only lead to improper inferences--either wrong inferences
>>or ones that have the wrong weight (confidence level) attached to them.
>>
>>If you stretch the inferences, then the confidence level should go down.  If you
>>remain within the bounds of the test, then you don't read too much into the
>>results (without dismissing them entirely--after all, data is data!) and your
>>confidence level is a bit greater in using the data to make inferences
>>thereafter (predict future results, or assess which program is truly strongest).
>>
>>>In fact what the other
>>>poster says may in fact make it more accurate than it is but still not perfect.
>>
>>>It's not to say that the SSDF is not doing a good job.
>>
>>SSDF is doing a good job--better than most individuals could ever do--testing
>>many programs on many platforms against many other combinations of
>>programs/platforms to achieve relative ratings based on many, many games.
>>
>>>It's just that maybe it
>>>could be better with a little organization.
>>
>>How would you 'organize' SSDF better?
>
>I would start by telling each tester exactly how and what to test.

Thoralf is doing that.

>I would
>publish every single game so that the results could be verified by anyone
>interested.  I'm sorry but just having a bunch of volunteers doing their "own
>thing" and reporting the results with error bars as if they were a scientific
>fact is not the best way.

And the best way is...?
BTW, the error bars ARE a mathematical fact, are they not?
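
For anyone who wants to check them, here is roughly where they come from
(a simplified Python sketch: it treats every game as an independent win/loss
trial and ignores draws, which in reality shrink the variance a little; the
exact method behind the published margins may differ in detail):

    import math

    def elo_interval(points, games, z=1.96):
        # Approximate 95% confidence interval on the Elo difference
        # implied by a match result (draws ignored in this sketch).
        p = points / games
        se = math.sqrt(p * (1 - p) / games)   # standard error of the score
        to_elo = lambda q: 400 * math.log10(q / (1 - q))
        return to_elo(p - z * se), to_elo(p), to_elo(p + z * se)

    lo, mid, hi = elo_interval(180, 300)      # e.g. a 60% score over 300 games
    print(round(lo), round(mid), round(hi))   # roughly +31 / +70 / +112 Elo
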

>It may be better than anything else we have.  But it
>is still not as good as it could be with some organization.

It couldn't be done without organization.

Tony

>
>>
>>>Jim
>>
>>Thanks,
>>--Steve


