Author: James T. Walker
Date: 04:40:42 11/26/02
On November 26, 2002 at 01:54:11, Stephen A. Boak wrote:

>On November 25, 2002 at 17:18:38, James T. Walker wrote:
>
>>On November 25, 2002 at 15:51:44, eric guttenberg wrote:
>>
>>>What you say may make the list incomplete, in that some programs don't
>>>get tested on the faster hardware, but that doesn't make it inaccurate.
>>>Deep Fritz on 1200 MHz hardware IS a lot stronger than Chessmaster 8000
>>>on 450 MHz hardware.
>>>
>>>eric
>>
>>Can you prove your statement above that it "doesn't make it inaccurate"?
>
>There is a rule, well known in legal circles: You can't prove a negative. That
>makes your challenge rhetorical only.

Of course it was rhetorical. What is your point?

>
>I think the statement was 1) an opinion, and 2) based on logic alone. I believe
>it is valid on both bases.
>

I think not. Where is the logic?

>Beyond this (I'm not trying to nitpick or quibble), if you don't read into the
>ratings more than the SSDF method provides, you will get more mileage out of the
>SSDF ratings with less concern.
>

Nitpicking is exactly what you are doing by going over a post line by line and
trying to tear it apart. Your next sentence is my exact point: don't read too
much into the SSDF ratings, since they may not be as accurate as many people
here would like to believe.

>>I still believe that computer/computer games exaggerate the difference in chess
>>programs' ratings.
>
>From whence comes the 'scale' you internally use to assess that the SSDF scale
>is 'exaggerated'? Which scale is 'right'? Why is any one scale more 'right'
>than another? I'm just philosophizing--no answer is desired. These are my own
>rhetorical questions.

Rhetorical in your mind maybe, but to me it is the crux of the matter. First of
all, I don't have an internal scale. Here is a question which is not rhetorical:
why has the SSDF made several "adjustments" to their ratings list over the
years? Second question: why has the adjustment always been downward? Third
question: what would the rating of Fritz 7 be today without those adjustments?
Fourth question: why were the adjustments necessary?

>
>If the rating given to comp games is based on comp vs comp, then the scale is
>simply what it is. It would still give a relative comparison between
>comps--based on the comp-comp scale.
>
>Are you trying to compare the SSDF comp-comp scale with a human-human scale?
>Why would you do that, if in fact the scales are different? Or, more to the
>point, why would you want one scale to be exactly the same as the other, when
>the pools are entirely different in the first place?
>
>I realize that many chess players throughout the world are familiar with
>human-human rating systems, many or most based on the ELO system. I also think
>we typically want to assess the 'true' strength of a comp against ourselves,
>i.e. against the human race. This is how we humans take the measure of
>something else--by comparing it against ourselves.
>
>Nothing inherently wrong with this, but it sometimes leads to 'forced fit'
>comparison situations that are more ridiculous than the simple observation that
>some things are not the same as other things. Is an Automobile better than a
>Human? By how much? What is the proper rating scale to compare the two?
>[maybe we are talking about simply travelling/running at high speed; maybe we
>are talking about how long such a thing lives/lasts; maybe we are talking about
>which is more intelligent].
>
>>If that's true then it logically follows that playing one
>>computer on 450 MHz vs one on 1200 MHz will also exaggerate the difference even
>>more.
>
>I don't see the logic.
>I don't see the exaggeration.
>You would have to explain your personal scale, first.

The logic is in the SSDF history. The adjustments made by the SSDF were because
of the exaggeration. I don't have a personal scale. (I don't need one.)

>It is logical to expect that a faster computer will produce longer & deeper
>analysis (more nodes, better evaluation). If a test is run between a slow
>computer & a fast computer, the math used to calculate a rating should take that
>into account. The ELO system does take that into consideration--even if it isn't
>the only way, nor a perfect way, of creating relative ratings.
>
>I mean that if one computer beats another computer 1000 to 200, using the same
>computer speeds, then the relative rating will be different than if the same
>computer beats the same program 1000 to 200 on processors that are different by
>a factor of two.
>
>The ELO scale (SSDF is based on this, generally speaking) takes into account
>the fact that a given result (xxxx vs. yyyy) or score implies an expectation of
>results & a relative rating difference that varies, depending on the rating
>difference of the opponents.
>
>If you beat an opponent 50 rating points below you by a margin of 2 to 1, you
>will gain or lose a different amount of ELO points than if you beat an equally
>rated player by the same margin, or a higher rated player by the same margin.
>You see, the 'scale' by which ratings are assigned or moved up & down varies
>depending on the initial relative difference of the two opponents.
>
>Since a doubling (or whatever) of processor speed is considered to be roughly
>equal to +50 points in relative rating, and most statistical measurement scales
>based on the normal bell curve work relatively accurately when the things being
>measured fall closer together (toward the center of the scale) rather than
>farther apart (toward the extreme side or sides of the measured population),
>then the ELO method applied by SSDF to comp-comp tests is relatively accurate
>for programs that are relatively close in strength, even when played on CPUs
>that vary in speed by a factor of 2 (since that is merely an induced 50-point
>approx delta due to CPU alone).
>
>Did you know that the ELO scale created by Arpad Elo was designed intentionally
>by him with the following principle--that a given point difference between two
>opponents, no matter where they fall on the overall rating scale, means that the
>result expectation (probability) of the higher or lower player winning, drawing
>or losing is identical? [perhaps I didn't word this the best]
>
>You would do well to study some statistics (basic texts are fine). [I'm not
>looking down my nose at you. You don't have to study/know statistics, but if
>you did, you might appreciate the math of statistics in a better way, and thus
>get more out of certain testing results by understanding the underlying
>statistical premises & calculations.]

Maybe you should go back to school and do some more studying of your own. (Since
1200 MHz vs 450 MHz is not exactly a factor of 2.) See, anybody can nitpick. It
doesn't take a math major. I have enough statistics in my background to
understand the Elo system and some of its weaknesses. It's not necessary to have
any math background to form my opinion, which is based more on the history of
the SSDF and their testing procedures.
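For anyone who wants to check the arithmetic being thrown around here, a minimal
sketch in a few lines of Python (my own illustration, not anything the SSDF
publishes; the function names are made up) of the standard Elo expectation
formula, the rating gap implied by a 1000-to-200 result, and what the rough
"+50 points per doubling of speed" rule of thumb says about 1200 MHz vs 450 MHz:

    import math

    def expected_score(rating_diff):
        # Standard Elo expectation: the score fraction a player is expected
        # to achieve against an opponent rated rating_diff points lower.
        return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

    def performance_diff(score, games):
        # Inverse of the above: the rating gap implied by a match score.
        p = score / games
        return -400.0 * math.log10(1.0 / p - 1.0)

    # A 1000-to-200 result is a score fraction of 5/6, which implies roughly
    # a 280-point gap no matter where the two programs sit on the scale.
    print(round(performance_diff(1000, 1200)))      # about 280

    # The rule of thumb quoted above: one doubling of CPU speed is worth
    # about +50 points.  1200 MHz vs 450 MHz is about 2.67x, i.e. roughly
    # 1.4 doublings -- closer to a 70-point handicap than 50.
    doublings = math.log2(1200 / 450)
    print(round(doublings, 2), round(50 * doublings))

    # A 70-point edge corresponds to an expected score of about 0.60.
    print(round(expected_score(70), 2))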
By the way, you are looking down your nose. I don't think you are looking at me,
though. You don't know me or anything about me or my background in statistics,
but since you've probably had a couple of classes you assume you know more than
anyone else--and thus your lecture on the Elo system and statistics. (Sorry for
the long sentence. I'm not very well educated. I'm a high school dropout, in
fact.)

>
>If you want to compare comp vs comp, then you should compare by doing comp-comp
>tests--exactly what the SSDF is doing. If the resultant scale or relative
>differences are not to one's liking, that does not mean they are 'exaggerated'.
>They are what they are. There is no better way to test comp vs comp for
>relative ratings than testing by comp-comp games.
>
>I have seen Jeff Sonas' articles pointing out what he says are the flaws in
>human-human rating systems that are ELO based. He may be right. He thinks that
>relative human-human ratings are distorted more or less at different places on
>the ELO results (ratings) scale.
>
>I grant that he is correct in his opinion--but I don't know if that
>automatically makes the ELO system applied to comp-comp games conducted by the
>SSDF distorted in the same manner.
>

Who said it was?

>In fact, I think the opposite might be true. SSDF doesn't *only* play top
>programs against top opponents on top speed CPUs. This avoids some of the
>'elite' bias that Sonas has pointed out in FIDE ELO application to human-human
>competition. Thus the distortion of ELO applied to human-human ratings may be
>less when applied to comp-comp testing (for example, as done by the SSDF).
>
>>The SSDF is NOT a scientifically validated test.
>
>It is an axiom of proper scientific experimentation that one presents the
>description of the test as clearly & thoroughly as possible, then the results,
>also as clearly & thoroughly as possible. Then the test & the results speak
>entirely for themselves (no bias added by the tester).
>
>Then the reviewer (another scientist or even a layman) makes up his own mind
>whether the results of the test are indicative of something more than the mere
>results provide. The confidence level (opinion level) of the reviewer may vary,
>indeed does vary, according to the personal opinions & biases & knowledge of the
>reviewer.
>
>A test is *never* scientifically validated to the nth degree. It may be
>repeatable and allow certain inferences to be more or less confidently claimed,
>but it is never absolute *proof* nor *proven*. Especially when it comes to
>using the results of testing (ratings) to predict the future--no rating system
>predicts the future perfectly, nor will one ever be able to do so.
>Therefore, back to the question--what scale do you use, or do you want to use,
>and how would you pick such a scale to be the 'normative' one against which the
>accuracy of another scale (say the SSDF ELO one) is measured?
>Picking an arbitrary scale (not itself *scientifically validated*), i.e. one
>that isn't calibrated, can only lead to improper inferences--either wrong
>inferences or ones that have the wrong weight (confidence level) attached to
>them.
>
>If you stretch the inferences, then the confidence level should go down. If you
>remain within the bounds of the test, then you don't read too much into the
>results (without dismissing them entirely--after all, data is data!)
>and your confidence level is a bit greater in using the data to make inferences
>thereafter (predict future results, or assess which program is truly strongest).
>
>>In fact, what the other poster says may make it more accurate than it is, but
>>still not perfect.
>
>>It's not to say that the SSDF is not doing a good job.
>
>SSDF is doing a good job--better than most individuals could ever do--testing
>many programs on many platforms against many other combinations of
>programs/platforms to achieve relative ratings based on many, many games.
>
>>It's just that maybe it
>>could be better with a little organization.
>
>How would you 'organize' SSDF better?

I would start by telling each tester exactly how and what to test. I would
publish every single game so that the results could be verified by anyone
interested. I'm sorry, but just having a bunch of volunteers doing their "own
thing" and reporting the results with error bars as if they were a scientific
fact is not the best way. It may be better than anything else we have. But it is
still not as good as it could be with some organization.

>
>>Jim
>
>Thanks,
>--Steve
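P.S. On the "error bars" point above: here is a rough sketch (my own
back-of-the-envelope model, treating each game as an independent coin flip and
ignoring draws, which is not necessarily how the SSDF computes its intervals) of
how the margin of error on a measured rating difference shrinks as more games
are played:

    import math

    def elo_diff(p):
        # Rating difference implied by a score fraction p (standard Elo curve).
        return -400.0 * math.log10(1.0 / p - 1.0)

    def error_bar(p, games, z=1.96):
        # Approximate 95% margin of error, in Elo points, on the difference
        # estimated from `games` independent games with score fraction p.
        # Simplistic binomial model: draws and opponent-pool effects ignored.
        se = math.sqrt(p * (1.0 - p) / games)
        lo = max(p - z * se, 0.001)
        hi = min(p + z * se, 0.999)
        return (elo_diff(hi) - elo_diff(lo)) / 2.0

    # The same 65% score gives the same central estimate (about 108 points),
    # but the error bar narrows as the number of games grows.
    for n in (40, 200, 1000):
        print(n, round(elo_diff(0.65)), "+/-", round(error_bar(0.65, n)))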