Author: Tony Hedlund
Date: 07:23:54 11/26/02
On November 26, 2002 at 07:40:42, James T. Walker wrote:

>On November 26, 2002 at 01:54:11, Stephen A. Boak wrote:
>
>>On November 25, 2002 at 17:18:38, James T. Walker wrote:
>>
>>>On November 25, 2002 at 15:51:44, eric guttenberg wrote:
>>>
>>>>What you say may make the list incomplete, in that some programs don't get tested on the faster hardware, but that doesn't make it inaccurate. Deep Fritz on 1200 MHz hardware IS a lot stronger than Chessmaster 8000 on 450 MHz hardware.
>>>>
>>>>eric
>>>
>>>Can you prove your statement above that it "doesn't make it inaccurate"?
>>
>>There is a rule, well known in legal circles: you can't prove a negative. That makes your challenge rhetorical only.
>
>Of course it was rhetorical. What is your point?
>
>>I think the statement was 1) an opinion, and 2) based on logic alone. I believe it is valid on both bases.
>
>I think not. Where is the logic?
>
>>Beyond this (I'm not trying to nitpick or quibble), if you don't read into the ratings more than the SSDF method provides, you will get more mileage out of the SSDF ratings with less concern.
>
>Nitpicking is exactly what you are doing by going over any post line by line and trying to tear it apart. Your next sentence is my exact point.

Exactly: "Don't read into the ratings more than the SSDF method provides." People just don't understand the list.

>Don't read too much into the SSDF ratings, since they may not be as accurate as many people here would like to believe.

Please define "accurate" in this matter.

>>>I still believe that computer/computer games exaggerate the difference in chess programs' ratings.
>>
>>From whence comes the 'scale' you internally use to assess that the SSDF scale is 'exaggerated'? Which scale is 'right'? Why is any one scale more 'right' than another? I'm just philosophizing--no answer is desired. These are my own rhetorical questions.
>
>Rhetorical in your mind, maybe, but to me it is the crux of the matter. First of all, I don't have an internal scale. Here is a question which is not rhetorical: why has the SSDF made several "adjustments" to their rating list over the years?

To calibrate it to human ratings.

>Second question: why has the adjustment always been downward?

Not always. We have done it upwards also.

>Third question: what would the rating of Fritz 7 be today without those adjustments? Fourth question: why were the adjustments necessary?
>
>>If the rating given to comp games is based on comp vs. comp, then the scale is simply what it is. It would still give a relative comparison between comps--based on the comp-comp scale.
>>
>>Are you trying to compare the SSDF comp-comp scale with a human-human scale? Why would you do that, if in fact the scales are different? Or, more to the point, why would you want one scale to be exactly the same as the other, when the pools are entirely different in the first place?
>>
>>I realize that many chess players throughout the world are familiar with human-human rating systems, many or most based on the ELO system. I also think we typically want to assess the 'true' strength of a comp against ourselves, i.e. against the human race. This is how we humans take the measure of something else--by comparing it against ourselves.
>>
>>Nothing inherently wrong with this, but it sometimes leads to 'forced fit' comparison situations that are more ridiculous than the simple observation that some things are not the same as other things. Is an automobile better than a human?
>>By how much? What is the proper rating scale to compare the two? [Maybe we are talking about simply travelling/running at high speed; maybe we are talking about how long such a thing lives/lasts; maybe we are talking about which is more intelligent.]
>
>>>If that's true, then it logically follows that playing one computer on 450 MHz vs. one on 1200 MHz will also exaggerate the difference even more.
>>
>>I don't see the logic. I don't see the exaggeration. You would have to explain your personal scale, first.
>
>The logic is in the SSDF history. The adjustments made by the SSDF were because of the exaggeration. I don't have a personal scale. (I don't need one.)
>
>>It is logical to expect that a faster computer will produce longer & deeper analysis (more nodes, better evaluation). If a test is run between a slow computer & a fast computer, the math used to calculate a rating should take that into account. The ELO system does take that into consideration--even if it isn't the only way, nor a perfect way, of creating relative ratings.
>>
>>I mean that if one computer beats another computer 1000 to 200, using the same computer speeds, then the relative rating will be different than if the same computer beats the same program 1000 to 200 on processors that differ by a factor of two.
>>
>>The ELO scale (the SSDF list is based on this, generally speaking) takes into account the fact that a given result (xxxx vs. yyyy) or score implies an expectation of results & a relative rating difference that varies, depending on the rating difference of the opponents.
>>
>>If you beat an opponent 50 rating points below you by a margin of 2 to 1, you will gain or lose a different amount of ELO points than if you beat an equally rated player by the same margin, or a higher rated player by the same margin. You see, the 'scale' by which ratings are assigned or moved up & down varies depending on the initial relative difference of the two opponents.
>>
>>Since a doubling (or whatever) of processor speed is considered to be roughly equal to +50 points in relative rating, and most statistical measurement scales based on the normal bell curve work relatively accurately when the things being measured fall closer together (toward the center of the scale) rather than farther apart (toward the extreme side or sides of the measured population), the ELO method applied by the SSDF to comp-comp tests is relatively accurate for programs that are relatively close in strength, even when played on CPUs that vary in speed by a factor of 2 (since that is merely an induced delta of approximately 50 points due to CPU alone).
>>
>>Did you know that the ELO scale created by Arpad Elo was designed intentionally by him with the following principle: that a given point difference between two opponents, no matter where they fall on the overall rating scale, means that the result expectation (probability) of the higher or lower player winning, drawing or losing is identical? [Perhaps I didn't word this the best.]
>>
>>You would do well to study some statistics (basic texts are fine). [I'm not looking down my nose at you. You don't have to study/know statistics, but if you did, you might appreciate the math of statistics in a better way, and thus get more out of certain testing results by understanding the underlying statistical premises & calculations.]
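To make Steve's arithmetic concrete, here is a minimal sketch (in Python; the function names and example numbers are mine for illustration, not anything the SSDF actually runs) of the Elo expectation and update he describes, plus the rough 50-points-per-speed-doubling rule of thumb:

  import math

  def expected_score(r_a, r_b):
      # Elo's design principle: only the rating DIFFERENCE matters.
      # A 50-point gap predicts the same score at 2650 vs 2600
      # as it does at 1450 vs 1400.
      return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

  def updated_rating(r_a, r_b, score, k=16):
      # Gain/loss depends on actual score vs. expectation, so beating
      # a weaker opponent 2 to 1 moves the rating less than beating
      # an equally rated opponent 2 to 1.
      return r_a + k * (score - expected_score(r_a, r_b))

  def speed_delta(fast_mhz, slow_mhz, elo_per_doubling=50.0):
      # Rule of thumb from the post: each doubling of CPU speed is
      # worth roughly +50 Elo, so the delta scales with log2 of the ratio.
      return elo_per_doubling * math.log2(fast_mhz / slow_mhz)

  print(expected_score(2650, 2600))     # ~0.57 for the 50-point favorite
  print(round(speed_delta(1200, 450)))  # ~71 points (a factor of ~2.67, not 2)

Note that the last line also anticipates Jim's point below: 1200 MHz vs. 450 MHz is nearer a factor of 2.7 than 2, so the rule of thumb gives about 71 points, not 50.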
>Maybe you should go back to school and do some more studying of your own (since 1200 MHz vs. 450 MHz is not exactly a factor of 2). See, anybody can be nitpicking. It doesn't take a math major. I have enough statistics in my background to understand the Elo system and some of its weaknesses. It's not necessary to have any math background to form my opinion, which is based more on the history of the SSDF and their testing procedures. By the way, you are looking down your nose. I don't think you are looking at me, though. You don't know me or anything about me or my background in statistics, but since you've probably had a couple of classes, you assume you know more than anyone else--and thus your lecture on the Elo system and statistics. (Sorry for the long sentence. I'm not very well educated. I'm a high school dropout, in fact.)
>
>>If you want to compare comp vs. comp, then you should compare by doing comp-comp tests--exactly what the SSDF is doing. If the resultant scale or relative differences are not to one's liking, that does not mean they are 'exaggerated'. They are what they are. There is no better way to test comp vs. comp for relative ratings than testing by comp-comp games.
>>
>>I have seen Jeff Sonas's articles pointing out what he says are the flaws in human-human rating systems that are ELO based. He may be right. He thinks that relative human-human ratings are distorted more or less at different places on the ELO results (ratings) scale.
>>
>>I grant that he is correct in his opinion--but I don't know if that automatically makes the ELO system applied to comp-comp games conducted by the SSDF distorted in the same manner.
>
>Who said it was?
>
>>In fact, I think the opposite might be true. The SSDF doesn't *only* play top programs against top opponents on top-speed CPUs. This avoids some of the 'elite' bias that Sonas has pointed out in the FIDE ELO application to human-human competition. Thus the distortion of ELO applied to human-human ratings may be less so when applied to comp-comp testing (for example, as done by the SSDF).
>>
>>>The SSDF is NOT a scientifically validated test.
>>
>>It is an axiom of proper scientific experimentation that one presents the description of the test as clearly & thoroughly as possible, then the results, also as clearly & thoroughly as possible. Then the test & the results speak entirely for themselves (no bias added by the tester).
>>
>>Then the reviewer (another scientist, or even a layman) makes up his own mind whether the results of the test are indicative of something more than the mere results provide. The confidence level (opinion level) of the reviewer may vary, indeed does vary, according to the personal opinions & biases & knowledge of the reviewer.
>>
>>A test is *never* scientifically validated to the nth degree. It may be repeatable and allow certain inferences to be more or less confidently claimed, but it is never absolute *proof* nor *proven*. Especially when it comes to using the results of testing (ratings) to predict the future--no rating system predicts the future perfectly, nor will one ever be able to do so. Therefore, back to the question--what scale do you use, or do you want to use, and how would you pick such a scale to be the 'normative' one against which the accuracy of another scale (say the SSDF ELO one) is measured?
>>Picking an arbitrary scale (not itself *scientifically validated*), i.e. one that isn't calibrated, can only lead to improper inferences--either wrong inferences or ones that have the wrong weight (confidence level) attached to them.
>>
>>If you stretch the inferences, then the confidence level should go down. If you remain within the bounds of the test, then you don't read too much into the results (without dismissing them entirely--after all, data is data!) and your confidence level is a bit greater in using the data to make inferences thereafter (to predict future results, or to assess which program is truly strongest).
>
>>>In fact, what the other poster says may make it more accurate than it is, but still not perfect. It's not to say that the SSDF is not doing a good job.
>>
>>The SSDF is doing a good job--better than most individuals could ever do--testing many programs on many platforms against many other combinations of programs/platforms to achieve relative ratings based on many, many games.
>>
>>>It's just that maybe it could be better with a little organization.
>>
>>How would you 'organize' the SSDF better?
>
>I would start by telling each tester exactly how and what to test.

Thoralf is doing that.

>I would publish every single game so that the results could be verified by anyone interested. I'm sorry, but just having a bunch of volunteers doing their "own thing" and reporting the results with error bars as if they were a scientific fact is not the best way.

And the best way is...? BTW, the error bars ARE a mathematical fact, are they not? (A sketch of the calculation follows at the end of this post.)

>It may be better than anything else we have. But it is still not as good as it could be with some organization.

It couldn't be done without organization.

Tony

>
>>>Jim
>>
>>Thanks,
>>--Steve
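As promised above, here is a rough sketch of where an error bar on a rating comes from (Python again; this is my own simplification using a normal approximation on the score percentage, mapped back to the Elo scale--not the SSDF's exact procedure):

  import math

  def elo_error_bar(wins, draws, losses, z=1.96):
      # Approximate 95% error bar, in Elo points, for a performance
      # rating estimated from a batch of games.
      n = wins + draws + losses
      p = (wins + 0.5 * draws) / n          # score percentage
      # Per-game variance of the score (win=1, draw=0.5, loss=0).
      var = (wins * (1.0 - p) ** 2
             + draws * (0.5 - p) ** 2
             + losses * (0.0 - p) ** 2) / n
      se = math.sqrt(var / n)               # standard error of the mean
      lo = max(p - z * se, 1e-6)
      hi = min(p + z * se, 1.0 - 1e-6)
      def to_elo(q):                        # invert the expected-score curve
          return -400.0 * math.log10(1.0 / q - 1.0)
      return to_elo(lo), to_elo(p), to_elo(hi)

  # Example: 1000 games scoring +550 =200 -250 against equal opposition
  # gives roughly +88 / +108 / +128 Elo relative to the opponents.
  print(elo_error_bar(550, 200, 250))

The bar itself is a mathematical consequence of the game count, as Tony says; whether the underlying games are representative is the separate question Jim is raising.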