Author: Dann Corbit
Date: 18:21:56 01/23/04
On January 23, 2004 at 21:04:06, Rolf Tueschen wrote:

>On January 23, 2004 at 20:38:01, Dann Corbit wrote:
>
>>On January 23, 2004 at 20:00:30, Rolf Tueschen wrote:
>>
>>>On January 23, 2004 at 18:33:52, Dann Corbit wrote:
>>>
>>>>On January 23, 2004 at 18:20:34, Russell Reagan wrote:
>>>>
>>>>>On January 23, 2004 at 15:24:31, Dann Corbit wrote:
>>>>>
>>>>>>30 experiments is a fairly standard rule as to when you should start to
>>>>>>trust the results for experimental data.
>>>>>
>>>>>So what does this mean for chess engine matches? You need at least 30 games?
>>>>>Or 30 matches? If matches, how do you determine how long each match should be?
>>>>
>>>>It means less than 30 games and you cannot trust the answer.
>>>>With more than 30 games, confidence rises.
>>>>
>>>>I bring up the number 30 because it is important in this case. If you run
>>>>(for instance) a 15-game contest, it would be dangerous to try to draw
>>>>conclusions from it. With 30 games or more, even something that does not
>>>>perfectly model a normal distribution will start to conform to the right
>>>>answers (e.g. the mean calculation will be about right; the standard
>>>>deviations will be about right unless sharply skewed).
>>>>
>>>>30 games is the break-even limit where deficiencies in the choice of a
>>>>normal distribution as a model start to become smoothed over.
>>>
>>>About what measurements are you talking here? Of course the N is right for
>>>normally distributed variables, but what do you "measure" with chess games?
>>
>>+1, -1, 0
>
>First, thanks for the night answer. So I can go to sleep with a good feeling.
>In your second post you spoke of three "states". I have a question: are these
>states equally probable? - Answer: of course NOT! What is the most probable?
>Answer: one of them, the draw. As a direct chess variable.
>So we have: a) 3 states as the outcome of one [!] "measurement", b) a
>different pre-event probability for the three outcomes, and last but not
>least c) a shaky distribution between the three states! Now the super
>question: what do you measure, Dann? Three states at a time. Very odd, that.
>I wouldn't speak of measurement and stats at all! But you still believe in 30
>"measurements" although you don't have "measurements" and certainly no
>clearly defined variable that you define in advance! Very bad for a testing
>procedure.

The thing I am measuring is the outcome of the game. It is won, lost or drawn.

>NB I tell you that a pool of actually modern progs are all equally strong and
>when you find differences that they are all by chance!

Straw man.

>Now prove to me why after 30 games you can still make valid conclusions
>although you have no exactly defined variable to "measure".

Straw man.

>Could you help me to understand why all this could be met by the SSDF
>practice of 40-game matches???

It could be met by individual measurements of single games, over a broad
spectrum. The 40-game matches are for expediency, I am sure. You just set up
the Auto232 player and let it run for a couple of weeks. They combine a large
set of measurements to form a single estimate.

What does FIDE or the USCF do? They have players play against each other.
Usually, we have only a single color for a given event unless it is an
important contest like a championship. After a long period of time, we will
accumulate several hundred games.
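As a rough sketch of what those +1/0/-1 "measurements" look like in practice
(the win/draw/loss probabilities below are made-up numbers purely for
illustration, not data from any real match), a few lines of Python show how
the estimated mean score and its standard error behave as the number of games
grows; with far fewer than 30 games the uncertainty is still very large
compared to the differences we usually care about:

import math
import random

# Treat each game as one measurement: +1 = win, 0 = draw, -1 = loss.
# The probabilities are assumptions chosen only for the example.
random.seed(1)
outcomes = [+1, 0, -1]
weights = [0.30, 0.45, 0.25]

for n in (5, 15, 30, 100, 1000):
    sample = random.choices(outcomes, weights=weights, k=n)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)   # sample variance
    sem = math.sqrt(var / n)                               # standard error of the mean
    print(f"{n:5d} games: mean score {mean:+.3f} +/- {sem:.3f}")

The standard error shrinks like 1/sqrt(n), which is why a handful of games
tells you very little and why accumulating many games matters.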
The games can be used to form an Elo figure. This Elo figure can be used to
make a guess as to how players of a particular class would perform under
similar (tournament) conditions against players of the same or of a different
class. So (for instance) I could take 100 FIDE players between 2500 and 2600
Elo and play them against 100 FIDE players between 2300 and 2400 Elo. Using a
computer model based on the original Elo figures, I could make an estimate of
the total outcomes.

Now, this estimator will not tell me anything about any single contest, except
a spectrum of what is more likely. We do not know that even the 2600 player
will beat the 2300 one. It is the same for the SSDF and for any other body of
data.

I can take those same FIDE players and have them play online for fun at blitz
games. The outcomes might be very different from a serious tournament at FIDE
time controls. The same thing is true for SSDF data.

>>>Second question: when you have almost equally strong chess programs, are
>>>you implying that after 30 games you can make a sound conclusion about
>>>which one is stronger?
>>
>>No. After 30 games you can start to believe the measurements.
>>
>>>- If you think you can answer with YES, then I doubt it.
>>
>>So do I.
>>
>>>Of course, if you then do - what the SSDF is doing - matches between two
>>>unequal progs, you can well get clear results after 5 games.
>>
>>But only by accident.
>
>I meant outdated progs against the newest ones! The older ones with one, two
>or even three generations on their back. I call this bullying.

Playing against much weaker or stronger programs does not matter much. Now, if
the average Elo of the opponents was in total much lower or higher, that could
skew the result. But the weaker opponents are actually the best ones. We have
the most games against them and therefore the most accurate estimates of their
ability. If we measure with a truck scale, we cannot get as good a number as
if we measure with a laboratory balance.

>>>But of course 30 games will be a good profit for the better program.
>>
>>It will be a good start.
>>
>>>Also if tests have been known where a directly concurring prog usually gets
>>>fewer points against this outdated prog...
>>>
>>>What I want to say is this. It was often explained here in CCC. For good
>>>results you must enter into some-thousand-games mode. 30 games is just for
>>>laughter. It is an irrelevant species.
>>
>>30 games is the bare minimum number at which the result may have a tiny
>>scrap of validity. If you use less than that, you can be reporting pure
>>nonsense.
>
>Dann, let's get real. Almost nobody plays matches with only 30 games or less.
>Fact is that the SSDF is testing with 40-game matches. Tell me please now
>what you think about their results.

I think that their results are very good. I think that less than 1% of the
world understands what the results mean. I think it is better data than any
other source. I think it is superior in quality to (for instance) the FIDE
numbers.

>Also note that they are proudly presenting their number one although the
>deviations are often bigger than the differences between the top progs. What
>should we conclude from that practice?

I conclude that it is very good science. Often, it is used for some purpose
other than what we can divine from it. But that is not the fault of the SSDF.
The error is on the part of those attempting to perform extrapolations from
the data.
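A back-of-the-envelope sketch (my own assumed numbers, not SSDF figures) of
why a single 40-game match, taken on its own, still leaves a wide band of
uncertainty once the score is translated into Elo:

import math

# Assumptions for illustration: a 40-game match and a per-game standard
# deviation of about 0.4 (a typical value when many games are drawn).
games = 40
per_game_sd = 0.4

se_score = per_game_sd / math.sqrt(games)    # standard error of the score fraction

def score_to_elo(p):
    # Elo difference implied by a score fraction p (standard logistic model).
    return -400.0 * math.log10(1.0 / p - 1.0)

# One-sigma band around a 50% score, expressed in Elo.
low, high = score_to_elo(0.5 - se_score), score_to_elo(0.5 + se_score)
print(f"score 0.500 +/- {se_score:.3f}  ->  roughly {low:+.0f} to {high:+.0f} Elo")

Under these assumptions the one-sigma band works out to something on the order
of plus or minus 40-45 Elo for one 40-game match; pooling many matches over
many opponents is what narrows the band.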
Here is what the SSDF data tells us: If I get the exact computers used in the SSDF, and use the same programs and opening books, I can repeat the experiments that they performed using an Auto232 player. At the end of the experiment, I will expect a similar result to what they saw before. The experiment does not tell us which program is strongest (it never has), but it does give a good indication of strength. The experiment does not tell us how the top programs will perform on a 4 CPU Opteron or on a Beowulf cluster or on a Macintosh.
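As a closing illustration of that last point, here is a small simulation with
numbers assumed purely for the example (a 20 Elo "true" edge and a 45% draw
rate): repeating a 40-game match many times gives results that cluster around
a similar score, and yet the nominally weaker program still finishes ahead a
sizeable fraction of the time.

import random

def expected(diff):
    # Standard Elo expected score for a rating advantage of `diff` points.
    # e.g. expected(200) is about 0.76 -- the 2550-vs-2350 pairing above.
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))

DIFF = 20                      # assumed true Elo edge of program A
DRAW = 0.45                    # assumed draw rate
E = expected(DIFF)             # about 0.529
p_win = E - DRAW / 2           # win probability consistent with that expectation
p_loss = 1.0 - DRAW - p_win

random.seed(3)
trials = 10_000
weaker_ahead = 0
for _ in range(trials):
    score = 0.0
    for _ in range(40):
        r = random.random()
        score += 1.0 if r < p_win else (0.5 if r < p_win + DRAW else 0.0)
    if score < 20.0:           # the truly stronger program scores under 50%
        weaker_ahead += 1
print(weaker_ahead / trials)   # typically somewhere around 0.25-0.30

In other words, a repeat of the experiment should look broadly similar, but a
single 40-game match cannot be expected to identify the stronger of two
closely matched programs with any certainty.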