Author: Dann Corbit
Date: 13:53:49 06/15/04
On June 15, 2004 at 16:05:12, Peter Fendrich wrote:
>On June 15, 2004 at 13:38:56, Dann Corbit wrote:
>
>>On June 12, 2004 at 16:57:06, Peter Fendrich wrote:
>>>On June 11, 2004 at 23:35:39, Dann Corbit wrote:
>>>>On June 11, 2004 at 09:54:54, Peter Fendrich wrote:
>>>>>On June 09, 2004 at 20:24:52, Dann Corbit wrote:
>>>>>>On June 09, 2004 at 19:27:37, Derek Paquette wrote:
>>>>>>>On June 09, 2004 at 19:23:11, Dann Corbit wrote:
>>>>>>>>On June 09, 2004 at 19:07:39, Derek Paquette wrote:
>>>>>>>>>On June 09, 2004 at 18:49:40, Jorge Pichard wrote:
>>>>>>>>>>Taking on a 3400+ AMD 64 with 2 GB RAM and Fritz 8
>>>>>>>>>>http://www.chessbase.com/newsdetail.asp?newsid=1703
>>>>>>>>>this is very annoying for someone who is a chess enthusiast like myself.
>>>>>>>>>why would the company that is marketing this laptop, RISK using a program that
>>>>>>>>>is 40 elo LOWER?
>>>>>>>>>i just don't get it,
>>>>>>>>>i think it comes down to plain old ignorance of chess programs
>>>>>>>>>why NOT use shredder 8?
>>>>>>>>>this is very frustrating, because we never get to see shredder 8 in action vs
>>>>>>>>>grandmasters at tournament time controls.
>>>>>>>>
>>>>>>>>Probably, they have a good reason.
>>>>>>>>For instance, they might take 7.04 and analyze every game she has ever played
>>>>>>>>at very slow time control. Now, they have a database and expected response for
>>>>>>>>most of the moves she is likely to make.
>>>>>>>>
>>>>>>>>Perhaps the analysis started long ago. They know for sure exactly how it would
>>>>>>>>work with 7.04
>>>>>>>>
>>>>>>>>Bleeding edge is not always the best thing, if you want a reliable outcome.
>>>>>>>>For the same reason, we won't always see the fastest possible hardware. It
>>>>>>>>could be that the fastest stuff has not been tested. It would be a mistake to
>>>>>>>>try an untested system.
>>>>>>>
>>>>>>>that is very true, if shredder 8 was released last week, HOWEVER,
>>>>>>>shredder 8 has been released long enough for the following to happen,
>>>>>>>SSDF has had enough time to test it
>>>>>>>ICC is full of shredder 8 (and it turning humans into mince meat)
>>>>>>>
>>>>>>>that is enough to say that the program is well tested, and that it would kick
>>>>>>>the crap out of a human, because it's certainly beating around fritz 8.
>>>>>>
>>>>>>It is not known whether Fritz 8 would do better against humans than Shredder 8.
>>>>>>
>>>>>>We might surmise it from SSDF and WMCCC results, but that is really an
>>>>>>extrapolation that may not be correct.
>>>>>
>>>>>I agree to 100%. It's an extrapolation - only experience can tell if it's right.
>>>>>
>>>>>>
>>>>>>At any rate, even the SSDF Elo strength rating also does not decide who is
>>>>>>stronger:
>>>>>
>>>>>This is not the right way to interpret the table. I should know as I once
>>>>>designed that table :-)
>>>>>First: The ratings 2818 for Shredder and 2790 for Deep Fritz are their ratings
>>>>>to the best of our knowledge, given the information we have from results. That
>>>>>is the best we can say, regardless of confidence.
>>>>
>>>>I think "x +/- y" is a better way to say it.
>>>
>>>No, that's wrong. The rating is well defined exactly as one value. No fuzziness
>>>at all even if it will change when you add new games. The "real" rating (we
>>>haven't even defined what it is here) we will never know, but it's exactly one
>>>value that never changes.
>>>The interval was invented by me and is not used by Arpad Elo.
>>>It is not exactly the same thing to claim that there is 95% prob that the
>>>interval is covering the "real" rating and that the "real" rating is x +/- y
>>>Think about it, the interval is jumping around depending on the games but the
>>>"real" rating is sitting still.
>>>
>>>>Either that, or round to 2 digits
>>>>of accuracy (which is about what is present).
>>>>It's not like a measurement of
>>>>the height of a tree or the weight of a metal mass.
>>>
>>>Yes, it is by definition!
>>
>>You can define it to be the number calculated.
>>I can calculate the Elo of Program x as 2350.6932471 against a pool of 4
>>programs after 16 games.
>>
>>The real number is somewhere between 1000 and 3000. Even the first digit has no
>>significance.
>>
>>My point is that the digits after the 2 have no significance whatsoever.
>>As more and more games are built up, more and more digits have meaning. If we
>>were to play one trillion games, I could get perhaps 8-9 significant digits.
>>
>>>>It's a broad sample of data
>>>>from a collection which we know will experience a lot of randomness.
>>>>
>>>>>Second: The interval is another story. We don't know the real rating point. The
>>>>>interval [2786,2852] for Shredder is covering the real point with a confidence
>>>>>of 95% given the information we have.
>>>>>To add and subtract the ratings for different individuals to find out if we have
>>>>>an overlap is not the right way to go. If they overlap we can't say anything
>>>>>about where the two real ratings are placed without doing some more math. If the
>>>>>interval from one of them is covering the estimated rating of the other, as it
>>>>>does in this case (2786 is less than 2790), we could probably make some kind of
>>>>>statement.
>>>>>/Peter
>>>>
>>>>My point was that:
>>>>1. The ratings are fuzzy numbers, not numbers with 4 digits of precision.
>>>>2. The rating "area of fuzziness" overlaps for the two programs. That means it
>>>>is reasonable to say that it is not proven which is stronger.
>>>
>>>That is NEVER proven regardless of number of games.
>>>An overlap by two 95%-areas can't easily be translated to anything meaningful.
>>
>>If you have played 100,000 games and the Elo of program A is 2500 +/- 1 and the
>>Elo of program B is 2400 +/- 1, then A is stronger than B with a probability
>>very nearly 1.0.
>>In other words, we can be more sure that A is stronger than B,
>>than the chance that the light will come on when we turn the switch (power might
>>be off, it might burn out at the moment of the switch, the switch could go
>>defective, a rat could chew the wires...)
>>
>>>>In LIKELIHOOD, the higher program is probably stronger than the lower program.
>>>>
>>>>If someone saw:
>>>>Program x 2739.0123
>>>>Program y 2699.7865
>>>>
>>>>Those may be the absolute x-bar ratings.
>>>>
>>>>But if only 10 games were played, only the first digit has any meaning.
>>>>
>>>>So the error bars say as much or more about the meaning of the ratings than the
>>>>average itself does.
>>>
>>>Well after 10 games you can't even rely on the accuracy of error bars and
>>>shouldn't use them (based on the bell curve) but the rating is well defined as
>>>one value. "x's rating after 10 games is 2739" is a correct statement.
>>
>>That is misleading and very bad science.
>>Why not say that the rating is
>>2739.8356245494183672715153891736273563
>>?
>>Even though you are not even sure about the leading 2.
>
>It's exactly here that we disagree.
>Good science is using the proper definitions.
>It's vague what you mean by rating when you say that the rating is x +/- y.
>There are only two types: Performance rating and current rating. Both are
>exactly defined as one number (skip the decimals) computed by a well defined
>formula.

Yes, but why stop at the decimals? That is completely arbitrary.
If I measure a boat with a stick that is about a foot long, and I state the length in microns, that is a bad way to report the information. If I had used a micrometer or a laser rangefinder, then it might be reasonable to do so.

>They are estimations of the "real" rating (the definition of this for
>chess engines is a thread of its own).
>The "real" rating is forever hidden from us.

In the same way, the real measurement is hidden from us for all measurements.
I am 6' 3/4" tall, about.
If nine people measured me to the nearest millimeter, I would probably see at least 5 different measurements. And if I were measured in the morning, the number would be different than in the evening. If it is 100 degrees at noon, I will be a bit longer than if it is 20 degrees below zero.

Every real measurement has uncertainty (even a count is subject to error). In reporting a figure, I think it a very good practice to include a tolerance. So, if I have taken 1000 measurements of my height in bare feet on a doctor's scale, the result will probably be better than a single measurement with a logger's tape and a pencil. If I measure with a micrometer, I would give a different indication than if I had measured with a yardstick.

>To say that the "real" rating is x +/- y with 95% probability is also vague and
>could even be wrong.
>The proper definition is that the interval is covering the "real" rating with a
>confidence level of 95%
>
>This is not only about semantics. It's important to use the same definitions in
>order to communicate correctly and to interpret the results. How could we ever
>understand and evolve the underlying formulas without using the same
>definitions?

It is important that the definitions are clear. I do not think that the definitions preclude describing how fuzzy the number is, unless you formally decide that the accuracy does not matter.
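
For anyone who wants to play with the numbers, here is a rough Python sketch of the two points argued above: how the error bar shrinks as games accumulate, and how sure we can be that A is stronger than B when each rating comes with a +/- figure. It is only an illustration under stated assumptions, not the SSDF method. The function names (elo_diff, elo_interval, prob_a_stronger) are made up for this post. It assumes the usual logistic Elo curve, a normal approximation to the score, and independent estimates. The last function also treats "x +/- y" as a probability statement about the real rating, which is exactly the reading Peter cautions against, so take it as a back-of-the-envelope figure.

```python
import math


def elo_diff(p, eps=1e-9):
    """Elo difference implied by a score fraction p (logistic Elo curve)."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log of zero at 0% or 100%
    return 400.0 * math.log10(p / (1.0 - p))


def elo_interval(wins, draws, losses, z=1.96):
    """Return (low, estimate, high) Elo difference for a match result.

    Assumption: normal approximation to the mean score, z=1.96 for ~95%.
    With very few games this approximation is unreliable.
    """
    n = wins + draws + losses
    p = (wins + 0.5 * draws) / n
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    var = (wins * (1.0 - p) ** 2 + draws * (0.5 - p) ** 2 + losses * p ** 2) / n
    se = math.sqrt(var / n)
    return elo_diff(p - z * se), elo_diff(p), elo_diff(p + z * se)


def prob_a_stronger(r_a, half_a, r_b, half_b, z=1.96):
    """Rough chance that A's real rating exceeds B's.

    Assumption: each rating is an independent normal estimate whose
    95% half-width is half_a / half_b.
    """
    sd = math.hypot(half_a / z, half_b / z)
    return 0.5 * (1.0 + math.erf((r_a - r_b) / (sd * math.sqrt(2.0))))


if __name__ == "__main__":
    # Evenly matched programs: watch the interval narrow as games pile up.
    for n in (16, 1600, 160000, 16000000):
        low, est, high = elo_interval(n // 2, 0, n // 2)
        print(f"{n:>9} games: {est:+.1f} Elo, ~95% interval [{low:+.1f}, {high:+.1f}]")
    # The 2500 +/- 1 versus 2400 +/- 1 example from the post.
    print(prob_a_stronger(2500, 1, 2400, 1))
```

After 16 games the interval is hundreds of Elo wide, so only the leading digit of a four-digit rating carries any weight; it takes millions of games before the interval drops below one Elo point. For ratings that sit closer together, such as 2818 and 2790, the answer depends heavily on the half-widths you plug in.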
Last modified: Thu, 15 Apr 21 08:11:13 -0700