Computer Chess Club Archives


Subject: Re: This Super Laptop with Fritz 8 would even beat Judith Polgar!

Author: Dann Corbit

Date: 13:53:49 06/15/04


On June 15, 2004 at 16:05:12, Peter Fendrich wrote:

>On June 15, 2004 at 13:38:56, Dann Corbit wrote:
>
>>On June 12, 2004 at 16:57:06, Peter Fendrich wrote:
>>>On June 11, 2004 at 23:35:39, Dann Corbit wrote:
>>>>On June 11, 2004 at 09:54:54, Peter Fendrich wrote:
>>>>>On June 09, 2004 at 20:24:52, Dann Corbit wrote:
>>>>>>On June 09, 2004 at 19:27:37, Derek Paquette wrote:
>>>>>>>On June 09, 2004 at 19:23:11, Dann Corbit wrote:
>>>>>>>>On June 09, 2004 at 19:07:39, Derek Paquette wrote:
>>>>>>>>>On June 09, 2004 at 18:49:40, Jorge Pichard wrote:
>>>>>>>>>>Taking on a 3400+ AMD 64 with 2 GB RAM and Fritz 8
>>>>>>>>>>http://www.chessbase.com/newsdetail.asp?newsid=1703
>>>>>>>>>this is very annoying for someone who is a chess enthusiast like myself.
>>>>>>>>>why would the company that is marketing this laptop RISK using a program that
>>>>>>>>>is 40 elo LOWER?
>>>>>>>>>i just don't get it,
>>>>>>>>>i think it comes down to plain old ignorance of chess programs
>>>>>>>>>why NOT use shredder 8?
>>>>>>>>>this is very frustrating, because we never get to see shredder 8 in action vs
>>>>>>>>>grandmasters at tournament time controls.
>>>>>>>>
>>>>>>>>Probably, they have a good reason.
>>>>>>>>For instance, they might take 7.04 and analyze every game she has ever played
>>>>>>>>at a very slow time control.  Now, they have a database and expected response for
>>>>>>>>most of the moves she is likely to make.
>>>>>>>>
>>>>>>>>Perhaps the analysis started long ago.  They know for sure exactly how it would
>>>>>>>>work with 7.04
>>>>>>>>
>>>>>>>>Bleeding edge is not always the best thing, if you want a reliable outcome.
>>>>>>>>For the same reason, we won't always see the fastest possible hardware.  It
>>>>>>>>could be that the fastest stuff has not been tested.  It would be a mistake to
>>>>>>>>try an untested system.
>>>>>>>
>>>>>>>that is very true, if shredder 8 was released last week, HOWEVER,
>>>>>>>shredder 8 has been released long enough for the following to happen,
>>>>>>>SSDF has had enough time to test it
>>>>>>>ICC is full of shredder 8 (and it is turning humans into mincemeat)
>>>>>>>
>>>>>>>that is enough to say that the program is well tested, and that it would kick
>>>>>>>the crap out of a human, because it's certainly beating fritz 8 around.
>>>>>>
>>>>>>It is not known whether Fritz 8 would do better against humans than Shredder 8.
>>>>>>
>>>>>>We might surmise it from SSDF and WMCCC results, but that is really an
>>>>>>extrapolation that may not be correct.
>>>>>
>>>>>I agree 100%. It's an extrapolation - only experience can tell if it's right.
>>>>>
>>>>>>
>>>>>>At any rate, even the SSDF Elo rating does not decide who is
>>>>>>stronger:
>>>>>
>>>>>This is not the right way to interpret the table. I should know as I once
>>>>>designed that table :-)
>>>>>First: The ratings 2818 for Shredder and 2790 for Deep Fritz are their ratings
>>>>>to the best of our knowledge, given the information we have from results. That
>>>>>is the best we can say, regardless of confidence.
>>>>
>>>>I think "x +/- y" is a better way to say it.
>>>
>>>No, that's wrong. The rating is well defined as exactly one value. No fuzziness
>>>at all, even if it will change when you add new games. The "real" rating (we
>>>haven't even defined what it is here) we will never know, but it's exactly one
>>>value that never changes.
>>>The interval was invented by me and is not used by Arpad Elo.
>>>It is not exactly the same thing to claim that there is a 95% probability that
>>>the interval is covering the "real" rating and to claim that the "real" rating
>>>is x +/- y.
>>>Think about it: the interval is jumping around depending on the games, but the
>>>"real" rating is sitting still.
>>>
>>>>Either that, or round to 2 digits
>>>>of accuracy (which is about what is present).  It's not like a measurement of
>>>>the height of a tree or the weight of a metal mass.
>>>
>>>Yes, it is by definition!
>>
>>You can define it to be the number calculated.
>>I can calculate the Elo of Program x as 2350.6932471 against a pool of 4
>>programs after 16 games.
>>
>>The real number is somewhere between 1000 and 3000.  Even the first digit has no
>>significance.
>>
>>My point is that the digits after the 2 have no significance whatsoever.
>>As more and more games are built up, more and more digits have meaning.  If we
>>were to play one trillion games, I could get perhaps 8-9 significant digits.
>>
>>>>It's a broad sample of data
>>>>from a collection which we know will experience a lot of randomness.
>>>>
>>>>>Second: The interval is another story. We don't know the real rating point. The
>>>>>interval [2786,2852] for Shredder is covering the real point with a confidence
>>>>>of 95% given the information we have.
>>>>>To add and subtract the ratings for different individuals to find out if we have
>>>>>an overlap is not the right way to go. If they overlap we can't say anything
>>>>>about where the two real ratings are placed without doing some more math. If the
>>>>>interval of one of them covers the estimated rating of the other, as it
>>>>>does in this case (2786 is less than 2790), we could probably make some kind of
>>>>>statement.
>>>>>/Peter
>>>>
>>>>My point was that:
>>>>1.  The ratings are fuzzy numbers, not numbers with 4 digits of precision.
>>>>2.  The rating "area of fuzziness" overlaps for the two programs.  That means it
>>>>is reasonable to say that it is not proven which is stronger.
>>>
>>>That is NEVER proven regardless of number of games.
>>>An overlap by two 95%-areas can't easily be translated to anything meaningful.
>>
>>If you have played 100,000 games and the Elo of program A is 2500 +/- 1 and the
>>Elo of program B is 2400 +/- 1, then A is stronger than B with a probability
>>very nearly 1.0.  In other words, we can be more sure that A is stronger than B,
>>than the chance that the light will come on when we turn the switch (power might
>>be off, it might burn out at the moment of the switch, the switch could go
>>defective, a rat could chew the wires...)
>>
>>>>In LIKELIHOOD, the higher program is probably stronger than the lower program.
>>>>
>>>>If someone saw:
>>>>Program x 2739.0123
>>>>Program y 2699.7865
>>>>
>>>>Those may be the absolute x-bar ratings.
>>>>
>>>>But if only 10 games were played, only the first digit has any meaning.
>>>>
>>>>So the error bars say as much or more about the meaning of the ratings than the
>>>>average itself does.
>>>
>>>Well after 10 games you can't even rely on the accuracy of error bars and
>>>shouldn't use them (based on the bell curve) but the rating is well defined as
>>>one value. "x's rating after 10 games is 2739" is a correct statement.
>>
>>That is misleading and very bad science.
>>Why not say that the rating is
>>2739.8356245494183672715153891736273563
>>?
>>Even though you are not even sure about the leading 2.
>
>It's exactly here that we disagree.
>Good science is using the proper definitions.
>It's vague what you mean by rating when you say that the rating is x +/- y.
>There are only two types: Performance rating and current rating. Both are
>exactly defined as one number (skip the decimals) computed by a well defined
>formula.

Yes, but why stop at the decimals?  That is completely arbitrary.  If I measure
a boat with a stick that is about a foot long, and I state the length in
microns, that is a bad way to report the information.  If I had used a
micrometer or a laser rangefinder, then it might be reasonable to do so.
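
To make the boat-and-stick point concrete, here is a minimal sketch in Python of reporting a number only to the precision its error supports. The helper name `format_with_uncertainty` and the convention of keeping one significant digit of the error are my own assumptions for illustration; this is not how SSDF or anyone else formats its list.

```python
import math


def format_with_uncertainty(value, error):
    """Report value +/- error, keeping one significant digit of the error.

    Hypothetical helper for illustration only; the one-significant-digit
    convention is an assumption, not a rule taken from any rating list.
    """
    if error <= 0:
        raise ValueError("error must be positive")
    # Decimal position of the leading digit of the error.
    exponent = math.floor(math.log10(error))
    rounded_error = round(error, -exponent)
    # Rounding can push the error into the next decade (96 becomes 100).
    exponent = math.floor(math.log10(rounded_error))
    rounded_value = round(value, -exponent)
    decimals = max(0, -exponent)
    return f"{rounded_value:.{decimals}f} +/- {rounded_error:.{decimals}f}"


if __name__ == "__main__":
    # The error figures here are made up for the example.
    print(format_with_uncertainty(2739.8356245494, 33))
    print(format_with_uncertainty(2350.6932471, 400))
```

With an error of a few hundred points, everything after the second digit of the rating is dropped, which is the same thing as not quoting the length of the boat in microns.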

>They are estimations of the "real" rating (the definition of this for
>chess engines is a thread of its own).
>The "real" rating is forever hidden from us.

In the same way, the real measurement is hidden from us for all measurements.  I
am 6' 3/4" tall, about.  If nine people measured me to within the millimeter, I
would probably see at least 5 different measurements.  And if I were measured in
the morning, the number would be different than in the evening.  If it is 100
degrees at noon, I will be a bit taller than if it is 20 degrees below zero.

Every real measurement has uncertainty (even a count is subject to error).
In reporting a figure, I think it is very good practice to include a tolerance.

So, if I have taken 1000 measurements of my height in bare feet on a doctor's
scale, their average will probably be better than a single measurement with a
logger's tape and a pencil.
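
A rough sketch of why the 1000 measurements win, assuming the individual errors are independent and random with the same spread: the uncertainty of the average shrinks as the spread divided by the square root of the number of measurements. The 5 mm spread below is a made-up figure, and a systematic effect (morning versus evening, say) would not average out this way.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean of n independent measurements.

    sigma is the spread (standard deviation) of a single measurement.
    """
    if n < 1:
        raise ValueError("need at least one measurement")
    return sigma / math.sqrt(n)


if __name__ == "__main__":
    sigma_mm = 5.0  # assumed spread of one height measurement, in mm
    for n in (1, 10, 100, 1000):
        se = standard_error(sigma_mm, n)
        print(f"{n:5d} measurements: average good to about +/- {se:.2f} mm")
```

The same square-root law is why piling up more games slowly adds meaningful digits to a rating: each extra digit costs about 100 times as many games.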

If I measure with a micrometer, I would give a different indication than if I
had measured with a yardstick.

>To say that the "real" rating is x +/- y with 95% probability is also vague and
>could even be wrong.
>The proper definition is that the interval is covering the "real" rating with a
>confidence level of 95%
>
>This is not only about semantics. It's important to use the same definitions in
>order to communicate correctly and to interpret the results. How could we ever
>understand and evolve the underlying formulas without using the same
>definitions?

It is important that the definitions are clear.  I do not think that the
definitions preclude describing how fuzzy the number is, unless you formally
decide that the accuracy does not matter.
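
For what it is worth, here is a minimal sketch of how one could attach that fuzziness to a rating difference computed from a match result. It uses the usual logistic Elo formula and a normal approximation for the 95% interval on the score. Both are my assumptions for illustration, not a description of the formulas SSDF uses, and the function names are hypothetical. As Peter points out, the normal approximation should not be trusted for a handful of games.

```python
import math


def elo_from_score(score):
    """Elo difference implied by an expected score (logistic model)."""
    eps = 1e-6  # keep the score strictly inside (0, 1) so the log is defined
    score = min(max(score, eps), 1.0 - eps)
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_interval(wins, draws, losses, z=1.96):
    """Return (estimate, low, high) Elo difference for a match result.

    Normal approximation on the mean score; z=1.96 gives roughly a 95%
    confidence level. Illustrative sketch only.
    """
    n = wins + draws + losses
    if n == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(variance / n)
    return (elo_from_score(score),
            elo_from_score(score - z * se),
            elo_from_score(score + z * se))


if __name__ == "__main__":
    for games in (100, 10_000, 1_000_000):
        half = games // 2
        est, low, high = elo_interval(half, 0, half)
        print(f"{games:9d} games: {est:+.1f} Elo, interval [{low:+.1f}, {high:+.1f}]")
```

The single number and the interval come out of the same data, so reporting both loses nothing, and the interval tells the reader how many of the digits to take seriously.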




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.