Computer Chess Club Archives



Subject: Re: A 5 minute experiment. Which is stronger. Any takers???

Author: Dann Corbit

Date: 17:52:05 01/13/00



On January 13, 2000 at 20:23:27, Luis E. Alvarado wrote:
[snip]

>You have a point, but that is why we are forced to rely on the SSDF ratings. If
>Fritz is rated higher than Rebel, then it is stronger.

The SSDF ratings also have a standard deviation figure.  If that number is taken
into account, even within one standard deviation, it is not certain which is
stronger -- Fritz or Rebel.

Nor does it matter much.  99.99% of the people who buy either program will not
be able to beat it 99.99% of the time.

But it does make for good one-upmanship.  "My program can knock the stuffings
out of your program."

OTOH, being ranked in the top ten of that list is a sure indicator of very high
strength.  And the higher up you are, the stronger you probably are.  But you
are not _provably_ stronger than the other programs within one standard
deviation (or, to put it better, the certainty that you are stronger is very
low).

Consider this (current) list:
http://home3.swipnet.se/~w-36794/ssdf/nr000.htm

Here are the top programs (those which have been benched on the 450 MHz
machines):
#  Program                                  Rating    +    -  Games  Won  Avg. opp.
1  Chess Tiger 12.0 DOS 128MB K6-2 450 MHz    2696   44  -40    317  72%       2533
2  Fritz 5.32 128MB K6-2 450 MHz              2671   45  -41    297  72%       2506
3  Nimzo 7.32 128MB K6-2 450 MHz              2663   37  -35    409  69%       2526
4  Nimzo 99 128MB K6-2 450 MHz                2644   52  -48    214  67%       2520
5  Hiarcs 7.32 128MB K6-2 450 MHz             2636   42  -39    320  67%       2509
6  Junior 5.0 128MB K6-2 450 MHz              2619   54  -50    190  65%       2508

The relative ELO of Chess Tiger is 2696 +44/-40 ELO points (to within one
standard deviation).  That means that in this pool of programs, the ELO of CT is
between 2740 and 2656 with a probability of about 2/3 of being correct.  If we
double the standard deviation, the probability will increase to over 9/10.
Under that idea, the ELO of CT could possibly be as high as 2784 or as low as
2616 if we want to be fairly certain that we have the true mark.  As more games
are played, the band will get narrower.  If we played an infinite number of
games, the width would be zero and we would know the true ELO exactly.
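
The band arithmetic above can be sketched in a few lines of Python.  This is
only an illustration of the reasoning in this post: the function name
rating_band and the multiplier k are made up for the example, and it assumes
(as the post does) that the listed +/- figures can be treated as one standard
deviation and simply doubled.

```python
def rating_band(rating, plus, minus, k=1):
    """Return (low, high) for a band k times the listed +/- margins.

    Assumption: the list's +/- columns are treated as one standard
    deviation, so k=2 just doubles them, as in the post.
    """
    return rating - k * minus, rating + k * plus


# Chess Tiger 12.0 from the list: 2696, +44, -40
one_sd = rating_band(2696, 44, 40)        # (2656, 2740)
two_sd = rating_band(2696, 44, 40, k=2)   # (2616, 2784)
print(one_sd, two_sd)
```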

Now, the lowest one on this list of those tested at 450 MHz is Junior.  With an
ELO of 2619 +54/-50, taking two standard deviations would give us an ELO between
2727 and 2519.

So...
CT true ELO probably between [2784 and 2616]
JR true ELO probably between [2727 and 2519]

Notice that 2727 is 111 points higher than 2616.  So Junior could
(theoretically) be the stronger program.  It may be more likely that it is the
other way around, but we *really* don't know for certain.  It could also be as
weak as 2519 in that pool.  The true figure is *probably* closer to the stated
average but (again) we just can't tell from the data.
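
For anyone who wants to repeat the comparison with other pairs from the list,
here is a rough Python sketch.  The helper names two_sd_band and bands_overlap
are invented for this example, and the inputs are the rating and +/- figures
from the list above, doubled in the same way as in the text.  Overlapping
bands only mean the data cannot rule out either ordering; they do not say how
likely each ordering is.

```python
def two_sd_band(rating, plus, minus):
    """Return (low, high) using twice the listed +/- margins."""
    return rating - 2 * minus, rating + 2 * plus


def bands_overlap(a, b):
    """True if the two (low, high) bands share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


tiger = two_sd_band(2696, 44, 40)    # Chess Tiger 12.0
junior = two_sd_band(2619, 54, 50)   # Junior 5.0

print("Chess Tiger:", tiger)
print("Junior:     ", junior)
print("Bands overlap:", bands_overlap(tiger, junior))
```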

The SSDF is probably the strongest indicator for comp/comp performance at the
exact stated conditions of the experiment.  However, as you can easily see, it
does not show what most people think it does.

OTOH, I am very glad that Chess Tiger leads the list because I think Christophe
Theron is a nice guy.

See -- I get emotionally attached too.  So much for scientific objectivity.





Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.