Computer Chess Club Archives



Subject: Re: Being better...

Author: Dann Corbit

Date: 11:46:38 01/23/04

Go up one level in this thread


On January 23, 2004 at 10:12:00, Rolf Tueschen wrote:

>We just had a little dispute about an old topic. When can we say that a prog is
>better than another? How can we proceed to make sound arguments?
>
>Let me tell the story in fast mode.
>
>There was a test. I understand with 300 games or so. An incredibly high number
>of games, because often we have matches with only 20 or 40 games.
>
>I understood further that on the basis of a confidence interval of 1-58 we have
>95%.
>
>Now what I want to tell you, and this is indisputable statistical standard:
>
>if you get a value that is in the interval, we cannot conclude that the
>difference of the two progs is relevant or valid or call it what you want. It
>makes no sense to argue with such "low" differences. They could still be due
>to chance. Now the distribution of chance is the bell curve. Nothing
>else.
>
>We had the debate with the SSDF list often enough.
>
>Two progs stand at the top. One is number one in the ranking. But is it really
>stronger than prog number two???
>
>The answer is easy. If the normal variation, this famous +- value in the SSDF
>list, is say +-40 points and the difference between progs is 35 points, THEN we
>are unable to conclude anything for sure. It could be that 1 is stronger than 2,
>but the contrary could also be true. Only for values >40 do we have
>"certainty", statistically, that a prog in that specific design is proven
>stronger than another one.
>
>This is all so simple and trivial that it is satisfying to be able to clarify.
>
>Have fun,
>
>Rolf
>
>P.S.
>
>I just want to correct a heavy mistake in a former posting. There it was said
>for Elo differences that a difference of say 1 Elo point would speak for
>a better strength of one prog over another, and you would need so and so many
>games to prove that... -- this is total nonsense. There is _no_ way to conclude
>anything from an Elo difference of 1 point, no matter if you have 300 or
>100000 games. The difference of 1 Elo point is meaningless. It's nonsense to
>even think about such necessary millions of games to "prove" that. Statistics
>also has something to do with normal human sense. We would always take such a
>difference for _equal_ strength.

Empirically, your answer is mostly right.

However, you can never have 100% certainty in calculations of this nature.

You can be 67% sure.

You can be 95% sure.

You can be 99% sure.

You can be 99.999999999% sure.

But you cannot achieve 100%, because that would require an infinite number of
games or some other absurd boundary condition.
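A quick illustration of these confidence levels (my own sketch, not from the post): under a normal approximation to the match score, the same winning percentage becomes more convincing as the number of games grows, but the confidence only approaches 100% asymptotically.

```python
import math

def match_confidence(wins, games):
    """Approximate confidence that the match winner is truly stronger,
    using a normal approximation to the binomial score distribution
    (draws are ignored here -- a deliberate simplification)."""
    p = wins / games
    se = math.sqrt(p * (1 - p) / games)   # standard error of the score
    z = (p - 0.5) / se                    # distance from 'equal strength'
    # one-sided normal CDF, expressed via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# The same 60% score is far more convincing over 1000 games than over 20:
for n in (20, 100, 300, 1000):
    print(n, round(match_confidence(round(0.6 * n), n), 4))
```

Note that the returned confidence never reaches exactly 1.0 for any finite number of games, which is the point above.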

Your assertion that the mean figure is uncertain is correct.  Especially when
the band is wide, the true position of the mean is fuzzy.  Even the very
best SSDF figures have a +/- of approximately 44 Elo (with over 1000 games
played).  Therefore, if two programs are close, it is really irresponsible to
say that one is better than the other one.
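As a rough cross-check (my own sketch, not the SSDF's published method), pushing the binomial standard error of the score percentage through the slope of the logistic Elo curve reproduces the list's +/- figures quite closely; the "44 Elo" above then reads naturally as the full width of a roughly +/-22 band.

```python
import math

def elo_error_bar(games, score, z=1.96):
    """Approximate one-sided 95% Elo error bar for a score fraction over
    n games: binomial standard error times the derivative of the logistic
    Elo curve, Elo = -400*log10(1/score - 1).  Draws are ignored, which
    slightly overstates the variance."""
    se_score = math.sqrt(score * (1 - score) / games)
    delo_dscore = 400 / (math.log(10) * score * (1 - score))
    return z * se_score * delo_dscore

# Shredder 7.04 in the list below: 781 games at 75%, listed as +28/-26
print(round(elo_error_bar(781, 0.75)))   # -> 28
```

The bar shrinks only with the square root of the number of games, which is why even 1000+ games still leave a band dozens of Elo points wide.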

Given this data (columns: rank, program and hardware, rating, +, -, games
played, score percentage, average opponent rating):
1 Shredder 7.04 UCI 256MB Athlon 1200 MHz  2812 28 -26 781 75% 2623
2 Junior 8.0 256MB Athlon 1200 MHz  2784 32 -30 545 68% 2648
3 Shredder 7.0 256MB Athlon 1200 MHz  2771 27 -25 801 70% 2623
4 Deep Fritz 7.0 256MB Athlon 1200 MHz  2760 26 -25 778 67% 2635
5 Fritz 8.0 256MB Athlon 1200 MHz  2753 24 -23 937 65% 2641
6 Deep Junior 8.0 256MB Athlon 1200 MHz  2748 35 -34 432 64% 2645
7 Hiarcs 9.0 256MB Athlon 1200 MHz  2746 51 -48 209 63% 2653
8 Fritz 7.0 256MB Athlon 1200 MHz  2744 29 -28 634 63% 2653
9 Shredder 6.0 Pad UCI 256MB Athlon 1200  2724 22 -22 1033 62% 2640
10 Shredder 6.0 256MB Athlon 1200 MHz  2723 31 -30 547 62% 2634
11 Chess Tiger 15.0 256MB Athlon 1200 MHz  2719 25 -24 824 60% 2647
12 Chess Tiger 14.0 CB 256MB Athlon 1200  2718 30 -30 557 61% 2639
13 Deep Fritz 256MB Athlon 1200 MHz  2716 30 -29 571 61% 2641
14 Gambit Tiger 2.0 256MB Athlon 1200  2713 29 -29 583 58% 2654
15 Shredder 7.0 UCI 128MB K6-2 450 MHz  2703 40 -39 316 58% 2648
16 Junior 7.0 256MB Athlon 1200 MHz  2699 25 -25 801 54% 2670
17 Hiarcs 8.0 256MB Athlon 1200 MHz  2684 22 -22 996 53% 2663
18 Ruffian 1.0.1 256MB Athlon 1200 MHz  2678 29 -29 565 53% 2657
19 Rebel Century 4.0 256MB Athlon 1200 MHz  2676 29 -29 590 60% 2605
20 Chess Tiger 15.0 128MB K6-2 450 MHz  2664 34 -33 433 52% 2648
21 Shredder 5.32 256MB Athlon 1200 MHz  2660 26 -26 713 51% 2654
22 Gandalf 4.32h 256MB Athlon 1200 MHz  2658 31 -31 514 53% 2635
23 Deep Fritz 7.0 128MB K6-2 450 MHz  2650 36 -35 392 54% 2623
23 Gandalf 5.0 256MB Athlon 1200 MHz  2650 45 -46 242 44% 2693
25 Gandalf 5.1 256MB Athlon 1200 MHz  2639 25 -25 758 55% 2605
26 Junior 7.0 128MB K6-2 450 MHz  2638 22 -22 1030 59% 2573
27 Fritz 7.0 128MB K6-2 450 MHz  2632 38 -37 348 53% 2610
27 Chess Tiger 14.0 CB 128MB K6-2 450 MHz  2632 25 -25 798 57% 2581
29 Shredder 6.0 UCI 128MB K6-2 450 MHz  2617 43 -43 264 52% 2606
30 Crafty 18.12/CB 256MB Athlon 1200 MHz  2615 27 -27 647 52% 260

Even though Crafty 18.12 is 2812-2615 = 197 Elo below Shredder, and the
confidence bands clearly do not overlap, there is still some absurdly remote
chance that Crafty would be better (maybe 1e-20 or something -- just a silly
guess). But for all practical purposes we can say that Shredder 7.04 is
stronger "under the exact conditions of the experiment".
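That "absurdly remote chance" can be put on a number with a two-sample z-test, under the assumption (mine, and a shaky one: SSDF programs share opponent pools, so the errors are correlated) that the listed +/- values are independent 95% intervals:

```python
import math

def prob_weaker_is_stronger(elo_a, bar_a, elo_b, bar_b):
    """Probability that program B (the lower-rated one) is actually
    stronger than A, given ratings and symmetric 95% error bars,
    assuming independent, normally distributed rating errors."""
    se_a = bar_a / 1.96        # convert a 95% bar back to a standard error
    se_b = bar_b / 1.96
    z = (elo_a - elo_b) / math.hypot(se_a, se_b)
    # upper-tail normal probability via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))

# Shredder 7.04 (2812, ~+/-27) vs Crafty 18.12 (2615, ~+/-27):
print(prob_weaker_is_stronger(2812, 27, 2615, 27))   # on the order of 1e-24

# By contrast, Junior 8.0 vs Shredder 7.0: 13 points apart, bands overlap:
print(prob_weaker_is_stronger(2784, 31, 2771, 26))   # roughly 0.26
```

So the honest answer lands near the guessed magnitude for Shredder vs. Crafty, while for the top two programs the data genuinely cannot separate them.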

Now, if we put Crafty on a 64-bit, 64-CPU NUMA machine running at 3 GHz with a
terabyte of RAM, then all bets are off.
;-)




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.