Author: Dann Corbit
Date: 11:46:38 01/23/04
On January 23, 2004 at 10:12:00, Rolf Tueschen wrote:

>We just had a little dispute about an old topic. When can we say that a prog
>is better than another? How can we proceed to make sound arguments?
>
>Let me tell the story in fast mode.
>
>There was a test. I understand with 300 games or such. An incredibly high
>number of games, because often we have matches with only 20 or 40 games.
>
>I understood further that on the base of a confidence interval of 1-58 we
>have 95%.
>
>Now what I want to tell you, and this is indisputable statistical standard:
>
>If you get a value that is in the interval, we cannot conclude that the
>difference of the two progs is relevant or valid or call it what you want.
>It makes no sense to argue with such "low" differences. They could still be
>on the base of chance. Now the distribution of chance is the Bell curve.
>Nothing else.
>
>We had the debate with the SSDF list often enough.
>
>Two progs stand at the top. One is number one in the ranking. But is it
>really stronger than prog number two???
>
>The answer is easy. If the normal variation, this famous +/- value in the
>SSDF list, is say +/-40 points and the difference between progs is 35 points,
>THEN we are unable to conclude anything for sure. It could be that 1 is
>stronger than 2, but also the contrary could be true. Only from values >40 on
>do we have "certainty", statistically, that a prog in that specific design is
>proven stronger than another one.
>
>This is all so simple and trivial that it is satisfying to be able to
>clarify.
>
>Have fun,
>
>Rolf
>
>P.S.
>
>I just want to correct a heavy mistake in a former posting. There it was said
>for Elo differences that a difference of say 1 Elo point would speak for a
>better strength of one prog over another, and that you needed so and so many
>games to prove that -- this is total nonsense. There is _no_ way to conclude
>anything out of an Elo difference of 1 point, no matter if you have 300 or
>100000 games.
>The difference of 1 Elo point is meaningless. It's nonsense to even think
>about the millions of games necessary to "prove" that. Statistics also has
>something to do with normal human sense. We would always take such a
>difference for _equal_ strength.

Empirically, your answer is mostly right. However, you can never have 100% certainty in calculations of this nature. You can be 67% sure. You can be 95% sure. You can be 99% sure. You can be 99.999999999% sure. But you cannot achieve 100%, because that would require an infinite number of games or some other absurd boundary conditions.

Your assertion that the mean figure is not certain is correct. Especially when the band is wide, the true position of the mean is a fuzzy figure. Even the very best SSDF figures have a +/- of approximately 44 Elo (with over 1000 games played). Therefore, if two programs are close, it is really irresponsible to say that one is better than the other one.

Given this data:

  #  Program                Hardware               Elo    +    -  Games   %   Opp
  1  Shredder 7.04 UCI      256MB Athlon 1200 MHz  2812  28  -26   781  75%  2623
  2  Junior 8.0             256MB Athlon 1200 MHz  2784  32  -30   545  68%  2648
  3  Shredder 7.0           256MB Athlon 1200 MHz  2771  27  -25   801  70%  2623
  4  Deep Fritz 7.0         256MB Athlon 1200 MHz  2760  26  -25   778  67%  2635
  5  Fritz 8.0              256MB Athlon 1200 MHz  2753  24  -23   937  65%  2641
  6  Deep Junior 8.0        256MB Athlon 1200 MHz  2748  35  -34   432  64%  2645
  7  Hiarcs 9.0             256MB Athlon 1200 MHz  2746  51  -48   209  63%  2653
  8  Fritz 7.0              256MB Athlon 1200 MHz  2744  29  -28   634  63%  2653
  9  Shredder 6.0 Pad UCI   256MB Athlon 1200      2724  22  -22  1033  62%  2640
 10  Shredder 6.0           256MB Athlon 1200 MHz  2723  31  -30   547  62%  2634
 11  Chess Tiger 15.0       256MB Athlon 1200 MHz  2719  25  -24   824  60%  2647
 12  Chess Tiger 14.0 CB    256MB Athlon 1200      2718  30  -30   557  61%  2639
 13  Deep Fritz             256MB Athlon 1200 MHz  2716  30  -29   571  61%  2641
 14  Gambit Tiger 2.0       256MB Athlon 1200      2713  29  -29   583  58%  2654
 15  Shredder 7.0 UCI       128MB K6-2 450 MHz     2703  40  -39   316  58%  2648
 16  Junior 7.0             256MB Athlon 1200 MHz  2699  25  -25   801  54%  2670
 17  Hiarcs 8.0             256MB Athlon 1200 MHz  2684  22  -22   996  53%  2663
 18  Ruffian 1.0.1          256MB Athlon 1200 MHz  2678  29  -29   565  53%  2657
 19  Rebel Century 4.0      256MB Athlon 1200 MHz  2676  29  -29   590  60%  2605
 20  Chess Tiger 15.0       128MB K6-2 450 MHz     2664  34  -33   433  52%  2648
 21  Shredder 5.32          256MB Athlon 1200 MHz  2660  26  -26   713  51%  2654
 22  Gandalf 4.32h          256MB Athlon 1200 MHz  2658  31  -31   514  53%  2635
 23  Deep Fritz 7.0         128MB K6-2 450 MHz     2650  36  -35   392  54%  2623
 23  Gandalf 5.0            256MB Athlon 1200 MHz  2650  45  -46   242  44%  2693
 25  Gandalf 5.1            256MB Athlon 1200 MHz  2639  25  -25   758  55%  2605
 26  Junior 7.0             128MB K6-2 450 MHz     2638  22  -22  1030  59%  2573
 27  Fritz 7.0              128MB K6-2 450 MHz     2632  38  -37   348  53%  2610
 27  Chess Tiger 14.0 CB    128MB K6-2 450 MHz     2632  25  -25   798  57%  2581
 29  Shredder 6.0 UCI       128MB K6-2 450 MHz     2617  43  -43   264  52%  2606
 30  Crafty 18.12/CB        256MB Athlon 1200 MHz  2615  27  -27   647  52%  260

Even though Crafty 18.12 is 2812 - 2615 = 197 Elo below Shredder, and the confidence bands clearly do not overlap, there is still some absurdly remote chance that Crafty would be better (maybe 1e-20 or something -- just a silly guess). But for all practical purposes we can say that Shredder 7.04 is stronger "under the exact conditions of the experiment". Now, if we put Crafty on a 64-bit, 64-CPU NUMA machine running at 3 GHz with a terabyte of RAM, then all bets are off. ;-)
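Rolf's point about sample size can be made concrete. Here is a small sketch of my own (not the SSDF's actual methodology): it converts a match score into an Elo difference under the usual logistic model and attaches an approximate 95% interval, treating each game as an independent observation.

```python
import math

def elo_from_score(score):
    # Elo difference implied by an expected score under the logistic model.
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(wins, draws, losses, z=1.96):
    # Approximate 95% confidence interval for the Elo difference
    # implied by a match result (normal approximation to the mean score).
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the score (a win counts 1, a draw 0.5).
    var = (wins + 0.25 * draws) / n - score ** 2
    se = math.sqrt(var / n)
    return (elo_from_score(score - z * se),
            elo_from_score(score),
            elo_from_score(score + z * se))
```

For a 300-game match scored at 55% the interval is dozens of Elo points wide, and it shrinks only with the square root of the number of games -- which is exactly why a 1-Elo-point gap cannot be resolved with any practical number of games.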
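The "absurdly remote chance" above can also be sketched numerically. Assuming the published +/- values are roughly 1.96-sigma bounds of independent normal errors (my assumption for illustration, not a statement of the SSDF's model), the probability that one program's true rating exceeds another's is:

```python
import math

def prob_stronger(rating_a, err_a, rating_b, err_b):
    # P(true rating of A > true rating of B), treating each published
    # +/- value as a 1.96-sigma bound of an independent normal error.
    # This is an illustrative assumption, not the SSDF's actual model.
    sigma = math.hypot(err_a / 1.96, err_b / 1.96)
    z = (rating_a - rating_b) / sigma
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Under that assumption, Shredder 7.04 (2812, +28) versus Crafty 18.12 (2615, +27) gives a probability indistinguishable from 1, while Shredder 7.04 versus Junior 8.0 (2784, +32) comes out around 90% -- above chance, but not settled at the 95% level, which is the whole point of the dispute.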
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.