Author: Roger D Davis
Date: 10:54:34 05/25/05
On May 25, 2005 at 13:31:19, Dann Corbit wrote:

>On May 25, 2005 at 12:58:46, Roger D Davis wrote:
>
>>On May 25, 2005 at 05:35:14, emerson tan wrote:
>>
>>>Is Hydra now stronger than Deep Blue?
>>
>>We know Kasparov, even then, was a much stronger player than Adams is today. If
>>Hydra, supposedly stronger than Deep Blue, loses to a much weaker player, then
>>that provides a strong argument that Hydra is weaker than Deep Blue.
>
>In a short match, anything can happen. The error bar for the Elo calculation
>will be nearly a thousand Elo for a match of this length. In short, it tells us
>almost nothing about who is stronger. It does tell us who is the winner, and
>that is about all.
>
>>On the other hand, if Adams loses, then it says nothing about Hydra's strength
>>relative to Deep Blue.
>
>We get about the same amount of information either way.
>
>>I guess you could always argue that Deep Blue can beat Kasparov and Kasparov can
>>beat Adams and Adams can beat Hydra and Hydra can beat Deep Blue, but it doesn't
>>seem likely. Particularly if Adams can get a convincing score.
>
>In both cases, the experiments are very simple. A single contest utilizing only
>two opponents tells you only about the two combatants relative to each other.
>Consider the SSDF ProDeo/Shredder match going on right now. ProDeo is taking a
>real butt-whupping. But the single contest is not truly indicative of ProDeo's
>strength. It's just a bad matchup for ProDeo.
>
>In a similar vein, you need a broad spectrum of opponents to get a good gauge of
>strength for a chess player (man or machine) in order to make a logical
>judgement about how strong they are.
>
>Consider a contest with 26 games by 14 different programs against each other:
>
>   Program              Elo    +    -  Games  Score   Av.Op.  Draws
> 1 Shredder 9        : 2793  108  138    26   76.9 %   2584   30.8 %
> 2 Gandalf 6.01      : 2760  113  170    26   73.1 %   2586   15.4 %
> 3 Toga II 0.93      : 2688  125  112    26   63.5 %   2592   34.6 %
> 4 List 5.12         : 2662  130  146    26   59.6 %   2594   11.5 %
> 5 Ruffian 2.1.0     : 2636  137   83    26   55.8 %   2596   50.0 %
> 6 Spike 0.9         : 2611  143  101    26   51.9 %   2598   34.6 %
> 7 Deep Sjeng 1.6    : 2586  101  143    26   48.1 %   2600   34.6 %
> 8 Zappa 1.0         : 2574  118  140    26   46.2 %   2601   23.1 %
> 9 Ktulu 7.0         : 2561  104  137    26   44.2 %   2602   34.6 %
>10 Pharaon 3.2       : 2549  112  133    26   42.3 %   2603   30.8 %
>11 Fruit 2.0         : 2536  108  130    26   40.4 %   2604   34.6 %
>12 Yace 0.99.87      : 2523   91  128    26   38.5 %   2605   46.2 %
>13 Patriot 1.3.0     : 2468  143  117    26   30.8 %   2609   23.1 %
>14 LambChop 10.99    : 2453  138  115    26   28.8 %   2610   26.9 %
>
>Notice that the error bars are about 200 Elo wide even with 26 games and with 14
>different opponents.
>
>With a single opponent and nine games, the error bar is 597 Elo:
>
>   Program              Elo    +    -  Games  Score   Av.Op.  Draws
> 1 Rebel12_Cb        : 2640  236  361     9   83.3 %   2360   11.1 %
> 2 Ruffian 1.0.5     : 2360  361  236     9   16.7 %   2640   11.1 %
>
>Essentially, we cannot tell anything important about strength from this match
>except that Rebel12_Cb is more likely to be stronger than Ruffian 1.0.5 than the
>reverse situation. But even that is very tenuous, given the data used to
>compile it.

I think you can use statistics to talk yourself out of almost anything. The
problem with this kind of reasoning (apply error bars and all that) is that it
deconstructs the fun of the contest. Suddenly, there's nothing at stake any
more, because whichever side wins or loses, we can always say we need a larger
sample of games. I guess it depends on what level of alpha you're willing to
adopt in order to calculate the error bars and what your purpose is.

A 95% confidence interval leads to large error bars, but it's useful if you
want to find an effect while holding the chance of a Type I error to 5%. If you
adopt a 99% confidence interval, that leads to even larger error bars. You
might well conclude that very little can be known with 99% certainty. You might
conclude that the top 10 programs are indistinguishable in terms of strength,
when in fact all you can say is "We cannot conclude with 99% certainty that #10
is less strong than #1."

What I said was that an Adams victory creates a strong argument that Hydra is
weaker than Deep Blue. I didn't say it created certainty. Does it jump the
hurdle of a 99% probability? I really don't know. Does it jump the hurdle of a
95% probability? I don't know that either, but I doubt it. But is it stronger
than a coin toss? I think that depends on the magnitude of the victory. If
Hydra gets zip and Adams makes it look easy, then I'm more willing to conclude
that Adams is much stronger, and that Hydra is weaker than Deep Blue. If Adams
wins by half a point, then less is known. Let's just say that as the magnitude
of an Adams victory increases, the more likely it is that Hydra is weaker than
Deep Blue. But we'll never know for sure, and there's no way we can know for
sure.

Roger
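
As a rough illustration of how those error bars shrink with more games, here is
a minimal Python sketch. It uses a simple binomial normal approximation
(treating each game as an independent result and ignoring draw correlation),
and the helper names elo_diff and elo_interval are made up for this example; it
is not the method actually used to produce the quoted SSDF-style tables.

import math

def elo_diff(score):
    """Convert a score fraction (0 < score < 1) into an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(points, games, z=1.96):
    """Rough Elo estimate plus a confidence band from a single match result.

    points - points scored (wins plus half a point per draw)
    games  - number of games played
    z      - z-value for the confidence level (1.96 ~ 95%, 2.58 ~ 99%)
    """
    p = points / games
    se = math.sqrt(p * (1.0 - p) / games)     # binomial standard error
    lo = min(max(p - z * se, 0.001), 0.999)   # clamp away from 0 and 1
    hi = min(max(p + z * se, 0.001), 0.999)
    return elo_diff(p), elo_diff(lo), elo_diff(hi)

# 7.5/9, as in the Rebel12_Cb result above: the point estimate is about
# +280 Elo, but the 95% band runs from roughly +60 to well over +1000 Elo.
print(elo_interval(7.5, 9))

# The same scoring rate over 90 games: still about +280 Elo, but the band
# narrows to roughly +200..+400 Elo.
print(elo_interval(75, 90))

The point of the sketch is only that the width of the band is driven by the
number of games, not by who the opponent happens to be, which is why a short
match settles so little at a 95% or 99% level.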