Computer Chess Club Archives


Subject: Re: is hydra now stronger than deep blue?

Author: Dann Corbit

Date: 11:58:42 05/25/05


On May 25, 2005 at 13:54:34, Roger D Davis wrote:

>On May 25, 2005 at 13:31:19, Dann Corbit wrote:
>
>>On May 25, 2005 at 12:58:46, Roger D Davis wrote:
>>
>>>On May 25, 2005 at 05:35:14, emerson tan wrote:
>>>
>>>>is hydra now stronger than deep blue?
>>>
>>>We know Kasparov, even then, was a much stronger player than Adams is today. If
>>>Hydra, supposedly stronger than Deep Blue, loses to a much weaker player, then
>>>that provides a strong argument that Hydra is weaker than Deep Blue.
>>
>>In a short match, anything can happen.  The error bar for the Elo calculation
>>will be nearly a thousand Elo for a match of this length.  In short, it tells us
>>almost nothing about who is stronger.  It does tell us who is the winner, and
>>that is about all.
>>
>>>On the other hand, if Adams loses, then it says nothing about Hydra's strength
>>>relative to Deep Blue.
>>
>>We get about the same amount of information either way.
>>
>>>I guess you could always argue that Deep Blue can beat Kasparov and Kasparov can
>>>beat Adams and Adams can beat Hydra and Hydra can beat Deep Blue, but it doesn't
>>>seem likely. Particularly if Adams can get a convincing score.
>>
>>In both cases, the experiments are very simple.  A single contest utilizing only
>>two opponents only tells you about the two combatants relative to each other.
>>Consider the SSDF ProDeo/Shredder match going on right now.  ProDeo is taking a
>>real butt-whupping.  But the single contest is not truly indicative of ProDeo's
>>strength.  It's just a bad matchup for ProDeo.
>>
>>In a similar vein, you need a broad spectrum of opponents to get a good gauge of
>>strength for a chess player (man or machine) in order to make a logical
>>judgement about how strong they are.
>>
>>Consider a contest with 26 games by 14 different programs against each other:
>>   Program          Elo    +   -   Games   Score   Av.Op.  Draws
>> 1 Shredder 9     : 2793  108 138    26    76.9 %   2584   30.8 %
>> 2 Gandalf 6.01   : 2760  113 170    26    73.1 %   2586   15.4 %
>> 3 Toga II 0.93   : 2688  125 112    26    63.5 %   2592   34.6 %
>> 4 List 5.12      : 2662  130 146    26    59.6 %   2594   11.5 %
>> 5 Ruffian 2.1.0  : 2636  137  83    26    55.8 %   2596   50.0 %
>> 6 Spike 0.9      : 2611  143 101    26    51.9 %   2598   34.6 %
>> 7 Deep Sjeng 1.6 : 2586  101 143    26    48.1 %   2600   34.6 %
>> 8 Zappa 1.0      : 2574  118 140    26    46.2 %   2601   23.1 %
>> 9 Ktulu 7.0      : 2561  104 137    26    44.2 %   2602   34.6 %
>>10 Pharaon 3.2    : 2549  112 133    26    42.3 %   2603   30.8 %
>>11 Fruit 2.0      : 2536  108 130    26    40.4 %   2604   34.6 %
>>12 Yace 0.99.87   : 2523   91 128    26    38.5 %   2605   46.2 %
>>13 Patriot 1.3.0  : 2468  143 117    26    30.8 %   2609   23.1 %
>>14 LambChop 10.99 : 2453  138 115    26    28.8 %   2610   26.9 %
>>
>>Notice that the error bars are about 200 Elo wide even with 26 games and with 14
>>different opponents.
>>
>>With a single opponent and nine games, the error bar is 597 Elo:
>>
>>   Program          Elo    +   -   Games   Score   Av.Op.  Draws
>> 1 Rebel12_Cb     : 2640  236 361     9    83.3 %   2360   11.1 %
>> 2 Ruffian 1.0.5  : 2360  361 236     9    16.7 %   2640   11.1 %
>>
>>Essentially, we cannot tell anything important about strength from this match
>>except that Rebel12_Cb is more likely to be stronger than Ruffian 1.0.5 than the
>>reverse situation.  But even that is very tenuous, given the data used to
>>compile it.
>
>
>I think you can use statistics to talk yourself out of most anything. The
>problem with this kind of reasoning---apply error bars and all that---is that it
>deconstructs the fun of the contest. Suddenly, there's nothing at stake any
>more, because whichever side wins or loses, we can always say we need a larger
>sample of games.
>
>I guess it depends on what level of alpha you're willing to adopt in order to
>calculate the error bars and what your purpose is. A 95% confidence interval
>leads to large error bars, but it's useful if you want to find an effect, while
>holding the chance of a Type 1 error to 5%.
>
>If you adopt a 99% confidence interval, then that leads to even larger error
>bars. You might well conclude that very little can be known with 99% certainty.
>You might conclude that the top 10 programs are indistinguishable in terms of
>strength, when in fact all you can say is "We cannot conclude with 99% certainty
>that #10 is less strong than #1."
>
>What I said was that an Adams victory creates a strong argument that Hydra is
>weaker than Deep Blue. I didn't say it created certainty. Does it jump the
>hurdle of a 99% probability? I really don't know. Does it jump the hurdle of a
>95% probability? I don't know that either, but I doubt it.
>
>But is it stronger than a coin toss? I think that depends on the magnitude of
>the victory. If Hydra gets zip and Adams makes it look easy, then I'm more
>willing to conclude that Adams is much stronger, and that Hydra is weaker than
>Deep Blue. If Adams wins by 1/2 point, then less is known. Let's just say that
>as the magnitude of an Adams victory increases, the more likely it is that
>Hydra is weaker than Deep Blue.
>
>But we'll never know for sure, and there's no way we can know for sure.

You are right that a dominating performance by either side is more powerful
evidence than a nearly equal contest.

But even a 7-0 shellacking is not proof that the winning side is stronger.

I will always call for more data until the thing is proven beyond a reasonable
doubt.  There will always be some error bar, and we may never collect enough
data to know for sure.  So in cases like that (in my opinion) it is better to
say that we are not sure than to pretend that we are sure.

If we had 1000 games of Adams versus Hydra, and one were 100 Elo stronger, then
we would know it.
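
To put rough numbers on that, here is a minimal Python sketch.  It assumes the
standard logistic Elo formula and a plain normal approximation for the match
score.  The function names (elo_diff, expected_score, elo_interval) and the 30%
draw rate are illustrative assumptions, not taken from the rating tools that
produced the tables quoted above, so the bounds will not match those tables
exactly.

```python
import math


def elo_diff(score):
    """Elo difference implied by a score fraction strictly between 0 and 1."""
    return 400.0 * math.log10(score / (1.0 - score))


def expected_score(diff):
    """Expected score fraction for a player who is `diff` Elo stronger."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def elo_interval(score, games, draw_rate=0.0, z=1.96):
    """Approximate confidence interval for the Elo difference.

    Uses a normal approximation on the score fraction (z=1.96 is about 95%).
    This is crude for very short or very lopsided matches.
    """
    wins = score - draw_rate / 2.0
    losses = 1.0 - wins - draw_rate
    if wins < 0 or losses < 0:
        raise ValueError("draw_rate is inconsistent with score")
    # Per-game variance of a result in {0, 0.5, 1}: E[x^2] - E[x]^2.
    variance = wins + 0.25 * draw_rate - score ** 2
    margin = z * math.sqrt(variance / games)
    # Clamp so the logarithm stays defined when the bound reaches 0% or 100%.
    low = min(max(score - margin, 1e-6), 1.0 - 1e-6)
    high = min(max(score + margin, 1e-6), 1.0 - 1e-6)
    return elo_diff(low), elo_diff(high)


# An 83.3% score (7.5 out of 9), as in the nine-game table quoted above.
print(round(elo_diff(7.5 / 9)))  # 280

# A nine-game match that ends level versus 1000 games with a true 100 Elo gap.
short = elo_interval(0.5, games=9, draw_rate=0.30)
long_ = elo_interval(expected_score(100), games=1000, draw_rate=0.30)
print("9 games:    %.0f to %.0f" % short)
print("1000 games: %.0f to %.0f" % long_)
```

Under these assumptions the nine-game interval spans several hundred Elo and
includes zero, while the 1000-game interval is a few tens of Elo wide and stays
well clear of zero.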

We'll probably never get those games, which means we will never know the answer
for sure.  But that's not all bad.  There are lots and lots of questions we do
not have the answer for.  Trying to answer (or at least clarify) a few of them
is always a worthwhile goal.


