Computer Chess Club Archives


Subject: The Validity of CC Testresults - Take my Word for that one!

Author: Rolf Tueschen

Date: 10:41:20 01/20/06



On January 20, 2006 at 11:51:48, Uri Blass wrote:

>On January 20, 2006 at 05:28:47, Rolf Tueschen wrote:
>
>>On January 20, 2006 at 04:58:11, enrico carrisco wrote:
>>
>>>On January 20, 2006 at 03:14:09, Mike Byrne wrote:
>>>
>>>>http://www.chessolympiad-torino2006.org/eng/index.php?cav=1&dettaglio=309
>>>>
>>>>good stuff...
>>>
>>>Yea -- he even cited the "Anti-computer chess expert" Pablo Ignacio Restrepo.
>>>What more would we need?
>>>
>>>-elc.
>>
>>Yes, this, and also the point that not everything quoted by a GM, here GM
>>Golubev, is automatically on a par with Newton's paper on the law of
>>gravitation or Einstein's paper on relativity. It's more or less bogus. I want
>>to add a single item so that my opinion doesn't look cheap and arbitrary.
>>
>>The CEGT test guys are mentioned (some 15 persons, I think), and it sounds as
>>if they were a sort of institution for certain questions in CC, comparable to
>>what we meant when we spoke of "the new SSDF list" in the 90s. The problem
>>begins if I question that Rybka is already proven to be the strongest engine
>>today. Then people tell me to look at CEGT, where that has been proven... This
>>was a few days ago here in CCC. I must object to that sort of hubris. The truth
>>is that we don't have statistical methods for making such claims. Even after
>>700 or maybe over 1000 games the significance is not so sure, and if you look
>>at the +/- boundaries of the so-called Elo results you still have overlaps, and
>>you can't say that Rybka is the clear first. Nothing against the testers of
>>CEGT. The presentation of the results is nice. The games download is also well
>>organised. But all that can't hide the fact that certain statistical
>>requirements must be respected if one wants to make clear statements.
>>
>>We are all too human. In a world of huge uncertainties and big problems
>>overall, we feel the need to do something for our well-being in such a hobby.
>>Where, if not there, could we find our peace of mind? We can test. We can
>>create a whole network of testers. But if we then want to make clear
>>statements, alas, we all stand under the steel-hard laws of stats. And
>>basically we can't get what we want to have. We are bound to believe in our
>>private preferences. We can also assume that right now, for a short time, Rybka
>>is "certainly" looking like a very strong engine. But everything beyond that
>>would be bogus. We should all keep that in mind.
>>
>>The development in CC is always moving. There is no such thing as the best
>>engine of all time for the next 10 years. If I got the newest supercomputers of
>>the US military, it could well be that I would become the next World Champion
>>with Gullydeckel, to give an absurd example, or with my personal shooting star
>>The Roaring Thunder, which was developed in my kitchen for the next WCCC in
>>Torino... I digress a little bit.
>
>Here are the CEGT single processor results
>
>I ignore single processor result

It struck me as rather importunate when I read today the campaign by
Simon/Pittlik? and Lagershausen, and when I read your lecture here, dear Uri, I
became quite sure that it's impossible to tell people the complex truth if they
are used to believing in simple truths. I have learned over a long time how
careful one should be in statistics. Honestly, Uri, what you are doing here is
not allowed. You can't take a list of results, simply remove certain entries,
and THEN compare ratings that were computed with those entries included. That is
your first crass mistake. Of course I also know that you can't simply compare
1-processor with 2-processor progs. And that wasn't at all what I was trying to
do.


>
>You can see that single-processor programs are rated below 2800, while even the
>32-bit version of Rybka has a rating above 2815 and the top 64-bit version is
>even above 2850.
>
>No overlap
>
>Rank  Program                        Elo    +    -  Games   Score  Av.Op.   Draws
>   1  Rybka 1.01 Beta 9 64-bit opt  2921   73   68     71   80.3%    2677   33.8%
>   2  Rybka 1.0 Beta 64-bit         2859   21   21    765   68.4%    2725   32.7%
>   4  Rybka 1.0 Beta 32-bit         2825   10   10   3575   68.9%    2687   31.0%
>   6  Fruit 2.2.1                   2786    8    8   5035   66.0%    2671   33.1%
>   7  Fritz 9                       2782   11   11   2724   62.8%    2691   30.2%
>   9  TogaII 1.1a                   2772   14   14   1560   60.3%    2699   36.3%
>  10  Hiarcs 10 Hypermodern         2771   22   22    644   53.3%    2749   35.7%
>
>The only CEGT entry that in theory could exceed 2800 on one CPU is Deep
>Fritz 8, but Deep Fritz 8 on 2 CPUs is below 2800, and it is illogical to expect
>Deep Fritz 8 on one CPU to be rated higher than that.
>
>Rank  Program                   Elo    +    -
>   8  Deep Fritz 8 2CPU 512MB  2772   14   14
>  15  Deep Fritz 8 1CPU        2754  107  104
>
>The fact that in some of the other lists Rybka is number 1 without a
>significant enough advantage probably also increases the certainty that Rybka
>is the best engine, because the probability that something that is not the best
>gets first place in every serious list is very small.
>


Let's come to the second crass mistake in your arguments. You see Rybka's first
place, as I do, and you conclude that this in itself counts as proof. That is
already the mistake, because you conclude that first place as such means
greatest strength. NB: with stats you measure, and then you claim that your
measurement is valid because you kept everything of importance under control. I
simply object that this is wrong for the current situation because, as I have
already debated with Bob Hyatt, Rybka currently holds the initiative while all
the others must react now or tomorrow. What the results show is the
improvements of Rybka against unchanged older progs. And I claim, without great
risk, that any strong program will gain an advantage if the others couldn't
react yet.

So let me now combine the two arguments. I have the overlap (or, as you put it,
the not completely proven overlap, or the absence of clear overlap), and then I
have the argument of older progs against a completely new and smartly designed
creation. On the basis of these two aspects I can't see a clear number one
position for Rybka, whether 32 or 64. NB: at face value, yes, Rybka is leading
everywhere. And it has deserved the fame! Noticed how far I go? Of course I
respect Vasik's performance. But I say that the advantage is NOT so huge that
all the others could stop programming because they have no chance against
Rybka. No way.



>It is also wrong to combine errors of 2 programs because the error of the
>difference is smaller than the sum of the errors.

That has nothing to do with what I did. I took the negative error of the one and
the positive error of the other, compared the two values, and saw that they fall
into the same range. I don't calculate, it's more geometry. :)
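
The "geometric" comparison described here can be sketched in a few lines of Python. This is only an illustration, assuming the two numbers after each Elo value in the quoted list are the plus and minus margins; the function names are made up for the example, and only the single-CPU rows quoted in this post are used, so other lists may give a different picture.

```python
# Rows copied from the list quoted in this post:
# (program, Elo, plus margin, minus margin).
# Assumption: the two numbers after the Elo are the +/- margins.
ROWS = [
    ("Rybka 1.0 Beta 32-bit", 2825, 10, 10),
    ("Fruit 2.2.1", 2786, 8, 8),
    ("Fritz 9", 2782, 11, 11),
    ("Hiarcs 10 Hypermodern", 2771, 22, 22),
]


def interval(elo, plus, minus):
    """Return the (low, high) range implied by a rating and its margins."""
    return (elo - minus, elo + plus)


def overlaps(a, b):
    """True if the two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Compare the first row against every other row.
rybka = interval(*ROWS[0][1:])
results = {
    name: overlaps(rybka, interval(elo, plus, minus))
    for name, elo, plus, minus in ROWS[1:]
}

for name, hit in results.items():
    print(f"{name}: {'overlap' if hit else 'no overlap'}")
```

The check is deliberately crude: it only asks whether the two ranges touch and says nothing about how likely the extreme ends of each range are.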


>
>If you have 50 Elo error in 2 lists then the error of the difference is 70-71
>Elo (50*sqrt(2)) and not 100 Elo.
>
>Basically, if the errors are a and b then I think that you can use the square
>root of a^2+b^2 for the error of the difference.


Uri, I know enough about maths to know how carefully one should examine the
possible calculations. You can begin to calculate like hell, but usually it's
more prudent to reflect on what should be done. And I ask you to get real: if
prog A has 2650 with an interval of +/- 20 and B has Elo 2680 with +/- 30, then
we both should continue like this: A could have 2670 if it's very lucky, and B,
if it has fallen into a dark hole, has only 2650. So from these numbers I can
conclude, with your agreement I assume, that A could still be stronger than B.
We don't need to calculate big, and we certainly don't do squares or other
hyperbolas. And what we have debated just now is what makes the SSDF so boring,
and now it has become the reason for our misunderstandings. What we had here is,
BTW, watertight maths, up to the level of mathematical professors, believe me.
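
For readers who want to see the two approaches in this exchange side by side, here is a small Python sketch using the A/B numbers above and the 50/50 case from the quoted text. The function names are invented for the illustration, and the square-root rule is the usual one for independent errors, which is an assumption about the rating lists rather than something established in this thread.

```python
import math


def worst_case_gap(elo_a, err_a, elo_b, err_b):
    """Lowest value of B minus highest value of A (the interval comparison).

    A result of zero or less means the two ranges touch or overlap.
    """
    return (elo_b - err_b) - (elo_a + err_a)


def quadrature_error(err_a, err_b):
    """Error of the difference as sqrt(a^2 + b^2).

    Assumption: the two errors are independent.
    """
    return math.hypot(err_a, err_b)


# Example from the post: A = 2650 +/- 20, B = 2680 +/- 30.
gap = worst_case_gap(2650, 20, 2680, 30)   # 2650 - 2670
diff = 2680 - 2650                         # plain rating difference
combined = quadrature_error(20, 30)        # sqrt(400 + 900)

print(f"interval comparison: gap = {gap}")
print(f"difference = {diff}, combined error = {combined:.1f}")

# The 50/50 case from the quoted text: about 70.7, not 100.
print(f"two errors of 50: {quadrature_error(50, 50):.1f}")
```

For these particular numbers both approaches say the same thing: the ranges overlap, and the rating difference of 30 is smaller than the combined error of about 36. The two can disagree when the ranges overlap only slightly.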


>
>Uri




Last modified: Thu, 15 Apr 21 08:11:13 -0700
