Computer Chess Club Archives

Subject: Re: Testmethods for n=0, n=1 and n=>800 - For Beginners and 'old Hands'

Author: Rolf Tueschen
Date: 09:55:56 09/13/02
On September 13, 2002 at 12:26:33, Joachim Rang wrote:

>On September 13, 2002 at 10:54:27, Rolf Tueschen wrote:
>
>>On September 13, 2002 at 10:39:48, Joachim Rang wrote:
>>
>>>I disagree:
>>>
>>>If you get a result of 52-48 you can't say which engine is better, but if you
>>>get a result of 5200-4800 you can say with at least 99% probability that
>>>program A performs better against program B (which doesn't mean that program A
>>>performs better than B against other programs).
>>
>><smile>
>>
>>And you were sure that A is "better" than B?
>>
>>But I went too far. I ask you: Are you _sure_ that in computer chess, with its
>>many uncontrolled variables, you then know the better performance with 99%?
>>
>>Prove it. But please not just by looking it up in the tables in books on
>>statistics. Also explain why you are allowed to use those specific tables. Have
>>you made the necessary checks? Do you have all variables under control? (etc.)
>>
>>Rolf Tueschen
>
>
>What's your point? I'm not sure, but I can assume with 99% probability that
>program A performs better against program B. And if I test program A against all
>other programs N, with similar results, I can assume with 99% probability that
>program A is _better_ than the other programs. Maybe I'm wrong and it's only 95%
>probability or maybe only 90%, but in either case I get a high probability.
>
>Well, I can't prove that, but what are the indications that one can't assume
>these probabilities?

Easy one.

I distinguish between the bare numbers in results or statistics and their real
meaning, which also requires that all the laws of, say, statistics are
respected. Otherwise even the best routine makes no sense.
Just to give a single example: if there were some bias between the versions, the
question could be why the "better" program won by only 400 points in your
example with almost 10,000 games. Did learning prevent the defeat? What if
learning was a completely uncontrolled bias between the two programs?
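
For reference, the arithmetic behind the 99% in the quoted example can be
sketched as follows. This is only a minimal sketch, assuming every game is an
independent trial with no draws and a fixed win probability, which are exactly
the kind of assumptions being questioned here. The function names are
illustrative and not taken from any test suite.

```python
import math

def z_score(wins_a, wins_b):
    """Standard deviations by which A's wins exceed an even split.

    Assumption: games are independent, decisive (no draws), and under the
    null hypothesis each side wins with probability 0.5.
    """
    n = wins_a + wins_b
    return (wins_a - n / 2) / math.sqrt(n / 4)

def one_sided_p(z):
    """Normal-approximation probability of a lead at least this large by chance."""
    return 0.5 * math.erfc(z / math.sqrt(2))

for wins_a, wins_b in ((52, 48), (5200, 4800)):
    z = z_score(wins_a, wins_b)
    print(f"{wins_a}-{wins_b}: z = {z:.2f}, one-sided p = {one_sided_p(z):.6f}")
```

Under these assumptions 52-48 is 0.4 standard deviations from an even score and
5200-4800 is 4.0. The calculation says nothing about whether the assumptions
hold for the test that produced the numbers.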

In short, all these questions must be considered before you start your test, or
at least before you draw any conclusions from your results. Merely multiplying
N cannot bring you closer to the desired result and could be a big waste of
time. It is not OK to assume: "Now I have played 40 games, but I could play 400
and then I would have certainty," or: "If I had the time to run a test with
5000 games, I would be the best tester in the world." My point is that this is
pure nonsense. That is why I have always read with reservation when SSDF
proudly mentioned their 20 or 60 thousand games over these two decades. This
number alone means nothing at all.

If I like cherries, a complete list of vegetables is no help to me, especially
if there are many beans again this season. I'm looking for cherries. Cherries
have several qualities I like: colour, scent, taste, size, to name just a few.
So what does it mean when SSDF tests strength and the leading programs come out
almost the same? There is too much bias in SSDF. Much better would be
judgements like: JUNIOR plays inspired chess by sacrificing and exploiting the
chaos. FRITZ is deepest (if that were true). Or: XY has the best learning
feature. Does SSDF or anybody else research such questions? Of course not. And
here I am on the side of certain critics.

Suddenly the autoplayer was invented, and without further thought SSDF decided
that this was a terribly good idea. But as we know, concentrating on mere
autoplaying leads to nothing. The resulting differences are not significant,
and quality is not being tested at all. I could continue like this for weeks,
but the famous Nuremberg funnel is not the best method of teaching. Let's see
what the debate brings.
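
The point that multiplying N alone does not bring you closer to the wished
result can be illustrated with a small simulation. This is a hedged sketch, not
a model of any real test: it assumes two programs of equal true strength, no
draws, and a hypothetical uncontrolled bias (for example a learning feature)
that shifts A's chance per game by 0.02. All names and numbers are invented for
illustration.

```python
import math
import random

def simulate_score(n_games, p_true=0.5, bias=0.02, seed=1):
    """Fraction of games A wins when an uncontrolled bias shifts its chances.

    Assumption: p_true is A's real win probability and bias is a hypothetical
    systematic error of the test setup. Draws are ignored.
    """
    rng = random.Random(seed)
    p_observed = p_true + bias
    wins = sum(1 for _ in range(n_games) if rng.random() < p_observed)
    return wins / n_games

def standard_error(n_games, p=0.5):
    """Random (sampling) error of the score fraction; shrinks as N grows."""
    return math.sqrt(p * (1 - p) / n_games)

for n in (40, 400, 5000, 100000):
    print(n, round(simulate_score(n), 4), round(standard_error(n), 4))
```

As N grows the random error shrinks, but the score settles near 0.52 rather
than the true 0.50. More games only make the biased value look more certain.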

Rolf Tueschen