Computer Chess Club Archives


Subject: Re: an example how users - not programmers - use tests

Author: Rolf Tueschen

Date: 09:53:50 06/20/04


On June 20, 2004 at 12:39:54, Uri Blass wrote:

>On June 20, 2004 at 12:34:43, Rolf Tueschen wrote:
>
>>On June 20, 2004 at 12:10:57, David Dahlem wrote:
>>
>>>I've seen numerous examples of one engine solving a test suite position in a few
>>>seconds, while another engine of known equal game playing strength never finds
>>>the solution, even after hours of analysis. To me, this makes test suites
>>>worthless, or at least makes their results very difficult to interpret.
>>>
>>>Regards
>>>Dave
>>
>>
>>Yes, correct. This is what is called the lack of reliability of the results, as
>>Sandro explained. It's a typical flaw of these position tests, and everyone who
>>knows testing is aware of it. The question is how to explain that triviality to
>>laypeople, to motivated users, and to a founder with a blind spot? Especially
>>one who loses himself in the circular argument that every critic should first
>>run the test suite, because THEN they would realize how good it is, you know,
>>from the chess quality of these positions...! I can only repeat this: a famous
>>CC journal and a whole team of forum mods who don't want to "hurt" a test
>>founder, and who therefore tolerate that he loses himself in such a circle,
>>bear the main responsibility for that mess. That someone, even a scientist,
>>_can_ go wrong and be unable to see it himself is not such a rare event. It
>>doesn't mean that he's bad or unintelligent or anything like that. Sometimes
>>you have this "wall" in your head and you can't find a way through it. Later
>>you break out into laughter and wonder why you couldn't see it. In our case the
>>main founder is a Russian academic doctor who has certainly learned the basics
>>of scientific reasoning. Therefore he will understand in the end the difference
>>between testing the end product and testing a prototype. He also knows the two
>>classic obstacles, namely validity and reliability. And he should know that
>>statistical calculation can never "create" significance that is not in the
>>data.
>>
>>I also think that we must change a couple of terms. When users play with their
>>engines and run them through 100 positions, this can't be called "testing"! It
>>looks like testing, but it isn't.
>
>It's testing.
>
>It tests exactly which engine performs best in the 100 positions that you
>choose.
>Which engine is best is a different question.
>
>Uri


Now you have fallen into a trap, Uri. Of course you can see who does "best" or
"not so good", but you can't know what that means!! Programmers know that you
need only tweak your code a bit and, whoopee, your program becomes better and
better at these positions - BUT - BUT - BUT the playing strength does not
increase at the same time. Worse, you know that if you keep tweaking in that
direction, the playing strength _decreases_! Would you call that a "good"
result???
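
To make concrete what such a position run actually measures, here is a minimal
sketch of a suite runner in Python using the python-chess library. The engine
path, the EPD file name and the time limit are placeholders, not anyone's real
setup; the point is only that the loop counts "solved" positions, and that
count is exactly the number a programmer can push upward by tuning toward the
suite.

import chess
import chess.engine

def run_suite(engine_path, suite_file, seconds=5.0):
    """Count how many EPD positions the engine 'solves' (move matches bm)."""
    solved = total = 0
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        with open(suite_file) as f:
            for line in f:
                epd = line.strip()
                if not epd:
                    continue
                board = chess.Board()
                ops = board.set_epd(epd)        # parses the position and opcodes
                best_moves = ops.get("bm", [])  # "bm" opcode: list of best moves
                if not best_moves:
                    continue
                total += 1
                result = engine.play(board, chess.engine.Limit(time=seconds))
                if result.move in best_moves:
                    solved += 1
    finally:
        engine.quit()
    return solved, total

solved, total = run_suite("/path/to/engine", "suite.epd")  # placeholders
print(f"solved {solved} of {total}")

A score from this loop only says how often the engine's chosen move matched the
suite's "bm" move under this time limit - nothing more.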

The users who run these positions with their *ready-made* products seem to be
researching and testing, and they can even produce a ranking, but it has almost
nothing to do with playing strength in tournaments. If SHREDDER then lands in
the top ranks, this is NOT because the test is good but because the tested
SHREDDER was only released in the first place because it played "well". So this
result is nothing "new". Hence you can't call it "testing".
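
And to illustrate the significance point from my quoted post above, here is a
minimal sketch using a plain two-proportion z-test, standard library Python
only. The counts of 75 and 68 solved out of 100 are invented numbers for
illustration, not anyone's real results.

import math

def two_prop_z(k1, k2, n):
    """z-statistic for the difference of two proportions k1/n and k2/n,
    using the pooled standard error."""
    p1, p2 = k1 / n, k2 / n
    pooled = (k1 + k2) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    return (p1 - p2) / se

z = two_prop_z(75, 68, 100)  # hypothetical suite scores
print(f"z = {z:.2f}")        # about 1.10

A gap of seven solved positions out of 100 gives a z of about 1.10, well below
the roughly 1.96 needed for significance at the 5% level. So even a ranking
that looks clear after such a run can be pure noise; the statistics cannot
"create" significance that is not in the data.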


