Computer Chess Club Archives


Subject: Re: an example how users - not programmers - use tests

Author: David Dahlem

Date: 12:34:39 06/20/04


On June 20, 2004 at 12:34:43, Rolf Tueschen wrote:

>On June 20, 2004 at 12:10:57, David Dahlem wrote:
>
>>I've seen numerous examples of one engine solving a test suite position in a few
>>seconds, while another engine of known equal game playing strength never finds
>>the solution, even after hours of analysis. To me, this makes test suites
>>worthless, or at least makes the results very difficult to interpret.
>>
>>Regards
>>Dave
>
>
>Yes, correct. This is what is called the lack of reliability of the results, as
>Sandro explained. It's a typical flaw of these position tests, and everyone who
>knows testing is aware of it. The question, however, is how to explain that
>triviality to laymen and motivated users, and to a founder with a blind spot,
>especially one who loses himself in the circular argument that every critic
>should first run the test suite because they would THEN realize how good it is.
>You know from the chess quality of these positions on...! I can only repeat
>this: a famous CC journal and a whole team of forum mods who don't want to
>"hurt" a test founder, and so tolerate that he loses himself in such a circle,
>are mainly responsible for that mess. That someone, even a scientist, _can_ go
>wrong and can't realize it is not such a rare event. It doesn't mean that he's
>bad or not intelligent or anything like that. Sometimes you have this "wall" in
>your head, and you can't find a brick. Later you break out into laughter and
>wonder why you couldn't see it. Here, in our case, the main founder is a
>Russian academic doctor who has certainly learned the basics of scientific
>reasoning. Therefore he will understand in the end the difference between
>testing the end product and testing a prototype. He also knows these two
>obstacles, namely validity and reliability. And he should know that statistical
>calculation can never "create" significance if it's not in the data.
>
>I also think that we must change a couple of terms. When users play with their
>engines and run them through 100 positions, this can't be called "testing"! It
>looks like testing, but it is not.

The best way, and in my opinion the only way, to test engine strength is actual
game play. Whether an engine plays "better" moves than its opponent, not
necessarily the "best" move, measures its strength more accurately than all the
test suites ever created.
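
As a rough illustration of what measuring strength by game play can look like, here is a minimal sketch that turns a match result into an Elo difference with an approximate error margin. It assumes the usual logistic Elo model and a normal approximation for the uncertainty of the score. The function names (score_to_elo, match_elo) and the example numbers are made up for illustration and do not come from any engine or rating tool.

```python
import math


def score_to_elo(score):
    """Convert a score fraction in (0, 1) to an Elo difference (logistic model)."""
    eps = 1e-6  # clamp to avoid dividing by zero at 0% or 100%
    score = min(max(score, eps), 1.0 - eps)
    return -400.0 * math.log10(1.0 / score - 1.0)


def match_elo(wins, draws, losses, z=1.96):
    """Return (low, estimate, high) Elo difference for one match.

    The interval uses a normal approximation on the per-game score,
    so it is only a rough guide, especially for short matches.
    """
    n = wins + draws + losses
    if n == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / n
    # Variance of a single game result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * score ** 2) / n
    se = math.sqrt(var / n)  # standard error of the mean score
    return (score_to_elo(score - z * se),
            score_to_elo(score),
            score_to_elo(score + z * se))


if __name__ == "__main__":
    # Hypothetical result: 60 wins, 20 draws, 20 losses.
    low, est, high = match_elo(60, 20, 20)
    print(f"Elo difference: {est:+.0f} (roughly {low:+.0f} to {high:+.0f})")
```

The width of the interval shows how many games are needed before a difference between two engines means anything. With only a handful of games the margin is usually far wider than the gap being measured.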

Regards
Dave



Last modified: Thu, 15 Apr 21 08:11:13 -0700
