Computer Chess Club Archives


Subject: Re: an example how users - not programmers - use tests

Author: Rolf Tueschen

Date: 10:20:56 06/20/04



On June 20, 2004 at 13:10:23, Uri Blass wrote:

>On June 20, 2004 at 12:53:50, Rolf Tueschen wrote:
>
>>On June 20, 2004 at 12:39:54, Uri Blass wrote:
>>
>>>On June 20, 2004 at 12:34:43, Rolf Tueschen wrote:
>>>
>>>>On June 20, 2004 at 12:10:57, David Dahlem wrote:
>>>>
>>>>>I've seen numerous examples of one engine solving a test suite position in a few
>>>>>seconds, while another engine of known equal game playing strength never finds
>>>>>the solution, even after hours of analysis. To me, this makes test suites
>>>>>worthless, or at least makes their results very difficult to interpret.
>>>>>
>>>>>Regards
>>>>>Dave
>>>>
>>>>
>>>>Yes, correct. This is what is called the lack of reliability of the results, as
>>>>Sandro explained. It's a typical flaw of these position tests, and everyone who
>>>>knows testing is aware of it. The question is how to explain that triviality to
>>>>laymen and motivated users, and to a founder with a blind spot, in particular
>>>>one who loses himself in the circular argument that every critic should first
>>>>run the test suite because they would THEN realize how good it is. You know from
>>>>the chess quality of these positions on...! I can only repeat this: a famous CC
>>>>journal and a whole team of forum mods who don't want to "hurt" a test founder,
>>>>and so tolerate that he loses himself in such a circle, bear the main
>>>>responsibility for that mess. That someone, even a scientist, _can_ go wrong
>>>>without realizing it is not such a rare event. It doesn't mean that he's bad or
>>>>not intelligent or such. Sometimes you have this "wall" in your head. And you
>>>>can't find a brick. Later you break out into laughter and you wonder why you
>>>>couldn't see it. Here in our case the main founder is a Russian academic doctor
>>>>who certainly has learned the basics of scientific reasoning. Therefore he will
>>>>understand in the end the difference between testing the end product and
>>>>testing a prototype. He also knows these two obstacles, namely validity and
>>>>reliability. And he should know that statistical calculation can never
>>>>"create" significance if it's not in the data.
>>>>
>>>>I also think that we must change a couple of terms. When users are playing
>>>>with their engines and run them through 100 positions, this can't be called
>>>>"testing"! It looks like testing, but it is not.
>>>
>>>It's testing.
>>>
>>>It tests exactly which engine performs the best in the 100 positions that you
>>>choose.
>>>Which engine is the best is a different question.
>>>
>>>Uri
>>
>>
>>Now you fell into a trap, Uri. Of course you can see who does "best" or "not so
>>good", but you can't know what it means!! Programmers know that you only need
>>to tweak your code a bit and, whoopee, your program becomes better and better
>>in these positions - BUT - BUT - BUT the playing strength does not increase at
>>the same time. Worse - you know that if you tweak in that direction the playing
>>strength _decreases_! Would you call it a "good" result???
>>
>>The users who run these positions with their *ready made* products seemingly
>>research and test and can also make a ranking, but it has almost nothing to do
>>with playing strength in tournaments. If SHREDDER then is in the top ranks this
>>is NOT because this test is good but because the tested SHREDDER was only
>>released because it played "well". So - this result is nothing "new". Hence you
>>can't call it "testing".
>
>I did not claim that it tests playing strength, only that it tests
>something.
>
>Note that I do not use that test.
>
>I use other tests, mainly the gcp and arasan test suites, to test search
>changes. When I analyze the results I do not consider only the number of
>solutions, because I know that an engine can solve a position for the wrong
>reason. So I look at the positions and try to find the reasons for the
>differences in order to learn (I know that the version that seems better is not
>always the better version, and there may be cases where a version solved a
>problem for the wrong reasons).

For these reasons we would not call it a good test. :)



>
>Uri
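
To illustrate the kind of position-by-position comparison Uri describes (looking at which positions changed rather than only at the number of solutions), here is a minimal Python sketch. The function name compare_runs, the position ids, and the result data are all hypothetical; nothing here comes from the gcp or arasan suites, and it does not judge whether a solution was found for the right reason. That part still needs a human looking at the position.

```python
def compare_runs(old, new):
    """Compare two test-suite runs position by position.

    old/new: dict mapping a position id to True if that version solved it.
    Only positions present in both runs are compared.
    """
    ids = sorted(set(old) & set(new))
    return {
        "old_solved": sum(1 for p in ids if old[p]),
        "new_solved": sum(1 for p in ids if new[p]),
        # Positions the new version solves that the old one missed.
        "gained": [p for p in ids if new[p] and not old[p]],
        # Positions the old version solved that the new one now misses.
        "lost": [p for p in ids if old[p] and not new[p]],
    }


# Hypothetical results for two versions of the same engine.
old_run = {"pos01": True, "pos02": False, "pos03": True, "pos04": False}
new_run = {"pos01": True, "pos02": True, "pos03": False, "pos04": True}

report = compare_runs(old_run, new_run)
print(report)
```

The raw totals alone would hide the position that the new version lost. The "gained" and "lost" lists are the positions worth inspecting by hand.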




Last modified: Thu, 15 Apr 21 08:11:13 -0700
