Author: Rolf Tueschen
Date: 10:20:56 06/20/04
On June 20, 2004 at 13:10:23, Uri Blass wrote:

>On June 20, 2004 at 12:53:50, Rolf Tueschen wrote:
>
>>On June 20, 2004 at 12:39:54, Uri Blass wrote:
>>
>>>On June 20, 2004 at 12:34:43, Rolf Tueschen wrote:
>>>
>>>>On June 20, 2004 at 12:10:57, David Dahlem wrote:
>>>>
>>>>>I've seen numerous examples of one engine solving a test suite position in a few
>>>>>seconds, while another engine of known equal game playing strength never finds
>>>>>the solution, even after hours of analysis. To me, this makes test suites
>>>>>worthless, or at least very difficult to interpret the results.
>>>>>
>>>>>Regards
>>>>>Dave
>>>>
>>>>
>>>>Yes, correct, this is what is called the lack of reliability of the results, as
>>>>Sandro explained. It's a typical wrong with these position tests, but all test
>>>>knowies know it, however the question is how to explain that triviality to lays
>>>>and motivated users and to a founder with a blind spot? In special who is losing
>>>>himself in the circle argument that every critic at first should run the test
>>>>suite because they would THEN realize how good it is. You know from the chess
>>>>quality of these positions on...! I can only repeat this: a famous CC journal
>>>>and a whole team of forum mods who don't want to "hurt" a test founder and so
>>>>tolerate that he loses himself in such a circle - is the main responsible for
>>>>that mess. Because that someone, even a scientist, _can_ go wrong and can't
>>>>realize this, that is not such a seldom event. It doesn't mean that he's bad or
>>>>not intelligent or such. Sometimes you have this "wall" in your head. And you
>>>>can't find a brick. Later you break out into laughter and you wonder why you
>>>>couldn't see it. Here in our case the main founder is a Russian academic doctor
>>>>who certainly has learned the basics of scientific reasoning. Therefore he will
>>>>understand in the end the difference between testing the end-product or a
>>>>prototype. He does also know these two obstacles, namely validity and
>>>>reliability. And he should know that statistical calculation could never
>>>>"create" significance if it's not in the data.
>>>>
>>>>I do also think that we must change a couple of terms. When the users are
>>>>playing with their engines and run them through 100 positions, this can't be
>>>>called "testing"! It looks like but it's not testing.
>>>
>>>It's testing.
>>>
>>>It tests exactly which engine performs the best in the 100 positions that you
>>>choose.
>>>Which engine is the best is a different question.
>>>
>>>Uri
>>
>>
>>Now you fell into a trap, Uri. Of course you can see who does "best" or "not so
>>good" but you can't know what it means!! Because programmers know that you must
>>only tweak a bit your code and whoopie - your program becomes better and better
>>in these positions - BUT - BUT - BUT the playing strength didn't increase at the
>>same time. Worse - you know that if you tweak into that direction the playing
>>strength _decreases_! Would you call it a "good" result???
>>
>>The users who run these positions with their *ready made* products seemingly
>>research and test and can also make a ranking, but it has almost nothing to do
>>with playing strength in tournaments. If SHREDDER then is in the top ranks this
>>is NOT because this test is good but because the tested SHREDDER was only
>>released because it played "well". So - this result is nothing "new". Hence you
>>can't call it "testing".
>
>I did not claim that it tests playing strength but only that it tests
>something.
>
>Note that I do not use that test.
>
>I use other tests mainly gcp and arasan test suites to test search changes and
>when I analyze the results I do not consider only number of solutions because I
>know that the engine can solve a position for the wrong reason so I look in
>positions and try to find the reasons for differences in order to learn (I know
>that not always the version that seems better is the better version and there
>may be cases when a version solved the problem for the wrong reasons).

For these reasons we would not call it a good test. :)

>
>Uri
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.