Author: Rolf Tueschen
Date: 09:53:50 06/20/04
On June 20, 2004 at 12:39:54, Uri Blass wrote:

>On June 20, 2004 at 12:34:43, Rolf Tueschen wrote:
>
>>On June 20, 2004 at 12:10:57, David Dahlem wrote:
>>
>>>I've seen numerous examples of one engine solving a test suite position in a few
>>>seconds, while another engine of known equal game playing strength never finds
>>>the solution, even after hours of analysis. To me, this makes test suites
>>>worthless, or at least makes the results very difficult to interpret.
>>>
>>>Regards
>>>Dave
>>
>>
>>Yes, correct, this is what is called the lack of reliability of the results, as
>>Sandro explained. It's a typical flaw of these position tests, and everyone who
>>knows testing knows it. The question, however, is how to explain that triviality
>>to laymen and motivated users, and to a founder with a blind spot? Especially one
>>who loses himself in the circular argument that every critic should first run the
>>test suite, because only THEN would they realize how good it is, you know, from
>>the chess quality of the positions alone...! I can only repeat this: a famous CC
>>journal and a whole team of forum mods who don't want to "hurt" a test founder,
>>and so tolerate that he loses himself in such a circle, are mainly responsible
>>for that mess. That someone, even a scientist, _can_ go wrong without realizing
>>it is not such a rare event. It doesn't mean that he's bad or unintelligent or
>>anything like that. Sometimes you have this "wall" in your head and you can't
>>find a single loose brick in it. Later you burst out laughing and wonder why you
>>couldn't see it. Here in our case the main founder is a Russian academic with a
>>doctorate who has certainly learned the basics of scientific reasoning.
>>Therefore he will understand in the end the difference between testing the end
>>product and testing a prototype. He also knows the two obstacles, namely
>>validity and reliability. And he should know that statistical calculation can
>>never "create" significance if it's not in the data.
>>
>>I also think that we must change a couple of terms. When users play with their
>>engines and run them through 100 positions, this can't be called "testing"! It
>>looks like testing, but it's not.
>
>It's testing.
>
>It tests exactly which engine performs best in the 100 positions that you
>choose.
>Which engine is the best is a different question.
>
>Uri

Now you have fallen into a trap, Uri. Of course you can see who does "best" or
"not so well", but you can't know what it means! Programmers know that you only
have to tweak your code a bit and, whoopee, your program becomes better and
better in these positions, BUT the playing strength doesn't increase at the same
time. Worse, you know that if you tweak in that direction the playing strength
_decreases_! Would you call that a "good" result? The users who run these
positions with their *ready-made* products seemingly research and test and can
even produce a ranking, but it has almost nothing to do with playing strength in
tournaments. If SHREDDER then is in the top ranks, this is NOT because the test
is good but because the tested SHREDDER was only released because it played
well. So this result is nothing "new". Hence you can't call it "testing".
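(To illustrate the point that statistics cannot "create" significance that is not in the data, here is a small, purely hypothetical sketch with invented solve counts, showing how wide the uncertainty on a 100-position suite really is. On such a suite, even a 5-position gap between two engines sits well inside the noise.)

import math

def solve_rate_interval(solved, total, z=1.96):
    # Normal-approximation 95% confidence interval for a suite score.
    p = solved / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

# Hypothetical results: engine A solves 70/100, engine B solves 75/100.
for name, solved in (("Engine A", 70), ("Engine B", 75)):
    lo, hi = solve_rate_interval(solved, 100)
    print(f"{name}: {solved}/100 solved, 95% CI roughly {lo:.2f} to {hi:.2f}")

# The intervals come out at roughly 0.61 to 0.79 vs. 0.67 to 0.83: they overlap
# heavily, so the 5-position difference says nothing reliable about which engine
# is better on the suite, let alone about playing strength in games.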