Author: Uri Blass
Date: 06:17:49 06/13/04
On June 13, 2004 at 00:27:26, Robert Hyatt wrote:

>On June 12, 2004 at 17:23:52, Mike S. wrote:
>
>>On June 12, 2004 at 11:32:03, Robert Hyatt wrote:
>>
>>>(...)
>>>
>>>This shows that such tests are basically flawed. The test should state "The time to solution is the time where the engine chooses the right move, and then sticks with it from that point forward, searching at least 30 minutes more..."
>>
>>Why "should..."?? This *is* the condition for a correct solution in the WM-Test, and always has been, with the exception that the max. time is 20 minutes/pos. A solution is counted from the time when an engine has found *and kept* the solution move until the full testing time of 20 minutes.
>>
>>Rolf fails to inform you about that, or he doesn't know it himself. Does that surprise you?
>
>Nothing "surprises" me any longer. Any more than I am surprised that someone thinks that a set of positions can predict a computer program's "rating". :)
>
>I'll remind you of the story I have told before. IM Larry Kaufman had such a set of test positions back in 1993. He had used them to accurately predict the ratings of a few microcomputer programs after calibrating it against a group of _other_ microcomputer programs. He wanted to run it against Cray Blitz between rounds at the Indy ACM event that year. I agreed, as we had machine time to do this.
>
>The result? A rating of over 3000. Made no sense. The problem was that we solved _every_ tactical position with a reported time of zero seconds (we only measured time to the nearest second back then, using an integer). The first computation produced a divide by zero. He and Don Dailey decided "let's use .5 seconds for those zero-second times, since we know that they are really greater than zero but less than one second." I said OK. The second computation was over 3000.
>
>The final conclusion? The formula and test set were no good if there was something about the tested program that was "outside the box" used to calibrate the original predictor function. What was "outside the box"? A Cray super-computer so much faster than the micros being tested that it was not even remotely funny to compare them.
>
>Forget about test sets predicting ratings. It is a flawed idea from the get-go and only goes downhill from there...
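The failure mode in that story is easy to reproduce. Below is a minimal sketch with invented numbers (Kaufman's actual data and formula are not given in the post): fit a rating predictor to the log of mean solve times for a small calibration group, then feed it a program from far outside that group.

import math

# Invented calibration data, for illustration only: (mean solve time in
# seconds over the suite, known rating) for four hypothetical micros.
calibration = [(180.0, 2100), (90.0, 2200), (45.0, 2300), (20.0, 2400)]

# Least-squares fit of: rating = a + b * log2(mean solve time).
xs = [math.log2(t) for t, _ in calibration]
ys = [float(r) for _, r in calibration]
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
    / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

def predicted_rating(mean_solve_time: float) -> float:
    return a + b * math.log2(mean_solve_time)

# Inside the calibration range the formula "works", but only because
# the ratings were known before we started:
print(round(predicted_rating(90.0)))   # ~2200

# A program that solves everything in a reported 0 seconds first breaks
# the formula outright (math.log2(0.0) raises ValueError), and patching
# the zeros to 0.5 s just extrapolates far outside the calibrated range:
print(round(predicted_rating(0.5)))    # ~2900, as absurd as the >3000 above

The fit can only interpolate among the programs it was built from; anything "outside the box" wrecks it, which is exactly the point of the story.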
>>(You can always claim that the test time is too short, but if you for example run every position for a whole day, you'll still find engines which would switch to a wrong move after 26 hours. So you have to draw a line somewhere, and 20 minutes/pos. is a time for "intensive analysis"; a normal game will nearly never take more than 10 minutes per pos., and not more than 3 minutes/pos. on average...)
>
>Hardware changes. So the max time has to evolve as well. But since the basic idea is flawed so badly, it really doesn't matter. I don't pay any attention to such estimated ratings. Neither does anyone else who gives any serious thought to the concept...
>
>>http://www.computerschach.de/test/WM-Test.zip
>>(English version included, and results of 4 Crafties.)
>>
>>I hope you didn't assume the WM-Test authors, and the complete audience who uses it, are idiots who count a "pseudo solution" which is found e.g. after 12 seconds, when from 42 secs. to 7 min. the engine switches to a wrong move, etc.?? Of course not. A high percentage of CSS readers are experienced advanced computerchess users (at least). CSS itself has built, informed and developed that expert audience (I guess the US has nothing comparable, unfortunately). Also, advice has been given to set the "extra plies" parameter for automatic testsuite functions to 99, to ensure that the complete testing time is used for each position. But in general, we have recommended to test manually and watch the engine's thinking process, to get impressions so to speak.
>>
>>I'm a bit disappointed about your statement that "...such tests are basically flawed. The test should," when indeed it *does* just that.
>
>No, such tests are flawed, _period_. The time was a minor point. The idea does not work, has never worked, and never will work. Yes, you can take a set of positions, run them against a group of programs, and compute a formula that fits the test position solution times to the actual known ratings of the programs. And yes, you can now predict the actual rating of any of those programs. But guess what? You _knew_ the rating before you started, so computing it later is a bit useless. But don't use that formula on a program that could be significantly different. I.e., a program that is tactically weaker but positionally far stronger than the group used to calibrate the test will wreck the formula instantly. Or, in the case of Cray Blitz, a program that was _far_ stronger tactically than any 1993 micro program simply blew out the mostly tactical test positions instantly.
>
>So it is flawed, but not _just_ because a program might change its mind later, and on faster hardware "later" becomes "sooner". It is flawed because it is just a flawed concept. Positions can't be used like that. Unless you put together a few thousand positions.
>
>>>That stops this kind of nonsensical "faster = worse" problem. Because as is, the test simply is meaningless when changing nothing but the hardware results in a poorer result...
>>
>>Are you aware that only some (few) of the positions are affected by that problem? The WM-Test has 100 positions. Some engines show that behaviour in some of the positions (different engines in different positions). Some fail to solve in the end due to that, some solve but would change to a wrong move after 20:00, etc.
>
>Then those positions should be thrown out. Along with the rest, which are just as worthless in testing an unknown program to predict its rating from an equation derived by fitting times to known programs...
>
>>Can you guarantee that any single test position you use (and pls don't tell me you use none :-)) is not affected by that problem? Who can guarantee that?
>
>I use _zero_ test positions to improve Crafty. Absolutely _zero_. I have several I use to be sure I didn't break something badly, but I use _games_, and only games, to measure improvements or steps backward...

I agree that it is a mistake to trust test suites when deciding that a new version is better, but I think it is also a mistake not to use test positions as a first step in testing an improvement. If I make some change that is not in the evaluation, I first use test positions to see whether the new version is better. Only if the new version is not worse on the test suites do I go to the next step, which is testing in games (sometimes "not worse" is not enough, and I decide to test in games only if the new version is better). I have limited time to spend on testing, and if the new thing is worse on the test suites, I prefer not to waste time testing it in games.

Uri
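P.S. For anyone scoring a suite by hand, the WM-Test rule quoted at the top (a solution counts only from the time the engine finds the move *and keeps it* through the full testing time) is easy to state in code. A minimal sketch; the (elapsed seconds, best move) record format is my assumption, not the actual WM-Test tooling:

def solution_time(records, solution, max_time=20 * 60):
    """records: chronological (elapsed_seconds, best_move) pairs covering
    the full test time; returns the earliest time from which the engine
    chose `solution` and never switched away before max_time, else None."""
    found_at = None
    for elapsed, move in records:
        if elapsed > max_time:
            break
        if move == solution:
            if found_at is None:
                found_at = elapsed   # candidate "found and kept" time
        else:
            found_at = None          # switched away: the earlier find is void
    return found_at

# An engine that finds the move after 12 seconds but switches to a wrong
# move at 42 seconds is credited from its later re-find (7 minutes), not
# from the 12-second "pseudo solution" (the moves are made-up examples):
log = [(12, "Nf3"), (42, "Qd2"), (420, "Nf3"), (1200, "Nf3")]
print(solution_time(log, "Nf3"))   # 420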
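And the two-stage gate described above (the suite as a cheap filter, games as the real test) might look like this in a harness; run_suite and run_match are hypothetical callbacks standing in for whatever tools one actually uses, and this sketch implements the simpler "not worse" variant:

def accept_change(run_suite, run_match, eval_change):
    """Sketch of the suite-then-games gate described above.
    run_suite(version) -> positions solved under the found-and-kept rule;
    run_match() -> the new version's score fraction over a game match."""
    if not eval_change:
        # Cheap filter first: a non-evaluation change that is worse on
        # the suite is not worth the time of a full game match.
        if run_suite("new") < run_suite("old"):
            return False
    # Games, and only games, decide the final acceptance.
    return run_match() > 0.5

# Example: the new version solves fewer positions, so no games are played.
scores = {"old": 80, "new": 76}
print(accept_change(lambda v: scores[v], lambda: 0.55, eval_change=False))  # False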