Author: Uri Blass
Date: 09:04:39 06/13/04
On June 13, 2004 at 10:44:15, Robert Hyatt wrote:

>On June 13, 2004 at 09:17:49, Uri Blass wrote:
>
>>On June 13, 2004 at 00:27:26, Robert Hyatt wrote:
>>
>>>On June 12, 2004 at 17:23:52, Mike S. wrote:
>>>
>>>>On June 12, 2004 at 11:32:03, Robert Hyatt wrote:
>>>>
>>>>>(...)
>>>>>
>>>>>This shows that such tests are basically flawed. The test should state "The time to solution is the time where the engine chooses the right move, and then sticks with it from that point forward, searching at least 30 minutes more..."
>>>>
>>>>Why "should..."?? This *is* the condition for a correct solution in the WM-Test and always has been, with the exception that the max. time is 20 minutes/pos. A solution is counted from the time when an engine has found *and kept* the solution move until the full testing time of 20 minutes.
>>>>
>>>>Rolf fails to inform you about that, or he doesn't know it himself. Does that surprise you?
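That "found and kept" rule is mechanical enough to check automatically. A minimal sketch in Python (not the WM-Test tooling itself; the function name and the (elapsed_seconds, move) log format are invented for illustration):

def wm_solution_time(switches, solution, limit=20 * 60):
    """Return the WM-Test-style time to solution, or None.

    switches: list of (elapsed_seconds, move) pairs, one entry per
              best-move change, in chronological order.
    solution: the correct move.
    limit:    total testing time per position (20 min in the WM-Test).

    A solution counts only from the moment the engine finds the
    correct move *and keeps it* for the rest of the testing time.
    """
    found_at = None
    for elapsed, move in switches:
        if elapsed >= limit:
            break
        if move == solution:
            if found_at is None:
                found_at = elapsed   # (re)found the solution move
        else:
            found_at = None          # switched away: earlier find is void
    return found_at

# A "pseudo solution" found at 12s but abandoned at 42s does not count;
# only the final switch back to the solution move does (here, at 7 min.):
print(wm_solution_time([(12, "Nf5"), (42, "Qd2"), (420, "Nf5")], "Nf5"))  # 420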
>>>Nothing "surprises" me any longer. Any more than I am surprised that someone thinks that a set of positions can predict a computer program's "rating". :)
>>>
>>>I'll remind you of the story I have told before. IM Larry Kaufman had such a set of test positions back in 1993. He had used them to accurately predict the ratings of a few microcomputer programs after calibrating it against a group of _other_ microcomputer programs. He wanted to run it against Cray Blitz between rounds at the Indy ACM event that year. I agreed, as we had machine time to do this.
>>>
>>>The result? A rating of over 3000. Made no sense. The problem was that we solved _every_ tactical position with a reported time of zero seconds (we only measured time to the nearest second back then, using an integer). The first computation produced a divide by zero. He and Don Dailey decided "let's use .5 seconds for those zero-second times, since we know that they are really greater than zero but less than one second." I said OK. The second computation was over 3000.
>>>
>>>The final conclusion? The formula and test set were no good if there was something about the tested program that was "outside the box" used to calibrate the original predictor function. What was "outside the box"??? A Cray super-computer so much faster than the micros being tested that it was not even remotely funny to compare them.
>>>
>>>Forget about test sets predicting ratings. It is a flawed idea from the get-go and only goes downhill from there...
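The failure mode is easy to reproduce with a toy version of such a predictor. A sketch with invented, purely illustrative numbers (not Kaufman's actual data or formula): fit rating against the log of average solution time for a group of calibration programs, then feed in a machine that solves everything in well under a second:

import math

# Invented calibration data: (average solution time in seconds, known
# rating). Purely illustrative; these are not Kaufman's numbers.
calibration = [(120.0, 2100), (60.0, 2200), (30.0, 2300), (15.0, 2400)]

# Least-squares fit of rating = a + b * log2(time) over the calibration set.
xs = [math.log2(t) for t, _ in calibration]
ys = [r for _, r in calibration]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar

def predicted_rating(avg_time):
    return a + b * math.log2(avg_time)

# Inside the calibration range the "prediction" merely echoes ratings that
# were already known before the fit was made:
print(predicted_rating(60.0))   # 2200.0

# A machine reporting zero seconds per position breaks the formula outright
# (math.log2(0) is a domain error -- the analogue of the divide by zero),
# and the 0.5-second patch just extrapolates far outside the box:
print(predicted_rating(0.5))    # ~2890 -- higher than any calibration program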
>>>>(You can always claim that the test time is too short, but if you, for example, run every position for a whole day, you'll still find engines which would switch to a wrong move after 26 hours. So you have to draw a line somewhere - and 20 minutes/pos. is a time for "intensive analysis"; a normal game usually will nearly never take more than 10 minutes per pos., and not more than 3 minutes/pos. on average...)
>>>
>>>Hardware changes, so the max time has to evolve as well. But since the basic idea is flawed so badly, it really doesn't matter. I don't pay any attention to such estimated ratings. Neither does anyone else who gives any serious thought to the concept...
>>>
>>>>http://www.computerschach.de/test/WM-Test.zip
>>>>(English version included, and results of 4 Crafties.)
>>>>
>>>>I hope you didn't assume the WM-Test authors, and the complete audience who uses it, are idiots who count a "pseudo solution" which is found e.g. after 12 seconds, when from 42 secs. to 7 min. the engine switches to a wrong move, etc. etc.?? Of course not. A high percentage of CSS readers are experienced, advanced computer chess users (at least). CSS itself has built, informed and developed that expert audience (I guess the US has nothing comparable, unfortunately). - Also, advice has been given to set the "extra plies" parameter for automatic testsuite functions to 99, to ensure that the complete testing time is used for each position. But in general, we have recommended to test manually and watch the engine's thinking process, to get impressions so to speak.
>>>>
>>>>I'm a bit disappointed about your statement that "...such tests are basically flawed. The test should...", when indeed it *does* just that.
>>>
>>>No, such tests are flawed, _period_. The time was a minor point. The idea does not work, has never worked, and never will work. Yes, you can take a set of positions, run them against a group of programs, and compute a formula that fits the test-position solution times to the actual known ratings of the programs. And yes, you can now predict the actual rating of any of those programs. But guess what? You _knew_ the rating before you started, so computing it later is a bit useless. But don't use that formula on a program that could be significantly different. I.e., a program that is tactically weaker but positionally far stronger than the group used to calibrate the test will wreck the formula instantly. Or, in the case of Cray Blitz, a program that was _far_ stronger tactically than any 1993 micro program simply blew out the mostly tactical test positions instantly.
>>>
>>>So it is flawed, but not _just_ because a program might change its mind later, and on faster hardware "later" becomes sooner. It is flawed because it is just a flawed concept. Positions can't be used like that. Unless you put together a few thousand positions.
>>>
>>>>Are you aware that only some (few) of the positions are affected by that problem? The WM-Test has 100 positions. Some engines show that behaviour in some of the positions (different engines in different positions). Some fail to finally solve due to that, some solve but would change to a wrong move after 20:00, etc.
>>>
>>>Then those positions should be thrown out. Along with the rest, which are just as worthless in testing an unknown program to predict its rating from an equation derived by fitting times to known programs...
>>>
>>>>Can you guarantee that any single test position you use (and pls don't tell me you use none :-)) is not affected by that problem? Who can guarantee that?
>>>
>>>I use _zero_ test positions to improve Crafty. Absolutely _zero_. I have several I use to be sure I didn't break something badly, but I use _games_, and only games, to measure improvements or steps backward...
>>
>>I agree that it is a mistake to trust test suites when you decide that a new version is better, but I think that it is a mistake not to use test positions as a first step to test improvements.
>
>When I add something new, such as the pawn majority code, or something similar, I create a few test positions to see if the code works as I intended. All this is is a "go/no-go" test to see if the code appears to work and produce the kinds of answers expected. Then I play games to see if it is "better". It is possible that a new idea works but the program plays worse, because of the speed loss produced by the new code. So even though it solves positions correctly more often than the old version, the thing is weaker.
>
>Positions are good for debugging, or for sanity-checking, to make sure nothing was broken by new additions. But that's all I use 'em for...
>
>>If I do some change that is not in the evaluation, then I first use test positions to see if the new version is better.
>
>Your testing is flawed. If, instead, you mean "I first use some test positions to see if the changes work as planned...", then I'll buy that. But not to see if the new version is "better"...

What I meant here is whether the new version is better in test suites. The point is that if it is worse in test suites, I prefer not to test the change in games. The change may still be productive, but I do not have unlimited time to test changes, and I prefer one of the following:

1) to think about how I can modify the change so that it is productive in test suites, before trying it in games;
2) to reject the change and test another change.

Uri
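Mechanically, both of these uses of positions (a go/no-go check after a change, a quick first pass before games) need the same thing: run a suite, count solved positions, nothing more. A minimal sketch using the python-chess library and any UCI engine; the engine path, suite file and time limit below are placeholders:

import chess
import chess.engine

ENGINE_PATH = "./engine"      # placeholder: path to a UCI engine binary
SUITE_PATH = "suite.epd"      # placeholder: EPD file with "bm" opcodes
SECONDS_PER_POSITION = 10     # quick sanity check, not a 20-minute run

def run_suite():
    solved = total = 0
    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    try:
        with open(SUITE_PATH) as suite:
            for line in suite:
                line = line.strip()
                if not line:
                    continue
                board = chess.Board()
                ops = board.set_epd(line)       # parse position + opcodes
                best_moves = ops.get("bm", [])  # list of chess.Move
                result = engine.play(board, chess.engine.Limit(time=SECONDS_PER_POSITION))
                total += 1
                ok = result.move in best_moves
                solved += ok
                print(f"{'ok  ' if ok else 'MISS'} {line[:60]}")
    finally:
        engine.quit()
    print(f"solved {solved}/{total}")

if __name__ == "__main__":
    run_suite()

Note that this only checks the move chosen at the end of the allotted time; a WM-Test-style run would also have to track best-move switches, as in the first sketch above.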