Computer Chess Club Archives


Subject: Re: ONE Position out of 100 can't prove anything.

Author: Uri Blass

Date: 09:04:39 06/13/04


On June 13, 2004 at 10:44:15, Robert Hyatt wrote:

>On June 13, 2004 at 09:17:49, Uri Blass wrote:
>
>>On June 13, 2004 at 00:27:26, Robert Hyatt wrote:
>>
>>>On June 12, 2004 at 17:23:52, Mike S. wrote:
>>>
>>>>On June 12, 2004 at 11:32:03, Robert Hyatt wrote:
>>>>
>>>>>(...)
>>>>
>>>>>This shows that such tests are basically flawed.  The test should state "The
>>>>>time to solution is the time where the engine chooses the right move, and then
>>>>>sticks with it from that point forward, searching at least 30 minutes more..."
>>>>
>>>>Why "should..."?? This *is* the condition for a correct solution in the WM Test
>>>>and always has been, with the exception that the max. time is 20 minutes/pos. A
>>>>solution is counted from the time when an engine has found *and kept* the
>>>>solution move until the full testing time of 20 minutes.
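To make that rule concrete, here is a minimal sketch in Python of how a
"found and kept" solution time could be computed.  This is not code from the
WM-Test package; the snapshot log format and the move names are assumptions
made up for illustration only.

# Sketch of the scoring rule described above: a position counts as solved
# only from the moment the engine switches to the correct move and never
# leaves it again before the 20-minute cutoff.

MAX_TIME = 20 * 60  # 20 minutes per position, in seconds

def time_to_solution(snapshots, solution_move, max_time=MAX_TIME):
    """snapshots: list of (elapsed_seconds, best_move) pairs in time order.
    Returns the elapsed time at which the engine found *and kept* the
    solution move until max_time, or None if the position is unsolved."""
    found_at = None
    for elapsed, move in snapshots:
        if elapsed > max_time:
            break
        if move == solution_move:
            if found_at is None:
                found_at = elapsed   # first moment the right move is on top
        else:
            found_at = None          # switched away: an earlier find no longer counts
    return found_at

# Example log: found at 12 s, switched away at 42 s, back for good at 7 min.
log = [(12, "Rxf7"), (42, "Qd3"), (420, "Rxf7"), (1200, "Rxf7")]
print(time_to_solution(log, "Rxf7"))   # -> 420, not 12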
>>>>
>>>>Rolf fails to inform you about that, or he doesn't know it himself. Does that
>>>>surprise you?
>>>
>>>
>>>Nothing "surprises" me any longer.  Any more than I am surprised that someone
>>>thinks that a set of positions can predict a computer program's "rating".  :)
>>>
>>>I'll remind you of the story I have told before.  IM Larry Kaufman had such a
>>>set of test positions back in 1993.  He had used them to accurately predict the
>>>ratings of a few microcomputer programs after calibrating them against a group of
>>>_other_ microcomputer programs.  He wanted to run it against Cray Blitz between
>>>rounds at the Indy ACM event that year.  I agreed as we had machine time to do
>>>this.
>>>
>>>The result?  A rating of over 3000.  Made no sense.  The problem was that we
>>>solved _every_ tactical position with a reported time of zero seconds (we only
>>>measured time to the nearest second back then, using an integer).  The first
>>>computation produced a divide by zero.  He and Don Dailey decided "let's use
>>>0.5 seconds for those zero-second times since we know that they are really
>>>greater than zero but less than one second."  I said OK.  The second
>>>computation was over 3000.
>>>
>>>The final conclusion?  The formula and test set were no good if there was
>>>something about the tested program that was "outside the box" used to calibrate
>>>the original predictor function.  What was "outside the box"???  A Cray
>>>super-computer so much faster than the micros being tested that it was not even
>>>remotely funny to compare them.
>>>
>>>Forget about test sets predicting ratings.  It is a flawed idea from the get-go
>>>and only goes downhill from there...
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>(You can always claim that the test time is too short, but if you for example
>>>>run every position for a whole day, you'll still find engines which would switch
>>>>to a wrong move after 26 hours. So you have to draw a line somewhere - and 20
>>>>minutes/pos. is a time for "intensive analysis"; in a normal game an engine will
>>>>almost never take more than 10 minutes per pos., and not more than 3 minutes/pos.
>>>>on average...)
>>>
>>>
>>> Hardware changes.  So the max time has to evolve as well.  But since the basic
>>>idea is flawed so badly, it really doesn't matter.  I don't pay any attention to
>>>such estimated ratings.  Neither does anyone else that gives any serious thought
>>>to the concept...
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>http://www.computerschach.de/test/WM-Test.zip
>>>>(English version included, and results of 4 Crafties.)
>>>>
>>>>I hope you didn't assume the WM-Test authors and the complete audience who uses
>>>>it are idiots who count a "pseudo solution" which is found e.g. after 12
>>>>seconds, when from 42 secs. to 7 min. an engine switches to a wrong move,
>>>>etc.?  Of course not. A high percentage of CSS readers are experienced
>>>>advanced computerchess users (at least). CSS itself has built, informed and
>>>>developed that expert audience (I guess the US has nothing comparable,
>>>>unfortunately). - Also, advice has been given to set the "extra plies" parameter
>>>>for automatic testsuite functions to 99, to ensure that the complete testing
>>>>time is used for each position. But in general, we have recommended testing
>>>>manually and watching the engine's thinking process to get impressions, so to speak.
>>>>
>>>>I'm a bit disappointed by your statement that "...such tests are basically
>>>>flawed.  The test should," when in fact it already *does* just that.
>>>
>>>
>>>
>>>
>>>No, such tests are flawed, _period_.  The time was a minor point.  The idea does
>>>not work, has never worked, and never will work.  Yes, you can take a set of
>>>positions, run them against a group of programs, and compute a formula that fits
>>>the test position solution times to the actual known ratings of the programs.
>>>And yes you can now predict the actual rating of any of those programs.  But
>>>guess what?  You _knew_ the rating before you started, so computing it later is
>>>a bit useless.  But don't use that formula on a program that could be
>>>significantly different.  E.g., a program that is tactically weaker but
>>>positionally far stronger than the group used to calibrate the test will wreck
>>>the formula instantly.  Or in the case of Cray Blitz, a program that was _far_
>>>stronger tactically than any 1993 micro program simply blew out the mostly
>>>tactical test positions instantly.
>>>
>>>So it is flawed, but not _just_ because a program might change its mind later,
>>>and because on faster hardware "later" becomes "sooner".  It is flawed because
>>>it is just a flawed concept.  Positions can't be used like that, unless you put
>>>together a few thousand positions.
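To illustrate the calibration idea being criticized here, a minimal sketch in
Python: fit known ratings against log solution times for a small calibration
group, then use the fitted line to "predict" a new program.  All of the
numbers below are invented for illustration only; the point is just that a
program which solves everything in a fraction of a second lies far outside
the range the formula was fitted on, so the extrapolated rating means nothing.

import math

# (average time-to-solution in seconds, known rating) for a hypothetical
# calibration group of programs
calibration = [(90.0, 2200), (45.0, 2300), (20.0, 2400), (8.0, 2500)]

# simple least-squares fit of rating against log(time)
xs = [math.log(t) for t, _ in calibration]
ys = [r for _, r in calibration]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

def predicted_rating(avg_time_seconds):
    return intercept + slope * math.log(avg_time_seconds)

print(round(predicted_rating(30.0)))  # inside the calibration range: plausible
print(round(predicted_rating(0.5)))   # a "zero second" solver, as with Cray
                                      # Blitz: pure extrapolation, the number
                                      # is an artifact of the fitted line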
>>>
>>>
>>>
>>>>
>>>>>That stops this kind of nonsensical "faster = worse" problem.  Because as is,
>>>>>the test simply is meaningless when changing nothing but the hardware results in
>>>>>a poorer result...
>>>>
>>>>Are you aware that only some (few) of the positions are affected by that
>>>>problem? The WM-Test has 100 positions. Some engines show that behaviour in some
>>>>of the positions (different engines in different positions). Some fail to
>>>>finally solve due to that, some solve but would change to a wrong move after
>>>>20:00, etc.
>>>
>>>
>>>Then those positions should be thrown out.  Along with the rest that are just as
>>>worthless in testing an unknown program to predict its rating from an equation
>>>derived by fitting times to known programs...
>>>
>>>
>>>
>>>
>>>>
>>>>Can you guarantee that any single test position you use (and pls don't tell me
>>>>you use none :-)) is not affected by that problem? Who can guarantee that?
>>>
>>>
>>>I use _zero_ test positions to improve Crafty.  Absolutely _zero_.  I have
>>>several I use to be sure I didn't break something badly, but I use _games_, and
>>>only games to measure improvements or steps backward...
>>
>>I agree that it is a mistake to trust test suites when you decide that a new
>>version is better, but I think that it is a mistake not to use test positions
>>as a first step to test improvements.
>
>When I add something new, such as the pawn majority code, or something similar,
>I create a few test positions to see if the code works as I intended.  This is
>just a "go/no-go" test to see if the code appears to work and produce the kinds
>of answers expected.  Then I play games to see if it is "better".  It is
>possible that a new idea works but the program plays worse, because of the speed
>loss produced by the new code.  So even though it solves positions correctly
>more often than the old version, the thing is weaker.
>
>Positions are good for debugging.  Or for sanity-checking.  To make sure nothing
> was broken with new additions.  But that's all I use 'em for...
>
>
>
>
>>
>>If I make a change that is not in the evaluation then I first use test
>>positions to see if the new version is better.
>
>Your testing is flawed.  If, instead, you mean "I first use some test positions
>to see if the changes work as planned.." then I'll buy that.  But not to see if
>the new version is "better"...

I meant here whether the new version is better in test suites.
The point is that if it is worse in test suites, I prefer not to test the change
in games.

The change may still be productive, but I do not have unlimited time to test
changes, and I prefer one of the following:
1) To think about how I can modify the change so that it is productive in test
suites before trying it in games.

2) To reject the change and test another change.

Uri



