Computer Chess Club Archives



Subject: Re: ONE Position out of 100 can't prove anything.

Author: Robert Hyatt

Date: 07:44:15 06/13/04



On June 13, 2004 at 09:17:49, Uri Blass wrote:

>On June 13, 2004 at 00:27:26, Robert Hyatt wrote:
>
>>On June 12, 2004 at 17:23:52, Mike S. wrote:
>>
>>>On June 12, 2004 at 11:32:03, Robert Hyatt wrote:
>>>
>>>>(...)
>>>
>>>>This shows that such tests are basically flawed.  The test should state "The
>>>>time to solution is the time where the engine chooses the right move, and then
>>>>sticks with it from that point forward, searching at least 30 minutes more..."
>>>
>>>Why "should..."?? This *is* the condition for a correct solution in the WM Test
>>>and ever has been, with the exception that the max. time is 20 minutes/pos. A
>>>solution is counted from the time when an engine has found *and kept* the
>>>solution move until the full testing time of 20 minutes.
>>>
>>>Rolf fails to inform you about that, or he doesn't know it himself. Does that
>>>surprise you?
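
To pin the scoring rule down: a position counts as solved only from the moment
the engine switches to the key move for the last time and then holds it for the
rest of the 20 minutes.  A rough C sketch of that "found and kept" time, using a
hypothetical change-of-mind log rather than the actual WM-Test tooling:

#include <stdio.h>
#include <string.h>

/* one entry per change of mind: when the engine switched, and to what move */
typedef struct {
  double time;                        /* seconds into the search            */
  char   move[8];                     /* the move the engine switched to    */
} Change;

/* "found and kept" time: start of the final stretch on the key move, or
   -1.0 if the engine ended the 20 minutes on something else.  Assumes the
   log covers the whole test period.                                        */
double solution_time(const Change *log, int n, const char *key) {
  double found = -1.0;
  int i;
  for (i = 0; i < n; i++) {
    if (strcmp(log[i].move, key) == 0) {
      if (found < 0.0) found = log[i].time;  /* (re)found the key move      */
    } else {
      found = -1.0;                   /* switched away: earlier find voided */
    }
  }
  return found;
}

int main(void) {
  /* illustrative log: key move found at 12s, abandoned at 42s, re-found at
     7:00 and kept, so the credited solution time is 420s, not 12s          */
  Change log[] = { {12.0, "Nf5"}, {42.0, "Qd2"}, {420.0, "Nf5"} };
  printf("solved at %.0f seconds\n", solution_time(log, 3, "Nf5"));
  return 0;
}
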
>>
>>
>>Nothing "surprises" me any longer.  Any more than I am surprised that someone
>>thinks that a set of positions can predict a computer program's "rating".  :)
>>
>>I'll remind you of the story I have told before.  IM Larry Kaufman had such a
>>set of test positions back in 1993.  He had used them to accurately predict the
>>ratings of a few microcomputer programs after calibrating them against a group of
>>_other_ microcomputer programs.  He wanted to run it against Cray Blitz between
>>rounds at the Indy ACM event that year.  I agreed as we had machine time to do
>>this.
>>
>>The result?  A rating of over 3000.  Made no sense.  Problem was that we solved
>>_every_ tactical position with a reported time of zero seconds (we only measured
>>time to the nearest second back then using an integer.)  First computation
>>produced a divide by zero.  He and Don Dailey decided "let's use .5 seconds for
>>those zero second times since we know that they are really greater than zero but
>>less than one second."  I said OK.  Second computation was over 3000.
>>
>>The final conclusion?  The formula and test set were no good if there was
>>something about the tested program that was "outside the box" used to calibrate
>>the original predictor function.  What was "outside the box"???  A Cray
>>super-computer so much faster than the micros being tested that it was not even
>>remotely funny to compare them.
>>
>>Forget about test sets predicting ratings.  It is a flawed idea from the get-go
>>and only goes downhill from there...
>>
>>
>>
>>
>>
>>>
>>>(You can always claim that the test time is too short, but if you, for example,
>>>run every position for a whole day, you'll still find engines which would switch
>>>to a wrong move after 26 hours. So you have to draw a line somewhere - and 20
>>>minutes/pos. is a time for "intensive analysis"; in a normal game an engine will
>>>almost never take more than 10 minutes per pos., and not more than 3 minutes/pos.
>>>on average...)
>>
>>
>>Hardware changes.  So the max time has to evolve as well.  But since the basic
>>idea is flawed so badly, it really doesn't matter.  I don't pay any attention to
>>such estimated ratings.  Neither does anyone else who gives any serious thought
>>to the concept...
>>
>>
>>
>>
>>
>>
>>
>>>
>>>http://www.computerschach.de/test/WM-Test.zip
>>>(English version included, and results of 4 Crafties.)
>>>
>>>I hope you didn't assume the WM-Test authors, and the complete audience who uses
>>>it, are idiots who count a "pseudo solution" which is found e.g. after 12
>>>seconds, when from 42 secs. to 7 min. an engine switches to a wrong move,
>>>etc.?? Of course not. A high percentage of CSS readers are experienced,
>>>advanced computer chess users (at least). CSS itself has built, informed and
>>>developed that expert audience (I guess the US has nothing comparable,
>>>unfortunately). - Also, advice has been given to set the "extra plies" parameter
>>>for automatic testsuite functions to 99, to ensure that the complete testing
>>>time is used for each position. But in general, we have recommended testing
>>>manually and watching the engine's thinking process to get an impression, so to
>>>speak.
>>>
>>>I'm a bit disappointed by your statement that "...such tests are basically
>>>flawed.  The test should...", when indeed it *does* do just that.
>>
>>
>>
>>
>>No, such tests are flawed, _period_.  The time was a minor point.  The idea does
>>not work, has never worked, and never will work.  Yes, you can take a set of
>>positions, run them against a group of programs, and compute a formula that fits
>>the test position solution times to the actual known ratings of the programs.
>>And yes you can now predict the actual rating of any of those programs.  But
>>guess what?  You _knew_ the rating before you started, so computing it later is a
>>bit useless.  But don't use that formula on a program that could be
>>significantly different.  E.g., a program that is tactically weaker but
>>positionally far stronger than the group used to calibrate the test will wreck
>>the formula instantly.  Or in the case of Cray Blitz, a program that was _far_
>>stronger tactically than any 1993 micro program simply blew out the mostly
>>tactical test positions instantly.
>>
>>So it is flawed, but not _just_ because a program might change its mind later,
>>and on faster hardware "later" becomes "sooner".  It is flawed because it is just
>>a flawed concept.  Positions can't be used like that, unless you put together a
>>few thousand of them.
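
The failure mode is easy to reproduce with a toy model.  The sketch below uses
invented calibration numbers and an assumed rating = a + b * log10(time) fit,
not Kaufman's actual formula.  Inside the calibration range it interpolates
sensibly; hand it "every position solved in half a second" and it cheerfully
extrapolates past 3000:

#include <stdio.h>
#include <math.h>

#define NPROG 4

int main(void) {
  /* invented calibration data: average log10(seconds to solution) for each
     program, against that program's known rating                           */
  double x[NPROG] = { 2.1, 1.8, 1.5, 1.2 };    /* roughly 125s down to 16s  */
  double y[NPROG] = { 2100.0, 2250.0, 2400.0, 2550.0 };
  double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0, a, b;
  int i;

  /* ordinary least squares for rating = a + b * avg_log_time               */
  for (i = 0; i < NPROG; i++) {
    sx += x[i];  sy += y[i];  sxx += x[i] * x[i];  sxy += x[i] * y[i];
  }
  b = (NPROG * sxy - sx * sy) / (NPROG * sxx - sx * sx);
  a = (sy - b * sx) / NPROG;

  /* a program inside the calibration range comes out looking sensible...   */
  printf("avg time ~40s  -> predicted rating %.0f\n", a + b * 1.6);

  /* ...but "every position in 0.5 seconds" is far outside the box, and the
     fitted line reports a rating well over 3000                             */
  printf("avg time 0.5s  -> predicted rating %.0f\n", a + b * log10(0.5));
  return 0;
}
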
>>
>>
>>
>>>
>>>>That stops this kind of nonsensical "faster = worse" problem.  Because as is,
>>>>the test simply is meaningless when changing nothing but the hardware results in
>>>>a poorer result...
>>>
>>>Are you aware that only some (few) of the positions are affected by that
>>>problem? The WM-Test has 100 positions. Some engines show that behaviour in some
>>>of the positions (different engines in different positions). Some fail to
>>>finally solve due to that, some solve but would change to a wrong move after
>>>20:00, etc.
>>
>>
>>Then those positions should be thrown out.  Along with the rest that are just as
>>worthless in testing an unknown program to predict its rating from an equation
>>derived by fitting times to known programs...
>>
>>
>>
>>
>>>
>>>Can you guarantee that any single test position you use (and please don't tell
>>>me you use none :-)) is not affected by that problem? Who can guarantee that?
>>
>>
>>I use _zero_ test positions to improve Crafty.  Absolutely _zero_.  I have
>>several I use to be sure I didn't break something badly, but I use _games_, and
>>only games, to measure improvements or steps backward...
>
>I agree that it is a mistake to trust test suites when you decide that a new
>version is better, but I think that it is a mistake not to use test positions
>as a first step in testing an improvement.

When I add something new, such as the pawn majority code, or something similar,
I create a few test positions to see if the code works as I intended.  This is
just a "go/no-go" test to see if the code appears to work and produces the kinds
of answers expected.  Then I play games to see if it is "better".  It is
possible that a new idea works but the program plays worse, because of the speed
loss produced by the new code.  So even though it solves positions correctly
more often than the old version, the thing is weaker.
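
For perspective on what the games then have to resolve, here is a
back-of-the-envelope sketch: a made-up 200-game match result, the standard
logistic score-to-Elo conversion, and a crude error term that ignores draw
effects:

#include <stdio.h>
#include <math.h>

/* standard logistic conversion from a match score to an Elo difference     */
static double to_elo(double score) {
  return -400.0 * log10(1.0 / score - 1.0);
}

int main(void) {
  double games = 200.0, wins = 108.0, draws = 40.0;    /* made-up result     */
  double score = (wins + 0.5 * draws) / games;         /* 0.64               */
  double se    = sqrt(score * (1.0 - score) / games);  /* crude std. error   */

  printf("score %.3f -> about %+.0f Elo\n", score, to_elo(score));
  printf("rough 95%% interval: %+.0f to %+.0f Elo\n",
         to_elo(score - 2.0 * se), to_elo(score + 2.0 * se));
  return 0;
}

Even a couple hundred games leaves an interval roughly a hundred Elo wide, so
small changes take a lot of games to confirm either way.
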

Positions are good for debugging, or for sanity-checking, to make sure nothing
was broken with new additions.  But that's all I use 'em for...




>
>If I make some change that is not in the evaluation, then I first use test
>positions to see if the new version is better.

Your testing is flawed.  If, instead, you mean "I first use some test positions
to see if the changes work as planned..." then I'll buy that.  But not to see if
the new version is "better"...
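
A "changes work as planned" check can be as small as the sketch below, where
the Search() stub and the two positions are placeholders rather than Crafty
code: run the handful of positions the new code is supposed to handle and
report pass or fail, with no rating attached to anything:

#include <stdio.h>
#include <string.h>

/* placeholder for a real engine call; a real harness would search the
   position with the new code enabled and return the move it chose          */
static const char *Search(const char *fen) {
  (void)fen;
  return "Kb1";
}

int main(void) {
  /* placeholder positions: a FEN plus the move the new code should prefer  */
  struct { const char *fen; const char *bm; } suite[] = {
    { "8/8/8/8/8/2k5/8/K7 w - -", "Kb1" },
    { "8/8/8/8/8/5k2/8/7K w - -", "Kh2" },
  };
  int i, passed = 0, n = (int)(sizeof(suite) / sizeof(suite[0]));

  for (i = 0; i < n; i++)
    if (strcmp(Search(suite[i].fen), suite[i].bm) == 0) passed++;

  printf("go/no-go: %d of %d positions gave the expected move\n", passed, n);
  return 0;
}
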


>
>Only if I find that the new version is not worse in test suites do I go to the
>next step, which is testing in games (sometimes "not worse" is not enough and I
>decide to test in games only if the new version is better).
>
>I have limited time to spend on testing, and if the new thing is worse in test
>suites I prefer not to waste time testing it in games.
>
>Uri


