Computer Chess Club Archives


Subject: Re: ONE Position out of 100 can't prove anything.

Author: Uri Blass

Date: 06:17:49 06/13/04


On June 13, 2004 at 00:27:26, Robert Hyatt wrote:

>On June 12, 2004 at 17:23:52, Mike S. wrote:
>
>>On June 12, 2004 at 11:32:03, Robert Hyatt wrote:
>>
>>>(...)
>>
>>>This shows that such tests are basically flawed.  The test should state "The
>>>time to solution is the time where the engine chooses the right move, and then
>>>sticks with it from that point forward, searching at least 30 minutes more..."
>>
>>Why "should..."?? This *is* the condition for a correct solution in the WM Test
>>and always has been, with the exception that the max. time is 20 minutes/pos. A
>>solution is counted from the time when an engine has found *and kept* the
>>solution move until the full testing time of 20 minutes.
>>
>>Rolf fails to inform you about that, or he doesn't know it himself. Does that
>>surprise you?
>
>
>Nothing "surprises" me any longer.  Any more than I am surprised that someone
>thinks that a set of positions can predict a computer program's "rating".  :)
>
>I'll remind you of the story I have told before.  IM Larry Kaufman had such a
>set of test positions back in 1993.  He had used them to accurately predict the
>ratings of a few microcomputer programs after calibrating the set against a group of
>_other_ microcomputer programs.  He wanted to run it against Cray Blitz between
>rounds at the Indy ACM event that year.  I agreed as we had machine time to do
>this.
>
>The result?  A rating of over 3000.  Made no sense.  Problem was that we solved
>_every_ tactical position with a reported time of zero seconds (we only measured
>time to the nearest second back then using an integer.)  First computation
>produced a divide by zero.  He and Don Dailey decided "let's use .5 seconds for
>those zero second times since we know that they are really greater than zero but
>less than one second."  I said OK.  Second computation was over 3000.
>
>The final conclusion?  The formula and test set were no good if there was
>something about the tested program that was "outside the box" used to calibrate
>the original predictor function.  What was "outside the box"???  A Cray
>super-computer so much faster than the micros being tested that it was not even
>remotely funny to compare them.
>
>Forget about test sets predicting ratings.  It is a flawed idea from the get-go
>and only goes downhill from there...
>
>
>
>
>
>>
>>(You can always claim that the test time is too short, but if you for example
>>run every position for a whole day, you'll still find engines which would switch
>>to a wrong move after 26 hours. So you have to draw a line somewhere - and 20
>>minutes/pos. is a time for "intensive analysis"; in a normal game an engine will
>>almost never take more than 10 minutes per pos., and not more than 3 minutes/pos.
>>on average...)
>
>
> Hardware changes.  So the max time has to evolve as well.  But since the basic
>idea is flawed so badly, it really doesn't matter.  I don't pay any attention to
>such estimated ratings.  Neither does anyone else that gives any serious thought
>to the concept...
>
>
>
>
>
>
>
>>
>>http://www.computerschach.de/test/WM-Test.zip
>>(English version included, and results of 4 Crafties.)
>>
>>I hope you didn't assume the WM-Test authors and the complete audience who uses
>>it are idiots who count a "pseudo solution" which is found e.g. after 12
>>seconds, when from 42 secs. to 7 min. an engine switches to a wrong move,
>>etc.?? Of course not. A high percentage of CSS readers are experienced,
>>advanced computer chess users (at least). CSS itself has built, informed and
>>developed that expert audience (I guess the US has nothing comparable,
>>unfortunately). - Also, advice has been given to set the "extra plies" parameter
>>for automatic test-suite functions to 99, to ensure that the complete testing
>>time is used for each position. But in general, we have recommended testing
>>manually and watching the engine's thinking process to get impressions, so to speak.
>>
>>I'm a bit disappointed about your statement that "...such tests are basically
>>flawed.  The test should," when indeed it *does* just that.
>
>
>
>
>No, such tests are flawed, _period_.  The time was a minor point.  The idea does
>not work, has never worked, and never will work.  Yes, you can take a set of
>positions, run them against a group of programs, and compute a formula that fits
>the test position solution times to the actual known ratings of the programs.
>And yes you can now predict the actual rating of any of those programs.  But
>guess what?  You _knew_ the rating before you started so computing it later is a
>bit useless.  But don't use that formula on a program that could be
>significantly different.  For example, a program that is tactically weaker but
>positionally far stronger than the group used to calibrate the test will wreck
>the formula instantly.  Or in the case of Cray Blitz, a program that was _far_
>stronger tactically than any 1993 micro program simply blew out the mostly
>tactical test positions instantly.
>
>So it is flawed, but not _just_ because a program might change its mind later,
>and on faster hardware "later" becomes "sooner".  It is flawed because it is just a
>flawed concept.  Positions can't be used like that, unless you put together a
>few thousand positions.
>
>
>
>>
>>>That stops this kind of nonsensical "faster = worse" problem.  Because as is,
>>>the test simply is meaningless when changing nothing but the hardware results in
>>>a poorer result...
>>
>>Are you aware that only some (few) of the positions are affected by that
>>problem? The WM-Test has 100 positions. Some engines show that behaviour in some
>>of the positions (different engines in different positions). Some fail to
>>finally solve due to that, some solve but would change to a wrong move after
>>20:00, etc.
>
>
>Then those positions should be thrown out.  Along with the rest that are just as
>worthless in testing an unknown program to predict its rating from an equation
>derived by fitting times to known programs...
>
>
>
>
>>
>>Can you guarantee that any single test position you use (and pls don't tell me
>>you use none :-)) is not affected by that problem? Who can guarantee that?
>
>
>I use _zero_ test positions to improve Crafty.  Absolutely _zero_.  I have
>several I use to be sure I didn't break something badly, but I use _games_, and
>only games to measure improvements or steps backward...

I agree that it is a mistake to trust test suites when deciding that a new
version is better, but I think it is also a mistake not to use test positions
as a first step in testing improvements.

If I make some change that is not in the evaluation, then I first use test
positions to see whether the new version is better.
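
As a rough picture of that kind of check (only a minimal sketch, not taken from
any real testing script), the "found and kept" rule that Mike S. describes above
can be applied to a list of (elapsed seconds, current best move) samples taken
from the engine's output; the function name and data layout below are assumptions
for the example only:

def found_and_kept_time(samples, key_move, max_time):
    # samples:  list of (elapsed_seconds, best_move) pairs in time order,
    #           assumed to cover the whole testing budget
    # key_move: the suite's solution move, e.g. "Nf6"
    # max_time: the per-position budget, e.g. 1200 seconds (20 minutes)
    # Returns the time from which the engine held the key move until the end
    # of the budget, or None if the position does not count as solved.
    solved_since = None
    for elapsed, move in samples:
        if elapsed > max_time:
            break
        if move == key_move:
            if solved_since is None:
                solved_since = elapsed   # start of the current streak
        else:
            solved_since = None          # switched away, earlier hit is void
    return solved_since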

Only if I find that the new version is not worse on the test suites do I go to
the next step, which is testing in games (sometimes "not worse" is not enough,
and I decide to test in games only if the new version is better).

I have limited time to spend on testing, and if the new change is worse on the
test suites I prefer not to waste time testing it in games.
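
To make that filtering step concrete: the suite run is only a cheap first pass
in front of the expensive game testing. A sketch under the same assumptions,
reusing the hypothetical found_and_kept_time from above (run_suite, the analyse
callable and the engine wrappers are placeholders, not real tools):

def run_suite(analyse, positions, max_time=1200):
    # analyse:   a callable (fen, max_time) -> list of (elapsed, best_move)
    #            samples; whatever wrapper drives the engine goes here
    # positions: list of (fen, key_move) pairs from the test suite
    solved = 0
    for fen, key_move in positions:
        samples = analyse(fen, max_time)
        if found_and_kept_time(samples, key_move, max_time) is not None:
            solved += 1
    return solved

# Cheap first pass, with placeholder names for the two engine wrappers:
#   old_score = run_suite(analyse_with_old_version, suite_positions)
#   new_score = run_suite(analyse_with_new_version, suite_positions)
#   test_in_games = new_score >= old_score   # only then pay for game testing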

Uri


