Computer Chess Club Archives



Subject: Re: ONE Position out of 100 can't prove anything.

Author: Robert Hyatt

Date: 21:27:26 06/12/04


On June 12, 2004 at 17:23:52, Mike S. wrote:

>On June 12, 2004 at 11:32:03, Robert Hyatt wrote:
>
>>(...)
>
>>This shows that such tests are basically flawed.  The test should state "The
>>time to solution is the time where the engine chooses the right move, and then
>>sticks with it from that point forward, searching at least 30 minutes more..."
>
>Why "should..."?? This *is* the condition for a correct solution in the WM Test
>and ever has been, with the exception that the max. time is 20 minutes/pos. A
>solution is counted from the time when an engine has found *and kept* the
>solution move until the full testing time of 20 minutes.
>
>Rolf fails to inform you about that, or he doesn't know it himself. Does that
>surprise you?


Nothing "surprises" me any longer.  Any more than I am surprised that someone
thinks that a set of positions can predict a computer program's "rating".  :)

I'll remind you of the story I have told before.  IM Larry Kaufman had such a
set of test positions back in 1993.  He had used them to accurately predict the
ratings of a few microcomputer programs, after calibrating the formula against a
group of _other_ microcomputer programs.  He wanted to run it against Cray Blitz
between rounds at the Indy ACM event that year.  I agreed, as we had machine
time to do this.

The result?  A rating of over 3000.  Made no sense.  The problem was that we
solved _every_ tactical position with a reported time of zero seconds (we only
measured time to the nearest second back then, using an integer).  The first
computation produced a divide by zero.  He and Don Dailey decided "let's use .5
seconds for those zero-second times, since we know they are really greater than
zero but less than one second."  I said OK.  The second computation was over
3000.
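
To make that arithmetic concrete, here is a toy sketch in Python.  It is _not_
Kaufman's actual formula (that isn't given here); assume a simple log-linear fit
of rating against mean solution time, with invented calibration numbers.  It
still shows the shape of the problem: a mean time of zero breaks the formula
outright, and clamping it to 0.5 seconds extrapolates to a nonsense rating.

    import math

    # Two hypothetical calibration programs: (mean solution time in seconds,
    # known rating).  These numbers are made up purely for illustration.
    t1, r1 = 120.0, 2000.0
    t2, r2 = 15.0, 2450.0

    # Fit rating = a + b * log(time) through the two calibration points.
    b = (r2 - r1) / (math.log(t2) - math.log(t1))
    a = r1 - b * math.log(t1)

    def predicted_rating(mean_time):
        # A mean time of 0 blows up right here (a math domain error in this
        # sketch, a divide by zero in the formula from the story above).
        return a + b * math.log(mean_time)

    print(round(predicted_rating(0.5)))   # "0 s" clamped to 0.5 s -> about 3200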

The final conclusion?  The formula and test set were no good if there was
something about the tested program that was "outside the box" used to calibrate
the original predictor function.  What was "outside the box"???  A Cray
supercomputer so much faster than the micros being tested that it was not even
remotely funny to compare them.

Forget about test sets predicting ratings.  It is a flawed idea from the get-go
and only goes downhill from there...





>
>(You can always claim that the test time is too short, but if, for example, you
>run every position for a whole day, you'll still find engines which would switch
>to a wrong move after 26 hours. So you have to draw a line somewhere - and 20
>minutes/pos. is a time for "intensive analysis"; in a normal game an engine will
>almost never take more than 10 minutes per pos., and not more than 3 minutes/pos.
>on average...)


Hardware changes.  So the max time has to evolve as well.  But since the basic
idea is flawed so badly, it really doesn't matter.  I don't pay any attention to
such estimated ratings.  Neither does anyone else who gives any serious thought
to the concept...







>
>http://www.computerschach.de/test/WM-Test.zip
>(English version included, and results of 4 Crafties.)
>
>I hope you didn't assume the WM-Test authors and the complete audience who uses
>it are idiots who would count a "pseudo solution" which is found e.g. after 12
>seconds, when from 42 secs. to 7 min. the engine switches to a wrong move,
>etc. etc.?? Of course not. A high percentage of CSS readers are experienced,
>advanced computer chess users (at least). CSS itself has built, informed and
>developed that expert audience (I guess the US has nothing comparable,
>unfortunately). - Also, advice has been given to set the "extra plies" parameter
>for automatic testsuite functions to 99, to ensure that the complete testing
>time is used for each position. But in general, we have recommended testing
>manually and watching the engine's thinking process, to get impressions so to speak.
>
>I'm a bit disappointed by your statement that "...such tests are basically
>flawed.  The test should...", when indeed it *does* just that.




No, such tests are flawed, _period_.  The time was a minor point.  The idea does
not work, has never worked, and never will work.  Yes, you can take a set of
positions, run them against a group of programs, and compute a formula that fits
the test position solution times to the actual known ratings of those programs.
And yes, you can then predict the actual rating of any of those programs.  But
guess what?  You _knew_ the rating before you started, so computing it later is
a bit useless.  But don't use that formula on a program that could be
significantly different.  I.e., a program that is tactically weaker but
positionally far stronger than the group used to calibrate the test will wreck
the formula instantly.  Or, in the case of Cray Blitz, a program that was _far_
stronger tactically than any 1993 micro program simply blew out the mostly
tactical test positions instantly.

So it is flawed, but not _just_ because a program might change its mind later,
and on faster hardware "later" becomes "sooner".  It is flawed because it is
just a flawed concept.  Positions can't be used like that, unless you put
together a few thousand of them.
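
For reference, the "found and kept" rule Mike describes above is easy to state
in code.  A rough sketch (my own notation, not the actual WM-Test tooling): the
reported solution time is the earliest moment after which the engine's best move
equals the solution and never changes again before the time limit.

    def solution_time(snapshots, solution, limit=20 * 60):
        """Snapshots are (time in seconds, current best move) pairs taken from
        one position's analysis.  Return the time at which the solution move
        was found and then kept until the limit, or None if unsolved."""
        kept_since = None
        for t, move in snapshots:
            if t > limit:
                break                   # changes after the limit don't matter
            if move == solution:
                if kept_since is None:
                    kept_since = t      # engine starts holding the right move
            else:
                kept_since = None       # switched away; earlier find no longer counts
        return kept_since

    # Example: found at 12 s, abandoned at 42 s, re-found at 7 minutes and kept.
    log = [(12, "Nf6"), (42, "Qh5"), (420, "Nf6"), (1100, "Nf6")]
    print(solution_time(log, "Nf6"))    # 420, not 12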



>
>>That stops this kind of nonsensical "faster = worse" problem.  Because as is,
>>the test simply is meaningless when changing nothing but the hardware results in
>>a poorer result...
>
>Are you aware that only some (few) of the positions are affected by that
>problem? The WM-Test has 100 positions. Some engines show that behaviour in some
>of the positions (different engines in different positions). Some fail to
>finally solve due to that, some solve but would change to a wrong move after
>20:00, etc.


Then those positions should be thrown out, along with the rest, which are just
as worthless for testing an unknown program to predict its rating from an
equation derived by fitting times to known programs...




>
>Can you guarantee that any single test position you use (and pls don't tell me
>you use none :-)) is not affected by that problem? Who can guarantee that?


I use _zero_ test positions to improve Crafty.  Absolutely _zero_.  I have
several I use to be sure I didn't break something badly, but I use _games_, and
only games to measure improvements or steps backward...
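
For reference, turning a match result into an Elo difference is simple
arithmetic.  A minimal sketch (the standard logistic Elo formula plus a rough
normal-approximation error margin; the win/draw/loss numbers are just an
example):

    import math

    def elo_difference(wins, draws, losses):
        """Elo difference implied by a match score, with a rough 95% margin."""
        n = wins + draws + losses
        score = (wins + 0.5 * draws) / n
        # Per-game variance of the score (individual game outcomes are 1, 0.5, 0).
        var = (wins * (1 - score) ** 2
               + draws * (0.5 - score) ** 2
               + losses * (0 - score) ** 2) / n
        margin = 1.96 * math.sqrt(var / n)

        def to_elo(s):
            return -400.0 * math.log10(1.0 / s - 1.0)

        return to_elo(score), to_elo(score - margin), to_elo(score + margin)

    # Example: a new version scores +420 =310 -270 against the old one.
    diff, low, high = elo_difference(420, 310, 270)
    print(f"{diff:+.0f} Elo  (95% range roughly {low:+.0f} to {high:+.0f})")

Note how wide the range still is even after 1000 games; that is why measuring
this way takes a lot of games.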



>Engines are creative in finding ways to decide on the correct move, but for the
>wrong reason, sometimes... You are aware that it is very difficult to avoid that
>100%, especially when a large test suite is compiled?


Correct.  Let me remind you of my words again:  "Using a test set to estimate
the rating of a program is simply a flawed idea from the get-go.  period."




>So please be fair.
>
>Regards,
>Mike Scheidl


Fair isn't part of the game.  This is about "will it work, or will it not
work?"  The answer is "It will not work."



