Computer Chess Club Archives


Subject: Re: How to use a [cough] EPD test suite to estimate ELO

Author: Dann Corbit

Date: 11:36:05 02/12/99


On February 12, 1999 at 14:06:49, KarinsDad wrote:

>Could a large test suite (200+ positions) with random opening, middlegame, and
>endgame positions be created that could then be compared against programs?
>
>Would this make more sense as compared to the contrived test suites which
>attempt to have weird or difficult positions to analyze?
The hardest thing about test suites is making sure you have the right answer.
If the answer is easy to find, then it is not a worthy problem.  If the answer
is hard to find, are we sure that it is the only good answer?  How can we be
absolutely positive that the answer is the best one?

I have about 8000 test positions.  Of those, roughly 1000 had answers that
were just plain wrong, and another 1500 were questionable (my wild guess is
that about half of those will turn out to be right but not yet solved, and
half will be wrong).  Of the rest, where the published answer and the computer
results do agree, what would happen if we took each of those problems and ran
it for an entire day on Deep Blue?  Is anyone so foolish as to believe that
none of the answers would change?  I will bet at least 10% would have
different answers!  (See, for instance, the "go deep" articles by Drs. Hyatt
and Heinz.)
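
If you want to check this for yourself, here is the sort of re-verification
I mean, sketched in Python with the python-chess library and a UCI engine
such as Stockfish; both are my choices for illustration, not anything the old
suites were built with.  It replays a position at increasing depths and flags
it when the "best" move flips:

import chess
import chess.engine

def answer_is_depth_sensitive(engine, board, depths=(8, 12, 16, 20)):
    """Flag a position whose best move flips as the search deepens."""
    best_moves = []
    for d in depths:
        info = engine.analyse(board, chess.engine.Limit(depth=d))
        best_moves.append(info["pv"][0])
    return len(set(best_moves)) > 1

# Example: a quiet opening position; deeper searches may disagree.
with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
    board = chess.Board(
        "r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5N2/PPPP1PPP/RNBQK2R b KQkq - 3 3")
    if answer_is_depth_sensitive(engine, board):
        print("depth-sensitive: distrust this position's bm field")

Run that over a whole suite and count the flips; my bet above is that the
count is not small.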

So here's the real rub:
Your program gets 28 out of 30.  But it turns out that two of your right
answers are really *wrong*.  Funnier still, of the two you missed, one of
them was actually a much better answer than the listed 'best move'.

In short, EPD test suites are full of bugs.  Lots and lots of horrible bugs.
On top of that, EPD tests are run incorrectly.  A large percentage of the
solutions may be 'accidental'.  In other words, choosing Qb2 may lead to
checkmate, and it is the best move.  Your program chose Qb2, but its eval is
-1432 and its pv indicates that you are about to slit your own throat.  Yet
it is scored as "A CORRECT ANSWER!"
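
The cure is mechanical: credit a position only when the move, the score, and
the pv all agree.  Here is a sketch of that stricter rule, again assuming
python-chess and a UCI engine, with a pass threshold of my own invention:

import chess
import chess.engine

def strict_score(engine, epd, min_cp=50,
                 limit=chess.engine.Limit(depth=14)):
    """Credit a solution only when the engine plays the bm move AND its
    own evaluation shows it sees why, not a -1432 fluke."""
    board = chess.Board()
    ops = board.set_epd(epd)          # parses the position and its bm opcode
    info = engine.analyse(board, limit)
    chosen = info["pv"][0]
    score = info["score"].relative    # from the side to move's point of view
    if chosen not in ops.get("bm", []):
        return False                  # wrong move: no credit
    if score.is_mate():
        return score.mate() > 0       # a mate *for* us, not against us
    return score.score() >= min_cp    # the eval must back the move up

The exact threshold is debatable; the point is that the ce and pv get checked
at all.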

I suggest the following:
A thoroughly debugged, large test suite could be created.  The problems could
be ranked from trivial to Deep Blue/Kasparov killers.  Your program gets a
score if and only if it has not only chosen the correct move, but its ce
(centipawn evaluation) and pv also indicate that it actually sees a solution
to the problem, not just some fluke choice.  The problems are not weighted
equally: the trivial problems are not worth much and the difficult problems
are worth a lot.  The test suite would then be calibrated against every
machine/program in the SSDF and every GM willing to give it a go.  Every
amateur program could be pitted against the test, and any player with a
rating could give it a go.  All the data accumulated could be used to create
a test suite which accurately ranks players, whether machine or human.  By
having several thousand positions, you could weed out memorizers.  Besides
which, anyone who could memorize several thousand positions and their
solutions could probably do well on their own anyway.  But even with all of
this, it would not be too difficult to create programs that cheat against the
test.  A toy sketch of the calibration step follows.
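
For the calibration itself, a simple Elo-style model would do.  Give each
problem a difficulty d, fitted from the SSDF programs and the rated humans
who tried it, and assume a player rated r solves it with probability
1/(1 + 10^((d - r)/400)).  The player's estimated rating is then the r at
which the expected number of solves matches the actual number.  A toy sketch
(my own model and numbers, nothing official):

def p_solve(rating, difficulty):
    """Elo-style chance that a player of this rating solves the problem."""
    return 1.0 / (1.0 + 10.0 ** ((difficulty - rating) / 400.0))

def estimate_rating(results, lo=1000.0, hi=3000.0):
    """results: list of (difficulty, solved) pairs.  Binary-search for the
    rating at which the expected solve count equals the actual count."""
    solved = sum(1 for _, s in results if s)
    for _ in range(50):
        mid = (lo + hi) / 2.0
        expected = sum(p_solve(mid, d) for d, _ in results)
        if expected < solved:
            lo = mid              # solved more than expected: rate higher
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical results against five calibrated problems:
results = [(1800, True), (2000, True), (2200, False),
           (2400, True), (2600, False)]
print(round(estimate_rating(results)))  # a rough rating estimate

Because the expected count rises monotonically with r, the search converges,
and a score made of hard problems pushes the estimate up more than the same
score made of easy ones, which is exactly the weighting suggested above.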


