Computer Chess Club Archives


Subject: Re: How to use a [cough] EPD test suite to estimate ELO

Author: KarinsDad

Date: 14:03:53 02/12/99

On February 12, 1999 at 14:36:05, Dann Corbit wrote:

>On February 12, 1999 at 14:06:49, KarinsDad wrote:
>
>>Could a large test suite (200+ positions) of random opening, middlegame,
>>and endgame positions be created, against which programs could then be
>>measured?
>>
>>Would this make more sense than the contrived test suites, which try to
>>collect weird or difficult positions to analyze?
>The hardest thing about test suites is making sure you have the right answer.
>If the answer is easy to find, then it is not a worthy problem.  If the answer
>is hard to find, are we sure that it is the only answer that is a good answer?
>How can we be absolutely positive that the answer is the best answer?
>
>I have about 8000 test positions.  Of those, 1000 of the given answers
>were just plain wrong.  1500 were questionable (probably about half will
>turn out to be right but not yet solved, and half will be wrong, but I am
>just making a wild guess here).  Of the rest, which did get 'solved' with
>the computer results in agreement, what would happen if we took each of
>those problems and ran it for an entire day on Deep Blue?  Is anyone so
>foolish as to believe that none of the answers would change?  I will bet
>at least 10% would have different answers!  (See, for instance, the "go
>deep" articles by Drs. Hyatt and Heinz.)
>
>So here's the real rub:
>Your program gets 28 out of 30.  But it turns out that two of your right answers
>are really *wrong*.  Funnier still, of the two you missed, one was
>actually a much better answer than the 'best move'.
>
>In short, EPD test suites are full of bugs.  Lots and lots of horrible bugs.
>Finally, EPD tests are run incorrectly.  A large percentage of the solutions may
>be 'accidental'.  In other words, choosing Qb2 may lead to checkmate.  It is the
>best move.  Your program chose Qb2, but the eval is -1432 and your pv indicates
>that you are about to slit your own throat.  But it is scored as "A CORRECT
>ANSWER!"
>
>I suggest the following:
>A thoroughly debugged, large test suite could be created.  The problems
>could be ranked from trivial to Deep Blue/Kasparov killers.  Your program
>gets credit if and only if it not only chooses the correct move, but its
>ce and pv also indicate that it actually sees a solution to the problem,
>not just some fluke choice.

But this is irrelevant, isn't it? If you have a large enough sample set,
the correct answer is correct, regardless of whether the computer thought
it was correct for a different reason. When computers or humans actually
play games, it doesn't matter why a move is chosen; given a large enough
sample of games, there is a win-draw-loss record against opponents of a
certain "strength", and hence a rating to go with it.

Computers and humans both blunder into bad positions, either through poor
evaluation or because the bad position lies beyond their event horizon.
Skill comes down to the percentage of the time you make "correct" or
nearly correct moves versus the percentage of the time you make outright
blunders. So if a program finds the correct move, then for that position
it did OK, regardless of the reason why.

Stating that you must get not only the correct move, but also the pv and
the ce (what is a ce? I asked this question before and do not think I got
a response; it sounds like an evaluation number), seems like overkill.
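
(For what it's worth, in the EPD standard "ce" is the "centipawn
evaluation" opcode and "pv" the "predicted variation", so the guess above
is right.)  Here is a toy sketch of the two scoring rules being argued
about; the field names and the 100-centipawn threshold are my own
inventions, not part of any standard.

    # Move-only scoring: credit for the key move, however it was found.
    def move_only_score(best_moves, engine_move):
        return engine_move in best_moves

    # Strict scoring: the move must match AND the engine's ce/pv must
    # show that it actually sees why the move works.
    def strict_score(best_moves, engine_move, engine_ce, engine_pv,
                     solution_pv, min_ce=100):
        if engine_move not in best_moves:
            return False
        if engine_ce < min_ce:    # the Qb2-with-eval=-1432 case fails here
            return False
        # The engine's line must begin with the known solution line.
        return engine_pv[:len(solution_pv)] == solution_pv

    # The Qb2 example from above: right move, hopeless evaluation.
    print(move_only_score({"Qb2"}, "Qb2"))                       # True
    print(strict_score({"Qb2"}, "Qb2", -1432,
                       ["Qb2", "Kh8"], ["Qb2"]))                 # False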

>  The problems are not weighted equally.  The trivial problems are not
>worth much and the difficult problems are worth a lot.  The test suite
>would then be calibrated against every machine/program in the SSDF and
>every GM willing to give it a go.  Every amateur program could be pitted
>against the test, and any player with a rating could give it a try.  All
>the data accumulated could be used to create a test suite which
>accurately ranks players, whether machine or human.  By having several
>thousand positions, you could weed out memorizers.  Besides which, anyone
>who could memorize several thousand positions and their solutions could
>probably do well on their own anyway.  But even with all of this, it
>would not be too difficult to create programs that could cheat against
>the test.

Cheating against the test is not the question, is it? The question is
whether a test can be devised that approximates a rating. If someone
cheats, what have they accomplished? Once their program goes head to head
with others in competition, its true rating will eventually shine
through.
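
And if such a calibrated suite were ever built, the last step is plain
curve fitting. A toy sketch (every number below is fabricated) of mapping
weighted suite scores from rated SSDF programs to an estimated rating for
a newcomer:

    # Ordinary least squares for y = a*x + b.
    def fit_line(xs, ys):
        n = len(xs)
        mx = sum(xs) / n
        my = sum(ys) / n
        a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
        return a, my - a * mx

    # Hypothetical (weighted suite score, known rating) calibration pairs.
    scores  = [120, 180, 240, 300, 360]
    ratings = [2100, 2250, 2380, 2490, 2600]

    a, b = fit_line(scores, ratings)
    print(a * 270 + b)   # estimated rating for a new program scoring 270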

KarinsDad


