Computer Chess Club Archives


Subject: Re: How to use a [cough] EPD test suite to estimate ELO

Author: Enrique Irazoqui

Date: 16:37:31 02/11/99

On February 11, 1999 at 16:25:10, Bruce Moreland wrote:

>
>On February 11, 1999 at 15:41:25, Dann Corbit wrote:
>
>>Andreas Schwartmann asked an interesting question in r.g.c.c.:
>>"I wonder if anyone can enlighten me on how to use various test suites, like
>>LCT, LCT II and Covax. There are certain formulas on how to calculate the
>>playing strength according to these test suites, right?"
>>
>>Now, ignoring the fact that they are full of bugs and the measures are probably
>>bogus, how *does* one arrive at an ELO from a test suite evaluation?
>>
>>What is the actual mathematical basis for the calculations?
>
>You come up with a formula that turns the times into an Elo rating, then check
>against a reference set of programs, and if there is not a good match, go back
>to the beginning of this sentence.
>
>The test becomes a very good predictor for those programs, which is obviously no
>big deal, since the formula has been constructed after the test has been run,
>and is *designed* to predict well for those programs.  If you wanted, you could
>predict Elo rating based upon the letters in the program's name, and you'd also
>get a good predictor.
>
>The question is whether the suites are measuring something that has to do with
>chess strength.  It makes sense that there is at least some connection, since
>the problems are typically middlegame tactical or positional problems, and
>everybody knows that tactical and positional speed are components of strength.
>So for non-reference programs, perhaps they are comparing something that is
>grossly related to chess strength.
>
>But you have to keep in mind that for the reference programs, the scores
>produced are the scores that the suite builder wanted to produce.  It's a bad
>trap to assume that someone's BS2830 scores back up their SSDF rating, if the
>BS2830 suite was calibrated using SSDF ratings as inputs.
>
>I would never trust Elo numbers produced by a test suite.  I think it makes more
>sense to give the scores in a way that keeps them from looking like Elo ratings,
>so there wouldn't be the tendency to use the scores as Elo ratings, and the
>scoring formula could be less complex, too.

I couldn't agree more with the whole post.
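
To make Bruce's point concrete, here is a rough Python sketch of the kind of
curve fitting he describes. All the times and ratings below are invented just
for illustration; the only point is that a formula fitted to the reference
programs will "predict" those same programs well by construction:

  # The "calibration loop" Bruce describes: fit a formula that maps suite
  # times to Elo, then check it against the programs it was fitted on.
  import numpy as np

  # Hypothetical reference programs: average suite time (s), known rating.
  avg_time = np.array([55.0, 90.0, 130.0, 180.0, 240.0])
  rating = np.array([2620.0, 2540.0, 2460.0, 2380.0, 2300.0])

  # Least-squares fit of  rating ~ a * avg_time + b.
  a, b = np.polyfit(avg_time, rating, 1)

  # The in-sample fit is good because it was *made* to be good.
  predicted = a * avg_time + b
  print("rating = %.2f * avg_time + %.1f" % (a, b))
  print("max in-sample error:", abs(predicted - rating).max())

  # For a new, non-reference program this guarantees nothing.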

Once I hoped that test sets would give me a quick and valid estimate of the
strength of a program. Unfortunately, they don't. Positions in the best-known
tests are cooked often enough to make them meaningless; the tests themselves
fail to reflect real-life playing conditions; positions can be ambiguous and
the given answers plain wrong; and formulas like the one behind the BS2830
test make no sense whatsoever...
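
For reference, the BT2630 formula, as it is usually quoted, is of the simple
"anchor minus average time" kind: 30 positions, up to 15 minutes each, an
unsolved position counting as the full 900 seconds, and the rating being 2630
minus the average time in seconds. A minimal sketch:

  # BT2630-style rating, as the formula is usually quoted:
  # 30 positions, 900 s cap per position, unsolved = full 900 s,
  # rating = 2630 - average solution time in seconds.
  def bt2630_rating(times_seconds):
      assert len(times_seconds) == 30
      capped = [min(t, 900) for t in times_seconds]
      return 2630 - sum(capped) / 30.0

  # Example: 20 positions solved in 60 s each, 10 unsolved.
  print(bt2630_rating([60] * 20 + [900] * 10))  # 2630 - 340 = 2290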

I ran the BT2630 and the BS2830 tactical tests with 18 top programs. Fritz
5.03 and 5.16 scored the worst of all in the BT (???) and the best in the BS;
Junior 4.6 scored much better than Junior 5 in the BS (???) and much worse in
the BT; and so on. I tried to average the 57 positions of the two suites, and
the result makes no sense at all. There is a lot of "reverse engineering" in
the building of test suites, so that they fit the current ratings, but when a
new program arrives all the results suddenly become unreal.
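
The pooling was probably hopeless from the start, anyway. Assuming, just for
illustration, that the BS formula were of the same anchor-minus-average-time
kind as the BT one (the real BS2830 formula is more complex, as Bruce notes),
the two suites anchor their scales at different constants, so identical
per-position performance already yields two different numbers, and the pooled
57 positions belong to neither scale:

  # Why averaging positions across suites is dubious: the anchors differ.
  # Illustration only -- the real BS2830 formula is more complex, and the
  # 27-position count for the BS suite is chosen just to make 57 in total.
  bt_times = [120] * 30   # hypothetical BT2630 times, seconds
  bs_times = [120] * 27   # hypothetical BS-style times, seconds

  print(2630 - sum(bt_times) / 30.0)  # 2510 on the BT scale
  print(2830 - sum(bs_times) / 27.0)  # 2710 on the BS scale

  # Identical per-position performance, two different numbers; the pooled
  # 57 positions have no anchor of their own at all.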

Enrique

>Sorry that this post is somewhat scattered; there is an angry kid in the next
>room.
>
>bruce


