Computer Chess Club Archives


Subject: Re: Test suites - can they reliably predict ELO?

Author: Roger

Date: 03:51:41 12/12/99


On December 11, 1999 at 17:52:56, Tom King wrote:

>Which of the well known test suites predicts the strength of chess programs most
>accurately?
>
>I ask this, because I recently made some *slight* mods. to the evaluation
>function in my program, Francesca. I ran the LCT-2 suite, and the results
>indicated that it was a wash - the modification gave me about 5 ELO points,
>apparently.
>
>I then ran a series of fast games against another amateur program. I realize
>it's important to play a large number of games, to reduce the margin of error,
>so I ran two matches of 65 games. The result was this:
>
>MATCH 1
>"Normal" Francesca scored 37% against the amateur program.
>
>MATCH 2
>"Modified" Francesca scored 45% against the amateur program.
>
>Quite a difference! It implies that the modification is worth over 50 ELO. I
>guess I need to play more games, against a variety of programs to verify whether
>this improvement is real, or imaginary.
>
>Anyhow, beware of reading too much into ELO predictions of test suites..
>
>Cheers All,
>Tom
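
As a rough check on the figures quoted above, here is a minimal sketch of the arithmetic. It assumes the standard logistic Elo formula and treats every game as a plain win or loss (draws reduce the variance, so the error estimate is on the pessimistic side). The function names are only illustrative.

```python
import math

def elo_diff(score):
    """Elo difference implied by an expected score (0 < score < 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def score_std_error(score, games):
    """Standard error of a match score, treating each game as win/loss.
    Draws would lower the variance, so this is an upper bound."""
    return math.sqrt(score * (1.0 - score) / games)

# Figures from the two 65-game matches quoted above.
normal, modified, games = 0.37, 0.45, 65

# Rating gain implied by moving from 37% to 45% against the same opponent.
gain = elo_diff(modified) - elo_diff(normal)

# Standard error of the difference between two independent match scores.
se_diff = math.hypot(score_std_error(normal, games),
                     score_std_error(modified, games))
z = (modified - normal) / se_diff

print(f"implied gain: {gain:.0f} Elo")
print(f"score difference: {modified - normal:.2f} +/- {se_diff:.2f} (1 s.e.)")
print(f"z = {z:.2f}")
```

Under these assumptions the implied gain is indeed over 50 ELO, but the difference between the two match scores is less than one standard error, which fits Tom's instinct that more games are needed.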

One of the problems with test suites is that they are too small. Some people
have tried to predict ELO from test suites using multiple regression, but I
don't know the specifics. I assume they treated the time to solution for each
position in the suite as a predictor variable, but I don't know that for certain.

One of the problems in the social sciences is that published research often
relies on very small samples, which allows regression to take advantage of
chance relationships in the data. In other words, the fit to the original sample
always looks too good, and it looks better the smaller the sample is.
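
The small-sample effect can be illustrated with a quick simulation. This is only a sketch, and it swaps in a simpler stand-in for multiple regression: it generates predictors that are pure noise, picks whichever one correlates best with an equally random target, and reports how strong that chance correlation looks at different sample sizes. All names and parameters here are made up for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def best_chance_correlation(n_samples, n_predictors, rng):
    """Strongest |r| between a random target and any of several random predictors."""
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(predictor, target)))
    return best

def mean_best(n_samples, n_predictors=20, trials=50, seed=1):
    """Average the best chance correlation over several trials."""
    rng = random.Random(seed)
    total = sum(best_chance_correlation(n_samples, n_predictors, rng)
                for _ in range(trials))
    return total / trials

# There is no real relationship in this data, yet the best "predictor"
# looks more impressive as the sample shrinks.
for n in (10, 50, 200):
    print(n, round(mean_best(n), 2))
```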

To have any chance at all of predicting ELOs from a test suite, you'd need a
large number of games that sampled all aspects of chess play. In other words,
you'd need a taxonomy of chess positions, and a large number of samples in each.
I think you'd also need a taxonomy of mistakes that computers tend to make, like
null move problems, and positions that sampled those mistakes.

To develop the prediction equations, you'd split the rated programs into two
pools. One pool would be used to develop the equations. Then these equations
would be used to predict the ratings of the second pool, which is called the
cross-validation group. If the predicted ratings are junk, then you need to go
back and repeat the process with more games, or a better taxonomy of positions
and mistakes, or both.
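
The two-pool procedure above can be sketched as follows. This is a minimal illustration, assuming a single predictor (an overall suite score) and a straight-line fit rather than a full multiple regression. The data is synthetic and invented purely to make the example run; none of it comes from real programs.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def rmse(xs, ys, slope, intercept):
    """Root-mean-square error of the line's predictions."""
    errors = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(errors) / len(errors))

def cross_validate(programs, rng):
    """Fit on one half of (suite_score, rating) pairs, test on the other half.
    Returns (development error, cross-validation error)."""
    shuffled = programs[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    dev, val = shuffled[:half], shuffled[half:]
    dev_x, dev_y = [s for s, _ in dev], [r for _, r in dev]
    val_x, val_y = [s for s, _ in val], [r for _, r in val]
    slope, intercept = fit_line(dev_x, dev_y)
    return (rmse(dev_x, dev_y, slope, intercept),
            rmse(val_x, val_y, slope, intercept))

# Synthetic pool of 40 hypothetical programs: rating loosely tied to suite score.
rng = random.Random(0)
programs = []
for _ in range(40):
    suite_score = rng.uniform(0, 100)
    rating = 1800 + 8 * suite_score + rng.gauss(0, 60)
    programs.append((suite_score, rating))

dev_err, val_err = cross_validate(programs, rng)
print(f"development error: {dev_err:.1f}, cross-validation error: {val_err:.1f}")
```

The number to watch is the cross-validation error: if it is much worse than the development error, the equations were fitting chance patterns in the first pool.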

There's probably enough knowledge in this group to put together an awesome
suite.

Roger




