Computer Chess Club Archives



Subject: Re: Test suites - can they reliably predict ELO?

Author: Enrique Irazoqui

Date: 03:26:50 12/12/99



On December 11, 1999 at 20:18:45, John Warfield wrote:

>On December 11, 1999 at 19:46:50, Bertil Eklund wrote:
>
>>On December 11, 1999 at 17:52:56, Tom King wrote:
>>
>>>Which of the well known test suites predicts the strength of chess programs most
>>>accurately?
>>>
>>>I ask this, because I recently made some *slight* mods. to the evaluation
>>>function in my program, Francesca. I ran the LCT-2 suite, and the results
>>>indicated that it was a wash - the modification gave me about 5 ELO points,
>>>apparently.
>>>
>>>I then ran a series of fast games against another amateur program. I realize
>>>it's important to play a large number of games, to reduce the margin of error,
>>>so I ran two matches of 65 games. The result was this:
>>>
>>>MATCH 1
>>>"Normal" Francesca scored 37% against the amateur program.
>>>
>>>MATCH 2
>>>"Modified" Francesca scored 45% against the amateur program.
>>>
>>>Quite a difference! It implies that the modification is worth over 50 ELO. I
>>>guess I need to play more games, against a variety of programs to verify whether
>>>this improvement is real, or imaginary.
>>>
>>>Anyhow, beware of reading too much into ELO predictions of test suites..
>>>
>>>Cheers All,
>>>Tom
>>
>>Hi!
>>
>>Mr. Irazoqui's secret test suite is very impressive! I think it's about 111
>>positions. He can predict a new program's strength better than any other test I
>>have seen so far. If his predictions remain as good as his previous results, I
>>hope we can stop publishing our list and just play for fun.
>>
>>Bertil SSDF
>
>  Why is this Test secret??

I don't publish my test because the moment I do, it will be cooked and become
worthless. That is one of the reasons well-known test suites are inaccurate.
The others are that they have few positions, some of those positions are
ambiguous or plain wrong, and the rating formula doesn't make sense. Results are
so erratic and unrealistic that, for example, Fritz comes out best on the BS
test and worst on the BT. Etc. etc.

A couple of months ago we started talking about test suites during one of the
Rebel GM games at ICC, and a programmer was straightforward enough to say that a
test won't work because he would cook it the next day...

My test now has 130 positions not included in any other test. It has taken me 11
months so far to put together, and quite a bit longer to figure out, so you can
imagine that I feel quite reluctant to throw it in the garbage. But it is a bit
of a catch-22 situation: if I don't publish it, no one will trust it; if I do,
no one should. :(

In case you are interested, these are my current results for the latest programs:

PIII-500    Test    SSDF scale
RT             0          2691
CM6K         -16          2675
N732         -27          2664
F6-F6a       -33          2658
F532         -33          2658
H732         -38          2653
J5           -70          2621
C171        -104          2587

Now I am running it with Shredder 4, Genius 6.5 and Zarkov 5, but it takes 2
boooooring days per program and I feel quite lazy at the moment.

Enrique
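
The arithmetic behind Tom King's quoted figures (37% and 45%, 65 games each) can be checked with a short sketch. It assumes the usual logistic Elo model and, for the error estimate, treats every game as a plain win or loss. Draws would only shrink the spread, so the standard error here is an upper bound. The function names are illustrative and not taken from any program mentioned in the thread.

```python
import math


def elo_diff(score: float) -> float:
    """Elo difference implied by a match score (0 < score < 1), logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def score_std_error(score: float, games: int) -> float:
    """Standard error of a match score, assuming win/loss games only.

    Ignoring draws overstates the variance, so this is an upper bound.
    """
    return math.sqrt(score * (1.0 - score) / games)


GAMES = 65  # games per match, as quoted in the thread

# Implied Elo gain of the "modified" version over the "normal" one.
gain = elo_diff(0.45) - elo_diff(0.37)

# Standard error of the difference between the two independent match scores.
diff_se = math.sqrt(
    score_std_error(0.37, GAMES) ** 2 + score_std_error(0.45, GAMES) ** 2
)

print(f"implied gain: {gain:.1f} Elo")
print(f"score difference: 0.08, standard error of difference: {diff_se:.3f}")
```

Under these assumptions the implied gain is a little under 58 Elo, consistent with "over 50", but the 8-point score difference is smaller than one standard error of the difference. That supports the caution in the quoted message that more games are needed before treating the improvement as real.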




Last modified: Thu, 15 Apr 21 08:11:13 -0700
