Computer Chess Club Archives



Subject: Re: How to use a [cough] EPD test suite to estimate ELO

Author: Bill McGaugh

Date: 13:18:48 02/11/99


I don't know how other people do it, but I came up with a method based on the
Elo system itself and using the results from the SSDF list.

-----------------------------------------------------

Here is the idea:

I assign Elo ratings to the problems themselves.  To calibrate the
problems I must run them at three minutes a move (because SSDF ratings
are based on that time control) on a variety of programs.  For example,
let's say we have 10 different programs with an average rating on the
SSDF list of 2450 (all on the same platform).
If we run a particular problem on all 10 programs and all solve it, we
throw it out.  If we run it and all fail to solve it, we throw it out.
If we find a problem where 5 programs solve it and 5 fail (in three
minutes), the problem is given an Elo rating of 2450 (50%). If we find a
problem that only one out of 10 programs can solve, it is given an Elo
rating equal to winning 9 games out of 10 = 2450+358 (from Elo's
tables) = 2808.  Once we have found and calibrated a large number of
problems, we can use these problem ratings to rate programs. We compute
the average difficulty rating for the suite of problems and then test a
program by running it through the suite.  If the suite rating is 2400
and the program gets 50% of the problems in the given time, then it has
a rating of 2400...etc.  I hope this makes sense.
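
A minimal sketch of the calibration step described above, in Python. The
function names (elo_diff, problem_rating) are made up for illustration.
It also assumes Elo's normal-distribution model (standard deviation
200*sqrt(2)) in place of the printed tables, so the numbers land close to
the ones above without matching them exactly.

```python
from math import sqrt
from statistics import NormalDist
from typing import Optional


def elo_diff(score: float) -> float:
    """Rating difference implied by an expected score strictly between 0 and 1.

    Assumption: Elo's normal model, not a lookup in his printed tables.
    """
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 200 * sqrt(2) * NormalDist().inv_cdf(score)


def problem_rating(pool_average: float, solved: int, tested: int) -> Optional[float]:
    """Rate one problem from how many of the tested programs solved it.

    Returns None when every program solved it or every program failed,
    because such a problem is thrown out.
    """
    if solved == 0 or solved == tested:
        return None
    # The problem "wins" each time a program fails to solve it.
    problem_score = (tested - solved) / tested
    return pool_average + elo_diff(problem_score)


if __name__ == "__main__":
    print(problem_rating(2450, 5, 10))   # half solve it: the pool average
    print(problem_rating(2450, 1, 10))   # one in ten solves it: well above
    print(problem_rating(2450, 10, 10))  # everyone solves it: discarded
```

With a pool average of 2450, a problem solved by 1 program in 10 comes
out a little above 2800 under this model, in the same range as the
table-based 2808.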

I started checking out the idea by taking some of the limited data from
the CCR site...using the times for the Louguet 2 and the BT test, I
arrived at the following, after throwing out the "irrelevant" problems:

 bt4-2664
 bt9-2664
 bt12-2254
 bt24-2254
 bt25-2388
 bt26-2388
 bt29-2521
 bt30-2254
 lgt3-2628
 lgt4-2197
 lgt5-2628
 lgt6-2136
 lgt7-2443
 lgt8-2320
 lgt9-2259
 lgt10-2443
 lgt11-2505
 lgt12-2689
 lgt13-2628
 lgt17-2074
 lgt20-2013
 lgt21-2259
 lgt22-2013
 lgt23-2197
 lgt24-2505
 lgt29-2259
 lgt30-2443
 lgt31-2259
 lgt32-2443
 lgt33-2566

 average rating for the suite= 2376.47
 so a score of 15 out of 30 = 2376

 1 out of 30 = 1838 (from Elo's percentage expectancy tables...rounded
                     to the nearest percent...no interpolation
                     between percents)
 2 out of 30 = 1954
 3 out of 30 = 2010
 4 out of 30 = 2054
 5 out of 30 = 2103
 6 out of 30 = 2136
 7 out of 30 = 2165
 8 out of 30 = 2201
 9 out of 30 = 2227
 10 out of 30 = 2251
 11 out of 30 = 2281
 12 out of 30 = 2304
 13 out of 30 = 2326
 14 out of 30 = 2355
 15 out of 30 = 2376
 16 out of 30 = 2397
 17 out of 30 = 2426
 18 out of 30 = 2448
 19 out of 30 = 2471
 20 out of 30 = 2501
 21 out of 30 = 2525
 22 out of 30 = 2551
 23 out of 30 = 2587
 24 out of 30 = 2616
 25 out of 30 = 2649
 26 out of 30 = 2698
 27 out of 30 = 2742
 28 out of 30 = 2798
 29 out of 30 = 2914
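
Here is a rough sketch of how a score on the suite turns into a rating,
using the 30 problem ratings listed above. The names SUITE_RATINGS and
performance_rating are hypothetical. The conversion assumes Elo's normal
model rather than the printed percentage expectancy tables, so it tracks
the table above to within a few points but will not reproduce it exactly.

```python
from math import sqrt
from statistics import NormalDist, mean

# Problem ratings from the list above (bt4 ... lgt33), in the same order.
SUITE_RATINGS = [
    2664, 2664, 2254, 2254, 2388, 2388, 2521, 2254,  # BT problems
    2628, 2197, 2628, 2136, 2443, 2320, 2259, 2443,  # Louguet 2 problems
    2505, 2689, 2628, 2074, 2013, 2259, 2013, 2197,
    2505, 2259, 2443, 2259, 2443, 2566,
]


def performance_rating(solved: int, ratings: list) -> float:
    """Suite average plus the rating difference for the fraction solved.

    The fraction is rounded to the nearest whole percent with no
    interpolation, as in the table. Scores of 0 or all problems have no
    finite rating and are rejected.
    """
    total = len(ratings)
    if not 0 < solved < total:
        raise ValueError("solved must be between 1 and len(ratings) - 1")
    percent = round(100 * solved / total) / 100
    # Assumption: normal model with standard deviation 200 * sqrt(2).
    return mean(ratings) + 200 * sqrt(2) * NormalDist().inv_cdf(percent)


if __name__ == "__main__":
    print(f"suite average = {mean(SUITE_RATINGS):.2f}")
    for solved in (1, 15, 18, 29):
        rating = performance_rating(solved, SUITE_RATINGS)
        print(f"{solved} out of 30 = {rating:.0f}")
```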

Testing a few programs on my P100 with the suite:
Zarkov 4.5c = 2426
Mchess 7.1 = 2426
Hiarcs 6 = 2448
Rebel 8 = 2471

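I also tried a little experiment on a K6-233, to study the effect of
doubling time on rating (using Zarkov 4.5c):

 10 seconds - 8 positions solved - 2201 rating
 20 seconds - 12 solved - 2304
 40 seconds - 16 solved - 2397
 80 seconds - 18 solved - 2448
 160 seconds - 18 solved - 2448
 320 seconds - 21 solved - 2525
 640 seconds - 22 solved - 2551
 1280 seconds - 24 solved - 2616

That is an average of 59.29 points per doubling over the seven doublings
from 10 to 1280 seconds...almost exactly what people have been saying
for some time.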
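
The per-doubling average can be checked with a few lines of Python. This
is only a sketch of the arithmetic. The names RESULTS and
points_per_doubling are made up for the example, and the data is copied
from the experiment above.

```python
from math import log2

# (seconds per position, positions solved, suite rating) from the experiment.
RESULTS = [
    (10, 8, 2201),
    (20, 12, 2304),
    (40, 16, 2397),
    (80, 18, 2448),
    (160, 18, 2448),
    (320, 21, 2525),
    (640, 22, 2551),
    (1280, 24, 2616),
]


def points_per_doubling(results):
    """Average rating gain per doubling of time, first row to last row."""
    first_time, _, first_rating = results[0]
    last_time, _, last_rating = results[-1]
    doublings = log2(last_time / first_time)
    return (last_rating - first_rating) / doublings


if __name__ == "__main__":
    # Gain at each individual doubling, to show how uneven the steps are.
    gains = [b[2] - a[2] for a, b in zip(RESULTS, RESULTS[1:])]
    print("gain per step:", gains)
    print(f"average per doubling: {points_per_doubling(RESULTS):.2f}")
```

The individual steps range from 0 to 103 points, so the average is
carried by the endpoints and a single run like this is quite noisy.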


I think that my rating system has potential, but what I need to do is
base the problem ratings on a larger number of different programs over
a broader range of ratings and assemble a large suite (100+ problems)
with a nice combination of opening, middlegame, and endgame
positions, combining both tactical and positional problems.

----------------------------------------------------
The notes above are from over a month ago.  I'm continuing to work
on building a test suite based on this method.




Last modified: Thu, 15 Apr 21 08:11:13 -0700
