Author: Bill McGaugh
Date: 13:18:48 02/11/99
I don't know how other people do it, but I came up with a method based on the
Elo system itself and using the results from the SSDF list.
-----------------------------------------------------
Here is the idea:
I assign Elo ratings to the problems themselves. To calibrate the
problems I must run them at three minutes a move (because SSDF ratings
are based on that time control) on a variety of programs. For example,
let's say we have 10 different programs with an average rating on the
SSDF list of 2450 (all on the same platform).
If we run a particular problem on all 10 programs and all solve it, we
throw it out. If we run it and all fail to solve it, we throw it out.
If we find a problem where 5 programs solve it and 5 fail (in three
minutes), the problem is given an Elo rating of 2450 (50%). If we find a
problem that only one out of 10 programs can solve, it is given an Elo
rating equal to winning 9 games out of 10 = 2450+358 (from Elo's
tables) = 2808. Once we have calibrated a large number of
problems, we can use these problem ratings to rate programs. We compute
the average difficulty rating for the suite of problems and then test a
program by running it through the suite. If the suite rating is 2400
and the program solves 50% of the problems in the given time, then it has
a rating of 2400...etc. I hope this makes sense; a sketch of the
calibration step follows.
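To make the calibration step concrete, here is a minimal sketch in
Python (my own illustration, not code from the original method). It
assumes the logistic Elo expectancy curve, E = 1 / (1 + 10**(-d/400));
Elo's printed tables are based on the normal distribution, so the
offsets come out a few points different (9 out of 10 gives about +382
here versus the +358 quoted above). The function name problem_rating
is mine.

import math

def problem_rating(avg_program_rating, solved, total):
    """Elo-rate a problem from how many calibration programs solve it."""
    p = solved / total  # fraction of the field that solves the problem
    if p == 0.0 or p == 1.0:
        raise ValueError("all-solve and all-fail problems are thrown out")
    # The problem effectively "scores" (1 - p) against the field, so it
    # sits 400 * log10((1 - p) / p) points above the field's average.
    return avg_program_rating + 400 * math.log10((1 - p) / p)

print(problem_rating(2450, 5, 10))  # 2450.0 -- the 50% case
print(problem_rating(2450, 1, 10))  # ~2831.7 (vs. 2808 from Elo's tables)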
I started checking out the idea by taking some of the limited data from
the CCR site...using the times for the Louguet 2 and the BT test, I
arrived at the following, after throwing out the "irrelevant" problems:
bt4- 2664
bt9- 2664
bt12-2254
bt24-2254
bt25-2388
bt26-2388
bt29-2521
bt30-2254
lgt3-2628
lgt4-2197
lgt5-2628
lgt6-2136
lgt7-2443
lgt8-2320
lgt9-2259
lgt10-2443
lgt11-2505
lgt12-2689
lgt13-2628
lgt17-2074
lgt20-2013
lgt21-2259
lgt22-2013
lgt23-2197
lgt24-2505
lgt29-2259
lgt30-2443
lgt31-2259
lgt32-2443
lgt33-2566
average rating for the suite = 2376.47
so a score of 15 out of 30 = 2376
1 out of 30 = 1838 (from Elo's percentage expectancy tables, rounded to
the nearest percent, with no interpolation between percents; see the
sketch after the table)
2 out of 30 = 1954
3 out of 30 = 2010
4 out of 30 = 2054
5 out of 30 = 2103
6 out of 30 = 2136
7 out of 30 = 2165
8 out of 30 = 2201
9 out of 30 = 2227
10 out of 30 = 2251
11 out of 30 = 2281
12 out of 30 = 2304
13 out of 30 = 2326
14 out of 30 = 2355
15 out of 30 = 2376
16 out of 30 = 2397
17 out of 30 = 2426
18 out of 30 = 2448
19 out of 30 = 2471
20 out of 30 = 2501
21 out of 30 = 2525
22 out of 30 = 2551
23 out of 30 = 2587
24 out of 30 = 2616
25 out of 30 = 2649
26 out of 30 = 2698
27 out of 30 = 2742
28 out of 30 = 2798
29 out of 30 = 2914
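The same inverse lookup reproduces this table; the sketch below is my
own, and it uses the logistic curve rather than Elo's
normal-distribution tables, so the values drift by a few points near
the middle and by more at the extremes (1 out of 30 comes out near
1792 here rather than 1838).

import math

SUITE_AVG = 2376.47  # average problem rating of the 30-position suite

def suite_rating(solved, total=30):
    """Rating implied by solving `solved` of the suite's `total` problems."""
    p = solved / total
    return SUITE_AVG + 400 * math.log10(p / (1 - p))

for solved in range(1, 30):
    print(f"{solved:2d} out of 30 = {suite_rating(solved):.0f}")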
Testing a few programs on my P100 with the suite:
Zarkov 4.5c = 2426
Mchess 7.1 = 2426
Hiarcs 6 = 2448
Rebel 8 = 2471
I also tried a little experiment on a K6-233, to study the effect of
doubling time on rating (using Zark 4.5c); see the sketch after the
table:

seconds   solved   rating
    10        8      2201
    20       12      2304
    40       16      2397
    80       18      2448
   160       18      2448
   320       21      2525
   640       22      2551
  1280       24      2616

(an average of 59.29 points per doubling...almost exactly
what people have been saying for some time)
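A quick sketch to check the per-doubling figure (my framing; the post
just averages the endpoints over the seven doublings):

import math

# (seconds per move, suite rating) from the K6-233 run above
runs = [(10, 2201), (20, 2304), (40, 2397), (80, 2448),
        (160, 2448), (320, 2525), (640, 2551), (1280, 2616)]

# Endpoint average, as quoted: total gain spread over 7 doublings
print(round((runs[-1][1] - runs[0][1]) / (len(runs) - 1), 2))  # 59.29

# A least-squares slope over all eight points gives a similar figure
xs = [math.log2(t) for t, _ in runs]
ys = [float(r) for _, r in runs]
n = len(runs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(round(slope, 2))  # ~53.86 Elo per doubling

The least-squares slope comes out a bit lower than the endpoint
average because the 2448 plateau at 80 and 160 seconds flattens the
middle of the fit.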
I think that my rating system has potential, but what I need to do is
base the problem ratings on a larger number of different programs over
a broader range of ratings and assemble a large suite (100+ problems)
that is a good mix of opening, middlegame, and endgame
positions, combining both tactical and positional problems.
----------------------------------------------------
The notes above are from over a month ago. I'm continuing to work
on building a test suite based on this method.