Author: Walter Koroljow
Date: 13:04:10 09/05/00
This is a sequel to an analysis posted in July 2000
(http://site2936.dellhost.com/forums/1/message.shtml?121825). The approach here
is simulation, not analysis, and the results are much stronger.
***Data, Results, and an Overview of the Approach***
*******************************************************
The data is Chris Carson's human vs. computer data posted 7-16-00 (and available
at http://home.interact.se/~w100107/welcome.htm). I used all the data for
single and multiple Pentium and Athlon processors at speeds of 200 or more MHz.
Games against unrated players were not included. Two games from the Dutch
championship were not included since no more than 4 moves were made. Games
played by Zugzwang at Lippstadt in July 98 were included although I do not know
what hardware was used. A list of the programs used is at the end of this post.
A total of 168 games was used and the programs scored 78 wins, 60 draws, and 30
losses for 64.3%. The average human rating was 2419.4 FIDE. A single program
with a rating of 2542 would have achieved this result.
In what follows, "Program" means a combination of hardware and software, e.g.,
Hiarcs6 running on a Pmmx-200.
Now suppose that program ratings are evenly distributed over a range of values
with the number of programs going to zero smoothly at the edges of this range.
(This will be described in more detail later). Then we compute the 95%
confidence intervals for the average rating of the programs as a function of the
spread of values. The results are:
                       95% Confidence Interval
Spread of Ratings      for Average Rating
-----------------      -----------------------
        0              2504-2579
      200              2504-2583
      400              2504-2595
For example, suppose the spread of ratings is 200 points. Suppose the mean is
2550 (in the confidence interval 2504-2583) and is in the center of the spread.
Then the ratings will range from 2450 to 2650.
As will be shown later, confidence intervals depend on the number of draws
produced. The results above are based on 60 draws out of the 168 games as was
observed. To verify that this does not critically affect the results, the
calculations were repeated for 50 draws (almost two standard deviations less
than observed). The results are:
Sensitivity Analysis: 10 fewer draws than observed
--------------------------------------------------
                       95% Confidence Interval
Spread of Ratings      for Average Rating
-----------------      -----------------------
        0              2502-2581
      200              2502-2586
      400              2502-2597
Draws are clearly not a critical issue.
The method of computation is to assume a mean and a spread and to generate
random program ratings with this distribution. We then simulate the 168 games
and record a total score. Repeating this one million times (a Monte Carlo
simulation) gives a probability distribution of the total program score for
these 168 games. If this distribution shows that a score as extreme as the
observed 108 points has a probability of less than 5%, we reject the assumed
mean and spread, since they predict such a low probability for what was
actually observed. By repeating the simulation for different means and spreads,
we finally arrive at values we can accept. These are the values in the tables
above.
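To make this concrete, here is a minimal sketch of one such run in C. It is
not the 6-page program: for brevity it uses a single fixed program rating
(zero spread) against the 2419.4 average human rating, and the draw ceiling
DMAX is an illustrative value. The full simulation draws 30 program ratings
from the distribution described below and uses the individual opponent
ratings from the data table at the end of this post.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N_GAMES  168
#define N_TRIALS 1000000
#define DMAX     0.54        /* illustrative draw ceiling at We = 0.5 */

/* Assumption 1: the USCF win expectancy formula. */
double win_expectancy(double ours, double theirs)
{
    return 1.0 / (1.0 + pow(10.0, (theirs - ours) / 400.0));
}

/* Assumption 2: draw probability rises linearly from 0 at
   We = 0 or We = 1 to DMAX at We = 0.5. */
double draw_prob(double we)
{
    return DMAX * 2.0 * (we <= 0.5 ? we : 1.0 - we);
}

/* One game: returns 1, 0.5, or 0 points for the program. */
double play_game(double prog, double human)
{
    double we = win_expectancy(prog, human);
    double d  = draw_prob(we);
    double w  = we - d / 2.0;            /* pure win probability: We = w + d/2 */
    double u  = rand() / (RAND_MAX + 1.0);

    if (u < w)     return 1.0;
    if (u < w + d) return 0.5;
    return 0.0;
}

int main(void)
{
    double prog = 2550.0, human = 2419.4;   /* assumed mean, average human */
    long   tail = 0;
    int    t, g;

    for (t = 0; t < N_TRIALS; t++) {
        double score = 0.0;
        for (g = 0; g < N_GAMES; g++)
            score += play_game(prog, human);
        if (score >= 108.0)                  /* observed total: 108 points */
            tail++;
    }
    printf("P(score >= 108) = %.4f\n", (double)tail / N_TRIALS);
    return 0;
}

Compile with something like cc sim.c -lm; the printed tail probability is
what gets compared against the cutoff.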
There are two major effects from increasing the spread. One is that the
confidence interval expands due to the increase in uncertainty introduced by the
spread. The other effect is that just increasing the spread (leaving the mean
alone) lowers the probable score because the win expectancy curve is steeper in
the region of the lower ratings. This second effect only applies when the
programs outrate the competition as they did here. The result is that the two
effects pretty much cancel each other for the left endpoint of the confidence
intervals and combine to move the right endpoint up as the spread increases.
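For a rough illustration of the second effect: against the 2419.4 average
human, a single 2550-rated program has We = 0.680, while 2450- and 2650-rated
programs (the same mean) average only (0.544 + 0.790)/2 = 0.667. Spreading the
ratings while holding the mean fixed therefore costs expected score.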
The simulation (which contains the data used) takes about 6 pages of C. I will
provide it on request.
The rest of this post goes into more detail. It is divided into sections for
easier reading.
***Assumptions***
********************************************************
1) The USCF win expectancy formula: We = 1/(1 + 10**((delta rating)/400)), where
delta rating = (opponent's rating) - (our rating).
2) Of course, the probability of a draw is 0 for We = 0 or We = 1. We assume
that this probability varies linearly from these points to a maximum at We =
0.5. We set this maximum to produce the right expected number of draws over all
168 games (a sketch of this calibration follows the list).
3) All chess games are statistically independent events.
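Because each game's draw probability is proportional to this maximum, the
expected number of draws is linear in it, so the maximum can be solved for
directly. A minimal sketch of the calibration, collapsing all 168 games to the
average ratings for brevity (the real calculation sums over the individual
pairings in the data table below):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Illustrative: all 168 games collapsed to a 2542 program vs. the
       2419.4 average human; the real sum runs over the individual pairings. */
    double we = 1.0 / (1.0 + pow(10.0, (2419.4 - 2542.0) / 400.0));

    /* Draw probability = dmax * f(We), with f linear and peaking at 1 when
       We = 0.5, so the expected number of draws is dmax * Sum(f).
       Setting that equal to the 60 observed gives dmax. */
    double f = 2.0 * (we <= 0.5 ? we : 1.0 - we);

    printf("dmax = %.3f\n", 60.0 / (168.0 * f));
    return 0;
}

With these averages the ceiling comes out near 0.54; the exact value depends
on the assumed program ratings.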
***The Distribution of Program Ratings***
*********************************************************
Ratings for the programs were generated via a uniform distribution over the
rating spread. The whole sample was then shifted right or left so that its mean
matched the desired value. Because the size of this shift varies randomly from
sample to sample, the edges of the uniform distribution were smoothed and became
"transition regions". The size of these transition regions is on the order of
one or two standard deviations of the mean of the ratings of the thirty
programs, which is about 5-10% of the spread of the ratings. So a 200 point
spread would be bordered by two transition regions of about 10-20 points each.
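In code, one pass of this generation step might look like the following sketch
(using the C library rand() for brevity; the final shift is exactly what
produces the transition regions):

#include <stdio.h>
#include <stdlib.h>

#define N_PROGS 30

/* Draw N_PROGS ratings uniformly over [mean - spread/2, mean + spread/2],
   then shift the whole sample so that its mean lands exactly on target. */
void gen_ratings(double r[], double mean, double spread)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < N_PROGS; i++) {
        double u = rand() / (RAND_MAX + 1.0);   /* uniform in [0,1) */
        r[i] = mean - spread / 2.0 + u * spread;
        sum += r[i];
    }
    for (i = 0; i < N_PROGS; i++)               /* correct the sample mean */
        r[i] += mean - sum / N_PROGS;
}

int main(void)
{
    double r[N_PROGS], check = 0.0;
    int i;

    gen_ratings(r, 2550.0, 200.0);
    for (i = 0; i < N_PROGS; i++)
        check += r[i];
    printf("sample mean = %.3f\n", check / N_PROGS);   /* exactly 2550.000 */
    return 0;
}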
**Simulation Verification**
********************************************************
First we note that the results for a spread of 0 can be derived analytically.
In what follows, let:
w     = win probability,
d     = draw probability,
We    = win expectancy = w + d/2,
V1    = variance of score for one game,
Sigma = sqrt(variance),
S     = expected score for all games.
We note that the score of a single game is 1, 1/2, or 0, so its second moment
is w + d/4 = (w + d/2) - d/4. Hence:
V1 = w + d/2 - (w + d/2)**2 - d/4
   = We - We**2 - d/4
where w and d apply to a single game. Then, since chess games are independent
events, we can add these variances over all 168 games. The results are:
Sigma = sqrt(Sum(We) - Sum(We**2) - D/4)
S     = Sum(We)
where D is the expected number of draws in the 168 games. Furthermore, since the
total score is the sum of 168 independent events, we expect the total score to
be distributed normally. This means we can calculate percentiles analytically.
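For instance, with the S and Sigma from the analytical row of the table below,
the percentile of the observed score follows from the normal CDF (a sketch
using the C99 erf):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Values from the spreadsheet evaluation at rating 2504, zero spread */
    double S = 100.662, sigma = 4.370, observed = 108.0;

    /* One-sided normal percentile of the observed score:
       Phi(z) = 0.5 * (1 + erf(z / sqrt(2))) */
    double z = (observed - S) / sigma;

    printf("z = %.3f, percentile = %.1f%%\n",
           z, 50.0 * (1.0 + erf(z / sqrt(2.0))));
    return 0;
}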
All this can be evaluated via spreadsheet and compared to the results of the
simulation. Using 16 million simulation iterations, the numbers for an assumed
program rating of 2504 and zero spread are:
                  S       Sigma    Confidence for score = 108
               -------    -----    --------------------------
Analytical     100.662    4.370              95.4%
Simulation     100.671    4.385              95.9%
The numbers give considerable credence to the simulation results for zero
spread.
For the case of non-zero spread, we note that we do not expect much difference
from the zero spread case. The reason is that everything is based on the
distribution of scores from a series of 168 games. But the win expectancy
formula is nearly linear over most of the range of rating differences we have to
deal with. That means that the scores will depend primarily on the mean program
rating and will be only weakly dependent on the spread.
More specifically, the means of the randomized ratings were checked numerically
and the ratings were checked by eye. Finally, an independent program
(basically, an integration over the spread) was written to evaluate the number
of draws expected in the simulation. This number closely matched the number
produced by the simulation with non-zero spreads.
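The core of that independent check might look like the following sketch
(midpoint-rule integration over a plain uniform spread, neglecting the small
transition regions, with the illustrative ratings and draw ceiling used
earlier):

#include <stdio.h>
#include <math.h>

#define STEPS 1000

/* USCF win expectancy */
double win_expectancy(double ours, double theirs)
{
    return 1.0 / (1.0 + pow(10.0, (theirs - ours) / 400.0));
}

/* Expected draw probability for one game, averaged over a program rating
   uniform on [mean - spread/2, mean + spread/2] (midpoint rule). */
double avg_draw(double mean, double spread, double human, double dmax)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < STEPS; i++) {
        double r  = mean - spread / 2.0 + (i + 0.5) * spread / STEPS;
        double we = win_expectancy(r, human);
        sum += dmax * 2.0 * (we <= 0.5 ? we : 1.0 - we);
    }
    return sum / STEPS;
}

int main(void)
{
    /* Illustrative: all 168 games vs. the 2419.4 average human */
    printf("expected draws = %.1f\n",
           168.0 * avg_draw(2550.0, 200.0, 2419.4, 0.54));
    return 0;
}

With these illustrative inputs the count comes out close to the 60 observed.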
**Data Used**
**************************************************************
The following is the brute force data input I used. Each line corresponds to a
program. The first number in each line is the number of non-zero entries in the
line. The remaining numbers are the ratings of the human opponents for the
program.
int Human_Ratings[30][12] = {
/*Reb8 P200*/       {3,2450,2485,2485,0,0,0,0,0,0,0,0},
/*Hiarcs6 P200*/    {8,2485,2485,2485,2485,2485,2485,2500,2500,0,0,0},
/*CM5K P2-300*/     {6,2035,2305,2235,2408,2330,2235,0,0,0,0,0},
/*Hiarcs6 P2-266*/  {4,2125,2180,2330,2320,0,0,0,0,0,0,0},
/*Reb9 P225*/       {8,2190,2100,2330,2340,2408,2205,2100,2320,0,0,0},
/*Reb10 K62-450*/   {2,2795,2795,0,0,0,0,0,0,0,0,0},
/*Zugzwang ?*/      {11,2470,2610,2470,2475,2550,2455,2540,2470,2410,2475,2390},
/*CM6K P2-450*/     {7,2340,2325,2180,2225,2470,2130,2205,0,0,0,0},
/*Hiarcs6 P2-350*/  {7,2340,2325,2180,2225,2470,2130,2205,0,0,0,0},
/*Reb10 P2-400*/    {7,2340,2325,2180,2225,2470,2130,2205,0,0,0,0},
/*Reb-cen K6-450*/  {1,2585,0,0,0,0,0,0,0,0,0,0},
/*DJ6 4x-500*/      {1,2635,0,0,0,0,0,0,0,0,0,0},
/*DJ6 Cel-450*/     {1,2100,0,0,0,0,0,0,0,0,0,0},
/*DJ6 2x-450*/      {4,2443,2177,2271,2290,0,0,0,0,0,0,0},
/*Ferret multi*/    {1,2630,0,0,0,0,0,0,0,0,0,0},
/*Fritz6 multi*/    {1,2620,0,0,0,0,0,0,0,0,0,0},
/*Shred4 P3-500*/   {8,2600,2443,2298,2566,2326,2100,2320,2109,0,0,0},
/*Reb-cen K6-600*/  {8,2524,2593,2552,2486,2541,2515,2585,2501,0,0,0},
/*Pconners multi*/  {11,2506,2550,2567,2478,2533,2399,2472,2301,2417,2467,2428},
/*Reb-cen P2-350*/  {1,2399,0,0,0,0,0,0,0,0,0,0},
/*Reb-cen K6-300*/  {3,2466,2450,2466,0,0,0,0,0,0,0,0},
/*Reb-cen P3-500*/  {10,2432,2432,2320,2507,2177,2566,2445,2276,2271,2443,0},
/*Reb-cen K6-400*/  {4,2398,2398,2398,2398,0,0,0,0,0,0,0},
/*Fritz6 K7-500*/   {9,2177,2515,2548,2298,2566,2276,2507,2100,2158,0,0},
/*Gundar P500*/     {6,2370,2370,2295,2483,2325,2405,0,0,0,0,0},
/*Fritz6 P3-500*/   {6,2405,2265,2285,2370,2483,2295,0,0,0,0,0},
/*Reb-cen K7-1000*/ {1,2485,0,0,0,0,0,0,0,0,0,0},
/*DJ6 8x-500*/      {9,2701,2755,2770,2615,2667,2660,2762,2649,2743,0,0},
/*Pconners multi*/  {11,2612,2513,2545,2623,2606,2357,2357,2547,2539,2480,2523},
/*Fritz6SSS multi*/ {9,2641,2537,2561,2540,2393,2641,2567,2498,2558,0,0}};
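For completeness, here is how a driver might walk the table (a sketch that
assumes it is appended after the array above):

#include <stdio.h>

/* Entry [p][0] is the number of games for program p; entries
   [p][1..] are that program's opponent ratings. */
int main(void)
{
    int p, g, total = 0;

    for (p = 0; p < 30; p++)
        for (g = 1; g <= Human_Ratings[p][0]; g++) {
            /* a simulation would play program p vs. Human_Ratings[p][g] here */
            total++;
        }
    printf("games in table: %d\n", total);   /* prints 168 */
    return 0;
}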