Computer Chess Club Archives


Subject: Quantitative Implications of Chris Carson's data.

Author: Walter Koroljow

Date: 09:49:31 07/26/00


It is possible to infer something quantitative from the data that Chris has
accumulated.  Based on some plausible assumptions I make the following claim:

*******************************************************
***Claim: Chris Carson's data implies that some program has a rating of 2496 or
more, with approximately 95% confidence.***
*******************************************************

If the results were produced by a single program, its TPR would be 2544.  This
gives us some idea of how many rating points we have to give up to have 19-to-1
odds in our favor.  To be fair, we should also mention that we could equally
well infer that there is a program with a rating at or below 2602, with 95%
confidence!
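As a sanity check, the TPR arithmetic can be reproduced by inverting the USCF
win-expectancy formula given in the assumptions below.  The average opponent
rating of about 2422 used here is my own back-solved figure, not a number
stated in the post:

```python
import math

def tpr(score, games, avg_opp_rating):
    """Tournament performance rating: invert We = 1/(1 + 10**((Ropp - R)/400)),
    setting We to the observed score fraction and solving for R."""
    we = score / games
    return avg_opp_rating - 400 * math.log10(1 / we - 1)

# With 103 points in 154 games against a hypothetical average opponent rated
# 2422, the performance rating comes out near the 2544 quoted above.
performance = tpr(103, 154, 2422)
```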

The result above is based solely on the data and is therefore most useful when
the situations which produced the data do not change.  On the other hand, such
considerations may be used, just like changing ratings, to track the progress
of, say, GMs learning successful anti-computer tactics and of programmers
fending them off.  Anyway, I am interested in what, if anything, CCC makes of
such a result.

The rest of this post just derives the claim above.  It is divided into three
sections for easier reading.


***Assumptions***:
********************************************************

1) Chris' TPR data posted 7-16-00.  I used all the data after May of '97.  This
excluded three things: the two DB matches and 9 games by MCP in 1990 on a
486-25.  All the other data except for Zugzwang was for single or multiple
Pentium, K-6, or K-7 processors.  I do not know what Zugzwang ran on, so I left
it in.  The data used consisted of 154 games with 75 wins, 56 draws, and 23
losses.

2) The USCF win expectancy formula: We = 1/(1+10**((delta rating)/400)), where
delta rating = (opponents rating) - (our rating).

3) All chess games are statistically independent events.

4) To save words, when I say "Program" I mean a combination of hardware and
software, e.g., Hiarcs6 running on a Pmmx-200.

5) An event such as DJ6 at Dortmund is treated as DJ6 playing 9 games against 9
players all of whom were rated at the average Dortmund rating of 2702.

6) Some mild approximations are used in the arguments that follow.
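A minimal sketch of assumption 2), the USCF win-expectancy formula, in Python:

```python
def win_expectancy(our_rating, opp_rating):
    """USCF win expectancy: We = 1/(1 + 10**((Ropp - Rus)/400))."""
    return 1 / (1 + 10 ** ((opp_rating - our_rating) / 400))
```

Equal ratings give We = 0.5, and a 400-point rating deficit gives
We = 1/11, about 0.091.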


**Standard Deviation (Sigma) and Distribution of Score**
********************************************************

In what follows, let:

w = win probability,
d = draw probability,
v1 = variance of score after one game,
N = number of games,
V = variance of score after N games,
<x> = average of x over the N games (x is anything), and
Sigma = sqrt(V).

Then we can compute:

The score of a single game is 1 with probability w, 1/2 with probability d, and
0 otherwise, so its mean is w+d/2 and its mean square is w+d/4.  Thus:

v1 = w+d/4 - (w+d/2)**2,  where w and d apply to a single game.  Then, since
chess games are independent events,

V = N*<v1> = N*(<w>+<d>/4-<(w+d/2)**2>).

But <x**2> >= <x>**2 for any real x.  Therefore:

V/N <= <w>+<d>/4-(<w>+<d>/2)**2.  Or, to put it another way:

(1)  V <= N*(<w>+<d>/4-(<w>+<d>/2)**2).

We can thus bound the variance using expression (1) above.  For the average w
and d, we use the empirical values from Chris' data (see remarks at the end of
the "Minimum Guaranteed Rating" section).  Namely, <w> = 75/154 = .487, and <d>
= 56/154 = .3636.  Incidentally, the win expectancy is (75+56/2)/154 = .669.
With these values for <d> and <w>, we have:

V <= 20.110 and
Sigma <= 4.484.
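These two bounds can be checked numerically by plugging the empirical <w> and
<d> into the variance bound:

```python
w = 75 / 154   # empirical win fraction <w>
d = 56 / 154   # empirical draw fraction <d>
N = 154        # number of games

# Per-game variance bound <w> + <d>/4 - (<w> + <d>/2)**2, scaled by N games.
v1_bound = w + d / 4 - (w + d / 2) ** 2
V_bound = N * v1_bound
sigma_bound = V_bound ** 0.5
```

This reproduces V <= 20.110 and Sigma <= 4.484 as quoted above.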

The total score is the sum of the scores of the games.  By the central limit
theorem, the distribution of a sum of identically distributed variables tends to
a normal distribution as the number of games increases.  We know from experience
that this works well even if the distributions are only similar.  We also know
that for a sum of binomially distributed variables (chess without draws) the
distribution is very close to normal after only 15 games or so.  Therefore we
expect the distribution to be extremely close to normal after a full 154 games.
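The normality claim can be checked informally with a Monte Carlo sketch,
assuming (as a simplification) that every game has the same empirical win and
draw probabilities, in which case the bound in (1) is attained exactly:

```python
import random

random.seed(7)
w, d = 75 / 154, 56 / 154   # empirical per-game win/draw probabilities
N, trials = 154, 20000

def total_score():
    """Score of one simulated 154-game sample: 1 per win, 1/2 per draw."""
    s = 0.0
    for _ in range(N):
        u = random.random()
        if u < w:
            s += 1.0
        elif u < w + d:
            s += 0.5
    return s

scores = [total_score() for _ in range(trials)]
mean = sum(scores) / trials
sd = (sum((s - mean) ** 2 for s in scores) / trials) ** 0.5
# For a normal distribution, about 95.4% of samples land within 2 sd.
within_2sd = sum(abs(s - mean) <= 2 * sd for s in scores) / trials
```

The simulated mean and sd should land near 103 and 4.48, and roughly 95% of
the simulated totals fall within 2 sigma, as a normal distribution predicts.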


**Minimum "Guaranteed" Rating**
********************************

In this section we reject the hypothesis that all the programs in Chris' data
have a rating at or below 2495.  This is equivalent to accepting the hypothesis
that there exists a program that has a rating above 2495.  This is done at the
2-sigma level for a distribution that is very close to normal.  Thus, this is a
statement made at approximately the 95% confidence level.

Using the USCF win expectancy formula and Chris' data, if all the programs have
a rating of 2495, the score distribution will have a mean of 94.02 points.
By the argument above, the sigma of this distribution will be at most 4.484.
This means that the observed score of 103 is more than 2 sigma above the mean,
and we can reject the 2495 hypothesis at approximately the .05 significance
level.  If some of the programs have ratings lower than 2495, then the
probability of scoring 103 or more points is even smaller than in the all-2495
case.  Thus, we reject the composite hypothesis that all the programs have
ratings at or below 2495.  That is the desired conclusion.
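The 2-sigma test in this section reduces to a one-line z-score computation,
using the figures already derived (the observed 103 points, the 94.02 mean
under the all-2495 hypothesis, and the sigma bound 4.484):

```python
observed = 75 + 56 / 2       # 103 points in 154 games
mean_if_2495 = 94.02         # expected score if every program were rated 2495
sigma_bound = 4.484          # upper bound on sigma from the previous section

z = (observed - mean_if_2495) / sigma_bound
# z comes out just above 2, matching the post's roughly-95%-confidence rejection.
```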

It is possible to object to this argument on the grounds that the sigma derived
above is based on sample values of <d> and <w> from Chris' data rather than the
true values from an infinite number of samples.  This objection is not
significant for three reasons:

1) <d> and <w> from Chris' data are good approximations since there is so much
data. E.g., we can estimate sigma(number of wins) =
sqrt(154*(75/154)*(154-75)/154) = 6.2.  From this, sigma(<w>) = .04.

2) The estimates for <d> and <w> can be either high or low (by a small amount,
as seen in 1) above), and the two kinds of variation tend to cancel each other.

3) The values used are just bounds on the true values and the conclusions are
therefore conservative to begin with.
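Reason 1) above can be reproduced with the standard binomial
standard-deviation formula applied to the win count:

```python
import math

N, wins = 154, 75
w = wins / N                              # empirical win fraction <w>
sigma_wins = math.sqrt(N * w * (1 - w))   # binomial sd of the win count
sigma_w = sigma_wins / N                  # sd of the estimate <w>
```

This gives sigma(number of wins) of about 6.2 and sigma(<w>) of about .04,
matching the figures quoted above.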


