Author: Walter Koroljow
Date: 09:49:31 07/26/00
It is possible to infer something quantitative from the data that Chris has accumulated. Based on some plausible assumptions, I make the following claim:

*******************************************************
***Claim: Chris Carson's data implies that some program has a rating of 2496 or more, with approximately 95% confidence.***
*******************************************************

If the results were produced by a single program, its TPR would be 2544. This gives us some idea of how many rating points we have to give up to have 19-to-1 odds in our favor. To be fair, we should also mention that we could as well infer that there is a program with a rating at or below 2602, with 95% confidence!

The result above is based solely on the data and is therefore most useful only if the situations which produced the data do not change. On the other hand, such considerations may be used, just like changing ratings, to track the progress of, say, GMs perhaps learning successful anti-computer tactics and of programmers fending them off.

Anyway, I am interested in what, if anything, CCC makes of such a result. The rest of this post just derives the claim above. It is divided into three sections for easier reading.

***Assumptions***
********************************************************

1) Chris' TPR data posted 7-16-00. I used all the data after May of '97. This excluded three things: two DB matches and 9 games by MCP in 1990 on a 486-25. All the other data except for Zugzwang was for single or multiple Pentium, K-6, or K-7 processors. I do not know what Zugzwang ran on, so I left it in. The data used consisted of 154 games with 75 wins, 56 draws, and 23 losses.

2) The USCF win expectancy formula: We = 1/(1 + 10**((delta rating)/400)), where delta rating = (opponent's rating) - (our rating).

3) All chess games are statistically independent events.

4) To save words, when I say "program" I mean a combination of hardware and software, e.g., Hiarcs 6 running on a Pmmx-200.
5) An event such as DJ6 at Dortmund is treated as DJ6 playing 9 games against 9 players, all of whom were rated at the average Dortmund rating of 2702.

6) Some mild approximations are used in the arguments that follow.

**Standard Deviation (Sigma) and Distribution of Score**
********************************************************

In what follows, let:

w = win probability,
d = draw probability,
v1 = variance of score after one game,
N = number of games,
V = variance of score after N games,
<x> = average of x over the N games (x is anything),
Sigma = sqrt(V).

Then we can compute:

v1 = w + d/4 - (w + d/2)**2,

where w and d apply to a single game. Then, since chess games are independent events,

V = N*<v1> = N*(<w> + <d>/4 - <(w + d/2)**2>).

But <x**2> >= <x>**2 for all real x. Therefore:

V/N <= <w> + <d>/4 - (<w> + <d>/2)**2.

Or, to put it another way:

(1) V <= N*(<w> + <d>/4 - (<w> + <d>/2)**2).

We can thus bound the variance using expression (1) above. For the average w and d, we use the empirical values from Chris' data (see remarks at the end of the "Minimum Guaranteed Rating" section). Namely, <w> = 75/154 = .487 and <d> = 56/154 = .3636. Incidentally, the win expectancy is (75 + 56/2)/154 = .669. With these values for <d> and <w>, we have: V <= 20.110 and Sigma <= 4.484.

The total score is the sum of the scores of the games. By the central limit theorem, the distribution of a sum of identically distributed variables tends to a normal distribution as the number of games increases. We know from experience that this works well even if the distributions are only similar. We also know that for a sum of binomially distributed variables (chess without draws) the distribution is very close to normal for only 15 games or so. Therefore we expect the distribution to be extremely close to normal after a full 154 games.

**Minimum "Guaranteed" Rating**
********************************

In this section we reject the hypothesis that all the programs in Chris' data have a rating at or below 2495.
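Before continuing, the variance bound (1) and the resulting Sigma can be checked numerically. This is just a sketch of the arithmetic from the Standard Deviation section, plugging in the empirical <w> and <d> from Chris' 154-game data set:

```python
# Numerical check of bound (1): V <= N*(<w> + <d>/4 - (<w> + <d>/2)**2),
# using the empirical averages from Chris' data (154 games, 75 wins, 56 draws).
N = 154
w = 75 / N           # <w>, about .487
d = 56 / N           # <d>, about .3636

# Per-game variance bound from expression (1)
v1_bound = w + d / 4 - (w + d / 2) ** 2
V_bound = N * v1_bound        # bound on the variance of the total score
sigma_bound = V_bound ** 0.5  # bound on Sigma

score_expectancy = (75 + 56 / 2) / N  # the observed win expectancy

print(round(V_bound, 3))       # about 20.110, as in the text
print(round(sigma_bound, 3))   # about 4.484
print(round(score_expectancy, 3))
```

This reproduces V <= 20.110 and Sigma <= 4.484 from the section above.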
This is equivalent to accepting the hypothesis that there exists a program that has a rating above 2495. This is done at the 2-sigma level for a distribution that is very close to normal. Thus, this is a statement made at approximately the 95% confidence level.

Using the USCF win expectancy formula and Chris' data, if all the programs have a rating of 2495, the score distribution will have a mean of 94.02 points. By the argument above, the sigma of this distribution will be at most 4.484. This means that the observed score of 103 is more than 2 sigma from the mean, and we can reject the 2495 hypothesis at approximately the .05 significance level. If some of the programs have ratings lower than 2495, then the probability of scoring 103 or more points must be even less than for the case of all-2495 programs. Thus, we reject the composite hypothesis that all the programs have a rating of 2495 or that some or all have ratings below 2495. That is the desired conclusion.

It is possible to object to this argument on the grounds that the sigma derived above is based on sample values of <d> and <w> from Chris' data rather than the true values from an infinite number of samples. This objection is not significant, for three reasons:

1) <d> and <w> from Chris' data are good approximations, since there is so much data. E.g., we can estimate sigma(number of wins) = sqrt(154*(75/154)*((154-75)/154)) = 6.2. From this, sigma(<w>) = 6.2/154 = .04.

2) The estimates for <d> and <w> can be either high or low (by a small amount, as seen in 1) above), and the two kinds of variation tend to cancel each other.

3) The values used are just bounds on the true values, and the conclusions are therefore conservative to begin with.
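The rejection argument above can be sketched as a short calculation. The mean of 94.02 points is taken as given from the text (computing it from scratch would require the individual opponent ratings, which are not reproduced here), and the sigma bound of 4.484 comes from the variance section:

```python
# Sketch of the 2-sigma rejection of the all-2495 hypothesis.

def win_expectancy(our_rating, opp_rating):
    """USCF win expectancy from assumption 2):
    We = 1/(1 + 10**((opp - ours)/400))."""
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - our_rating) / 400.0))

# Example of the formula: a 2495-rated program facing the 2702
# Dortmund average scores about .233 per game.
we = win_expectancy(2495, 2702)

observed = 75 + 56 / 2   # 103 points scored in the 154 games
mean_2495 = 94.02        # expected score if every program were rated 2495
sigma_bound = 4.484      # upper bound on sigma from the variance section

z = (observed - mean_2495) / sigma_bound
print(round(z, 2))       # just over 2, so the hypothesis is rejected
                         # at roughly the .05 significance level
```

Since z exceeds 2, the observed 103 points lies more than 2 sigma above the all-2495 mean, which is the rejection stated above.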