Author: Bruce Moreland
Date: 14:49:02 02/03/01
On February 01, 2001 at 17:08:36, Amir Ban wrote:

>On January 31, 2001 at 20:17:17, Bruce Moreland wrote:
>
>>I expressed very forcefully that a 10-0 result was more valid than a 60-40
>>result.
>>
>>I've done some experimental tests and it appears that I'm wrong.
>
>No, you were right the first time. Check again.
>
>10-0 gets better than 99.9% confidence for the winner to be better.
>
>60-40 has about 95% confidence.
>
>To calculate confidence, you assume the null hypothesis, which is that the
>result is NOT significant and is a random occurrence between equals. You
>calculate the probability for that, and subtract from 1 to get confidence.
>
>Amir

I've been dealing with a fever for the past two days, so I haven't come back to this.

I think this stuff is all very important. I have seen endless conclusions about computer chess strength based upon intuition and common sense, which I think means that they are often wrong. We're clearly working in the realm of statistics here, but I think that most people aren't interested in doing proper statistical analysis.

I want to try to change this, but I admit that I am not qualified. I have some math ability, but I haven't taken a statistics course, and I don't know any experts on the subject. I am apt to make a lot of mistakes, but I am happy to continue trying to work this out, especially if others think that it is worth figuring out, too.

Here is what prompted me to write the base post. If there is a minuscule Elo difference between the two opponents, you'll get 0-10 about 0.10% of the time, and you'll get 10-0 about 0.10% of the time. You'll get 40-60 1.08% of the time, and 60-40 1.08% of the time. So if you know that the two programs are the same strength, or very close to it, either of these results proves nothing, since the odds of getting either purely by chance are the same.

Example: You change one eval term in your engine, and then run a test.
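The equal-strength probabilities quoted above are straightforward to check with a binomial model (no draws, per-game win probability 1/2). This is a sketch of my own, not from the post; `match_prob` is a name I made up:

```python
from math import comb

def match_prob(p, wins, losses):
    """Probability of an exact win/loss match score, given per-game
    win probability p and no draws (a single binomial term)."""
    n = wins + losses
    return comb(n, wins) * p**wins * (1 - p)**losses

p = 0.5  # two engines of essentially equal strength
print(f"10-0:  {match_prob(p, 10, 0):.2%}")   # ~0.10%
print(f"0-10:  {match_prob(p, 0, 10):.2%}")   # ~0.10%
print(f"60-40: {match_prob(p, 60, 40):.2%}")  # ~1.08%
print(f"40-60: {match_prob(p, 40, 60):.2%}")  # ~1.08%
```

By symmetry the two extreme scores (and likewise 60-40 and 40-60) come out identical, which is exactly the point being made: between equals, either result is equally likely by pure chance.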
You happen to get a score of 10-0, which should be a very rare score. You didn't change the strength of the engine by very much, if at all, but you get this weird result. Let's assume an Elo delta of one point. The odds of 0-10 are 0.09%, and the odds of 10-0 are 0.10%. So even though these are very rare results, we expect to get the two extreme results at about the same rate. We happened to get one of them; we stood just about the same chance of getting the other.

If you make the programs visibly different in strength, things change, but they don't change the way I expected. For my first test I arbitrarily picked a win probability of 67/128 = 52% = 16 Elo points. In this case, you'll get 0-10 0.06% of the time, and 10-0 0.15% of the time. You'll get 40-60 0.39% of the time, but you'll get 60-40 2.45% of the time. These numbers assume a 0% draw percentage, and are derived by math, not by simulation, but the simulation I did backs them up.

What does this mean? If you know the two programs are 16 Elo points apart, you play a match, and you happen to get a 10-0 score, and you declare the winner of the match to be the stronger one, you will be correct about 75% of the time. If you get a 60-40 score and declare the winner to be the stronger one, you will be correct about 85% of the time. If the Elo delta is higher, 60-40 is even more likely to indicate which program is really stronger.

10-0 is harder to get than 60-40, but if there is a difference between two engines and you get 60-40, it seems more likely to mean the right thing than if you get 10-0.

Was I still right the first time?

bruce
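The 16-Elo argument can be sketched the same way, under the same no-draws binomial model, with a 50/50 prior over which engine is the stronger one. The function names here are mine, not from the post; running this I get roughly 72% and 87%, in the same ballpark as the 75% and 85% quoted above:

```python
from math import comb

def match_prob(p, wins, losses):
    """Probability of an exact win/loss match score, given per-game
    win probability p and no draws (a single binomial term)."""
    n = wins + losses
    return comb(n, wins) * p**wins * (1 - p)**losses

def elo_to_winprob(elo):
    """Expected score for an `elo`-point advantage (logistic Elo model)."""
    return 1 / (1 + 10 ** (-elo / 400))

p = 67 / 128          # ~52.3%, roughly a 16-Elo edge
q = 1 - p

def winner_is_stronger(wins, losses):
    """With a 50/50 prior over which engine is stronger, Bayes' rule gives
    the chance that the match winner really is the stronger engine."""
    a = match_prob(p, wins, losses)  # this score, produced by the stronger engine
    b = match_prob(q, wins, losses)  # same score, produced by the weaker engine
    return a / (a + b)

print(f"P(winner is stronger | 10-0)  = {winner_is_stronger(10, 0):.1%}")
print(f"P(winner is stronger | 60-40) = {winner_is_stronger(60, 40):.1%}")
```

This supports the conclusion: even though 10-0 is the rarer score, a 60-40 result is better evidence about which engine is actually stronger.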