Author: Amir Ban
Date: 14:20:33 02/04/01
On February 03, 2001 at 17:49:02, Bruce Moreland wrote:

>On February 01, 2001 at 17:08:36, Amir Ban wrote:
>
>>On January 31, 2001 at 20:17:17, Bruce Moreland wrote:
>>
>>>I expressed very forcefully that a 10-0 result was more valid than a 60-40
>>>result.
>>>
>>>I've done some experimental tests and it appears that I'm wrong.
>>>
>>
>>No, you were right the first time. Check again.
>>
>>10-0 gets better than 99.9% confidence for the winner to be better.
>>
>>60-40 has about 95% confidence.
>>
>>To calculate confidence, you assume the null hypothesis, which is that the
>>result is NOT significant and is a random occurrence between equals. You
>>calculate the probability for that, and subtract from 1 to get confidence.
>>
>>Amir
>
>I've been dealing with a fever for the past two days so I haven't come back to
>this.
>
>I think this stuff is all very important. I have seen endless conclusions about
>computer chess strength, which are based upon intuition and common sense, which
>I think means that they are often wrong.
>
>We're clearly working in the realm of statistics here, but I think that most
>people aren't interested in doing proper statistical analysis.
>
>I want to try to change this, but I admit that I am not qualified. I have some
>math ability, but I haven't taken a statistics course, and I don't know any
>experts on the subject.
>
>I am apt to make a lot of mistakes, but I am happy to continue to try to work
>this out, especially if others think that it is worth figuring out, too.
>
>Here is what prompted me to write the base post. If you have a minuscule Elo
>difference between the two opponents, you'll get 0-10 about 0.10% of the time,
>and you'll get 10-0 about 0.10% of the time. You'll get 40-60 1.08% of the
>time, and 60-40 1.08% of the time.
>
>So if you know that the two programs are the same strength, or very close to the
>same strength, either of these results proves nothing, since the odds of getting
>either purely by chance are the same.
>
>Example:
>
>You change one eval term in your engine, and then run a test. You happen to get
>a score of 10-0, which should be a very rare score. You didn't change the
>strength of the engine by very much, if at all, but you get this weird result.
>Let's assume an Elo delta that's one point. The odds of 0-10 are 0.09%, the
>odds of 10-0 are 0.10%. So even though these are very rare results, we expect
>to get the two extreme results at about the same rate. We happened to get one
>of them. We stood just about the same chance of getting the other.
>
>If you make the programs a little visibly different in strength, things change,
>but they don't change the way I expected. For my first test I arbitrarily
>picked 67/128 = 52% = 16 Elo points.
>
>In this case, you'll get 0-10 0.06% of the time, and 10-0 0.15% of the time.
>You'll get 40-60 0.39% of the time, but you'll get 60-40 2.45% of the time.
>These numbers include 0% draw percentage, and are derived by math, not by
>simulation, but the simulation I did backs these numbers up.
>
>What does this mean?
>
>If you know the two programs are 16 Elo points apart, and you do a match and
>happen to get a 10-0 score, and you declare the winner of the match to be the
>stronger one, you will be correct about 75% of the time.
>
>If you do a match and get a 60-40 score, and you declare the winner of the match
>to be the stronger one, you will be correct about 85% of the time.
>
>If the Elo delta is higher, 60-40 is even more likely to indicate which is
>really stronger.
>
>10-0 is harder to get than 60-40, but if there is a difference between two
>engines, and you get 60-40, it seems that it's more likely to mean the right
>thing than if you get 10-0.
>
>Was I still right the first time?
>
>bruce

Like Uri, you introduce a further assumption, that the Elo difference is bounded by some (small) value. First, this assumption has no basis when running some program A vs. some program B.
Second, if you make what you think is a small change in a program, and get 10-0, then the question is whether you believe in results. Because, as you say, if you assume the Elo difference is 1 point, you have a 0.09% probability that the first version is better and a 0.10% probability that the second version is better, but this still leaves a 99.81% probability that your 1-point assumption was wrong. Given these numbers, what do you choose to believe?

Amir
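
For anyone who wants to check the figures traded in this thread, here is a minimal Python sketch of both calculations. It is not code from either poster. The function names are made up for illustration, draws are ignored (as in Bruce's numbers), and the Elo-to-score conversion is the usual logistic curve. Two assumptions are mine: the null-hypothesis confidence uses a one-sided tail (the post does not say which tail convention was used), and "correct about X% of the time" is read as the chance of the observed score versus its mirror image, with either program equally likely a priori to be the stronger one.

```python
from math import comb


def elo_to_win_prob(elo_delta):
    """Expected score of the stronger side under the standard logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** (-elo_delta / 400.0))


def prob_exact_score(wins, losses, p):
    """Binomial probability of exactly this win/loss split when each game is won
    with probability p (draws ignored)."""
    n = wins + losses
    return comb(n, wins) * p ** wins * (1.0 - p) ** losses


def null_confidence(wins, losses):
    """1 minus the probability that two equal programs produce a score at least
    this lopsided in the winner's favour (one-sided tail: an assumption)."""
    n = wins + losses
    tail = sum(comb(n, k) for k in range(wins, n + 1)) / 2.0 ** n
    return 1.0 - tail


def chance_winner_is_stronger(wins, losses, p):
    """Given that the stronger side wins each game with probability p, and that
    either side is equally likely to be the stronger one (an assumption), the
    chance that the match winner is really the stronger program."""
    right = prob_exact_score(wins, losses, p)
    wrong = prob_exact_score(losses, wins, p)
    return right / (right + wrong)


if __name__ == "__main__":
    p16 = 67 / 128  # the per-game score Bruce picked, roughly 16 Elo points
    print("Elo 16 -> expected score", round(elo_to_win_prob(16), 4))
    for w, l in ((10, 0), (60, 40)):
        print(f"{w}-{l}:",
              "P(exact | equals) =", round(prob_exact_score(w, l, 0.5), 5),
              "null confidence =", round(null_confidence(w, l), 4),
              "P(winner stronger | 67/128) =",
              round(chance_winner_is_stronger(w, l, p16), 3))
```

The two questions give different rankings for 10-0 and 60-40: the null-hypothesis confidence asks how surprising the score is between equals, while the second function conditions on a fixed, known strength difference. With a two-sided tail the 60-40 confidence comes out a little lower than with the one-sided tail used here.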
Last modified: Thu, 15 Apr 21 08:11:13 -0700