Author: Bruce Moreland
Date: 16:23:33 02/04/01
On February 04, 2001 at 17:20:33, Amir Ban wrote:

>On February 03, 2001 at 17:49:02, Bruce Moreland wrote:
>
>>On February 01, 2001 at 17:08:36, Amir Ban wrote:
>>
>>>On January 31, 2001 at 20:17:17, Bruce Moreland wrote:
>>>
>>>>I expressed very forcefully that a 10-0 result was more valid than a 60-40
>>>>result.
>>>>
>>>>I've done some experimental tests and it appears that I'm wrong.
>>>>
>>>
>>>No, you were right the first time. Check again.
>>>
>>>10-0 gets better than 99.9% confidence for the winner to be better.
>>>
>>>60-40 has about 95% confidence.
>>>
>>>To calculate confidence, you assume the null hypothesis, which is that the
>>>result is NOT significant and is a random occurrence between equals. You
>>>calculate the probability for that, and subtract from 1 to get confidence.
>>>
>>>Amir
>>
>>I've been dealing with a fever for the past two days, so I haven't come back
>>to this.
>>
>>I think this stuff is all very important. I have seen endless conclusions
>>about computer chess strength which are based upon intuition and common
>>sense, which I think means that they are often wrong.
>>
>>We're clearly working in the realm of statistics here, but I think that most
>>people aren't interested in doing proper statistical analysis.
>>
>>I want to try to change this, but I admit that I am not qualified. I have
>>some math ability, but I haven't taken a statistics course, and I don't know
>>any experts on the subject.
>>
>>I am apt to make a lot of mistakes, but I am happy to continue to try to
>>work this out, especially if others think that it is worth figuring out,
>>too.
>>
>>Here is what prompted me to write the base post. If you have a minuscule
>>Elo difference between the two opponents, you'll get 0-10 about 0.10% of the
>>time, and you'll get 10-0 about 0.10% of the time. You'll get 40-60 1.08% of
>>the time, and 60-40 1.08% of the time.
>>
>>So if you know that the two programs are the same strength, or very close to
>>the same strength, either of these results proves nothing, since the odds of
>>getting either purely by chance are the same.
>>
>>Example:
>>
>>You change one eval term in your engine, and then run a test. You happen to
>>get a score of 10-0, which should be a very rare score. You didn't change
>>the strength of the engine by very much, if at all, but you get this weird
>>result. Let's assume an Elo delta of one point. The odds of 0-10 are 0.09%,
>>the odds of 10-0 are 0.10%. So even though these are very rare results, we
>>expect to get the two extreme results at about the same rate. We happened to
>>get one of them. We stood just about the same chance of getting the other.
>>
>>If you make the programs a little visibly different in strength, things
>>change, but they don't change the way I expected. For my first test I
>>arbitrarily picked 67/128 = 52% = 16 Elo points.
>>
>>In this case, you'll get 0-10 0.06% of the time, and 10-0 0.15% of the time.
>>You'll get 40-60 0.39% of the time, but you'll get 60-40 2.45% of the time.
>>These numbers assume a 0% draw percentage, and are derived by math, not by
>>simulation, but the simulation I did backs these numbers up.
>>
>>What does this mean?
>>
>>If you know the two programs are 16 Elo points apart, and you do a match and
>>happen to get a 10-0 score, and you declare the winner of the match to be
>>the stronger one, you will be correct about 75% of the time.
>>
>>If you do a match and get a 60-40 score, and you declare the winner of the
>>match to be the stronger one, you will be correct about 85% of the time.
>>
>>If the Elo delta is higher, 60-40 is even more likely to indicate which is
>>really stronger.
>>
>>10-0 is harder to get than 60-40, but if there is a difference between two
>>engines and you get 60-40, it seems that it's more likely to mean the right
>>thing than if you get 10-0.
>>
>>Was I still right the first time?
>>
>>bruce
>
>Like Uri, you introduce a further assumption: that the Elo difference is
>bounded by some (small) value. First, this assumption has no basis when
>running some program A vs. some program B.
>
>Second, if you make what you think is a small change in a program, and get
>10-0, then the question is whether you believe in results. Because, as you
>say, if you assume the Elo difference is 1 point, you have a 0.09%
>probability of the first version being better and a 0.10% probability of the
>second version being better, but this still leaves a 99.81% probability that
>your 1-point assumption was wrong.
>
>Given these numbers, what do you choose to believe?
>
>Amir

If that really happened? I would suspect my test setup first. There's no way
I'd simply accept the result and declare that this proves that one is at
least a little better than the other.

If I got this result between dissimilar programs, I'd believe it. If I knew
they were close, I don't see how I could believe it, but I don't have much
mathematical basis for this statement.

I wouldn't try to test a one-line eval change, of course.

bruce
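The match-result probabilities quoted above all follow from the binomial
distribution, given the standard Elo expectancy E = 1 / (1 + 10^(-delta/400)).
Here is a minimal Python sketch that reproduces them, assuming decisive games
only as the post does (this is a reconstruction, not code from the thread):

from math import comb  # Python 3.8+

def expected_score(elo_delta):
    # Standard Elo expectancy: probability that the stronger side wins a
    # single game, assuming no draws.
    return 1.0 / (1.0 + 10.0 ** (-elo_delta / 400.0))

def p_exact_result(wins, games, p_win):
    # Binomial probability of exactly `wins` wins in `games` games.
    return comb(games, wins) * p_win ** wins * (1.0 - p_win) ** (games - wins)

for delta in (1, 16):
    p = expected_score(delta)
    print(f"Elo delta {delta:2d} (single-game win probability {p:.4f})")
    print(f"  10-0 : {p_exact_result(10, 10, p):.2%}")
    print(f"  0-10 : {p_exact_result(0, 10, p):.2%}")
    print(f"  60-40: {p_exact_result(60, 100, p):.2%}")
    print(f"  40-60: {p_exact_result(40, 100, p):.2%}")

For delta = 1 this gives roughly 0.10% and 0.09% for the two 10-0 results and
about 1.08% for 60-40; for delta = 16 it gives about 0.15%, 0.06%, 2.45%, and
0.39%, matching the figures quoted in the post.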
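The "correct about 75% / 85% of the time" figures can be reconstructed with
Bayes' rule, assuming, as the post does, that the two programs are known to
be exactly 16 Elo points apart, with even prior odds on which one is the
stronger. A sketch under those assumptions (again my reconstruction, not
necessarily the poster's actual method):

from math import comb  # Python 3.8+

def expected_score(elo_delta):
    return 1.0 / (1.0 + 10.0 ** (-elo_delta / 400.0))

def p_exact_result(wins, games, p_win):
    return comb(games, wins) * p_win ** wins * (1.0 - p_win) ** (games - wins)

def p_winner_is_stronger(wins, games, elo_delta):
    # Bayes' rule with a 50/50 prior on which program is the stronger one:
    # the posterior that the match winner really is stronger is the
    # likelihood of the result when the stronger side scores `wins`, divided
    # by the total likelihood of that result either way.
    p = expected_score(elo_delta)
    like_stronger = p_exact_result(wins, games, p)      # stronger side won the match
    like_weaker = p_exact_result(wins, games, 1.0 - p)  # weaker side won the match
    return like_stronger / (like_stronger + like_weaker)

print(f"10-0  -> winner is stronger with probability {p_winner_is_stronger(10, 10, 16):.1%}")
print(f"60-40 -> winner is stronger with probability {p_winner_is_stronger(60, 100, 16):.1%}")

This gives roughly 72% for 10-0 and 86% for 60-40, close to the post's 75%
and 85%, and it supports the conclusion that a 60-40 result identifies the
stronger program more reliably than 10-0 does.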