Author: Uri Blass
Date: 22:25:33 01/30/01
Go up one level in this thread
On January 30, 2001 at 22:18:15, Bruce Moreland wrote: >On January 30, 2001 at 21:24:13, Robert Hyatt wrote: > >>On January 30, 2001 at 17:42:59, Dann Corbit wrote: > >>>Additional measurements will not (in general) make the answer less accurate >>>(unless something is wrong with the measurements). However, if two programs are >>>about equal, you will [basically] never determine which is stronger by playing >>>them against each other. For anyone who would like to prove this to themselves, >>>just play a program against itself 10 times, 50 times, 100 times and 1000 times. >> >>This is what most miss, statistics-wise. For someone with the time, it would >>be interesting to have them use xboard/winboard and play a bunch of 100 game >>matches between the _same_ version of an engine. The outcome can be absolutely >>astounding. IE out of 100 games, you might get 30 wins, 10 losses, 60 draws >>by A. The next time you get 60 wins, 20 losses, 20 draws by A. You begin >>to conlcude A is better until you realize A and B are _identical_. You run >>the test again and B wins this time. > >Give me a match result, and the approximate rate at which the program draws >games with a particular opponent, and I'll tell you how likely it is that the >weaker program won the match. > >>I ran a _bunch_ of 100 game matches to convince me that no-recapture was better >>than using recapture. In my program. But the first two 100 game matches had >>me convinced that using recapture was better. But the next 50 matches had >>no-recapture winning almost all, although never by more than 10-20 points >>total (out of 100 total). >> >>Pretty interesting stuff. If I take that first 100 game match, and extract >>any consecutive N games I choose, I can produce any outcome I would want. >> >>IE N-0-0, 0-N-0, 0-0-N, etc. Which means that a small number of games is >>hardly more than a crap-shoot. Which makes you wonder what any tournament >>winner means at all. :) > >No, this is not true. Here is an example. Let's take two equally strong >programs and discount the possibility of draw results (all games are won or >lost). The odds that a particular side will win 10 times in a row are 1 / 2^10. > That's a relatively small number of games in the match, and the odds of getting >that result with equal programs are quite low. But if you do a long enough >series, you will eventually find 10 wins in a row, even if the programs are the >same strength. > >This does not mean that 10 wins in a row means nothing. It means that it's >improbable that the outcome is due to chance. Obviously that doesn't mean that >the outcome is impossible, and if you do a long enough run, it's going to happen >*somewhere in the run*. > >If you *start* your run with two unknown programs, and encounter 10 wins in a >row right at the start, what does that tell you? It's a very unlikely outcome. >Not impossible, but unlikely. > >You have to make a conclusion. Does it make sense to say, "This could have been >chance, I'll throw it out"? It doesn't make sense. Of course it could have >been chance, but if you aren't willing to accept that it is much more likely >that one of the programs is stronger than the other one, you may as well not >play matches. > >Of course, if you *know* that they are the same strength, because they are the >same engine, all you are doing is a probability exercise. You are just flipping >a coin a bunch of times and going "ooh" if you get a weird outcome. And if you >get a lot of insane outcomes your test is probably broken. > >The most important point I'd like to make with this, is that any match result >can be achieved due to chance, assuming some hypothetical Elo delta between the >two programs. > >Once again discounting the possibility of draws, a result of +40 -60 or worse >for you will come up by chance almost 3% of the time, assuming the two programs >are of equal strength. > >Why does this inspire more confidence than a result of +0 -10, which will come >up by chance only about 0.1% of the time? > >You have two outcomes. One is very unlikely, one is ridiculously unlikely. Why >are people willing to believe that there is a higher probability that the second >one is due to chance? > >The only real gotcha with the +0 -10 result is that some fool might do a bunch >of 10-game runs until they get that result, or that they'll pick a +0 -10 run >out of a larger run and claim that it means something, which it doesn't. > >If you start two programs playing, and achieve a +0 -10, you've all but proven >that the second is better than the first. If you do a fixed 100-game match and >get +40 -60, the odds of an erroneous conclusion are *much higher*, and so >should be looked upon with more skepticism. I disagree that the probability of error in the conclusion is higher in the 60-40 case. I think that +60 -40 is more significant for programmers than 10-0 result because the probability that the weaker side wins 60-40 after you know that 60-40 happened is smaller than the probability that the weaker side won 10-0 after you know that 10-0 happened. assume that the weaker side has probability of p to win(p<1/2) The probability of 10-0 result is p^10+(1-p)^10 The probability of the weaker side to win 10-0 is p^10 Conclusion:the probability of the weaker side to win 10-0 when 10-0 happened is p^10/(p^10+(1-p)^10) The probability of 60-40 result is (p^60*(1-p)^40+P^40*(1-p)^60)*C When C=100!/(40!*60!) The probability of the weaker side to win 60-40 is p^60*(1-p)^40*C Conclusion the probability of the weaker side to win 60-40 after you know that 60-40 happened is p^60*(1-p)^40/(p^60*(1-p)^40+p^40*(1-p)^60)= p^20/(p^20+(1-p)^20) I got the last equality by dividing both sides of the equation by p^40*(1-p)^40 In general we can say that the probability of the weaker side to win by a difference of n after you know that the difference is n is p^n/(p^n+(1-p)^n) It means that if you want to know only which program is stronger then the most logical test is to play until the difference is n games. I think that the level of confidence here is not important because the word level of confidence is misleading. the % of the cases that you want to get the right decision from the cases that you make a decision is not the level of confidence and it is the important number. This number is a function of p and n. The only case when 10-0 may be more significant is a case when you do not know p so you suspect that p is bigger when you see 10-0 result but we need to know the apriori distibution of p in order to decide that 10-0 is more significant. Uri
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.