Author: Dann Corbit
Date: 19:33:11 01/30/01
On January 30, 2001 at 22:18:15, Bruce Moreland wrote:

>On January 30, 2001 at 21:24:13, Robert Hyatt wrote:
>
>>On January 30, 2001 at 17:42:59, Dann Corbit wrote:
>>
>>>Additional measurements will not (in general) make the answer less accurate
>>>(unless something is wrong with the measurements). However, if two programs are
>>>about equal, you will [basically] never determine which is stronger by playing
>>>them against each other. For anyone who would like to prove this to themselves,
>>>just play a program against itself 10 times, 50 times, 100 times and 1000 times.
>>
>>This is what most miss, statistics-wise. For someone with the time, it would
>>be interesting to have them use xboard/winboard and play a bunch of 100 game
>>matches between the _same_ version of an engine. The outcome can be absolutely
>>astounding. IE out of 100 games, you might get 30 wins, 10 losses, 60 draws
>>by A. The next time you get 60 wins, 20 losses, 20 draws by A. You begin
>>to conclude A is better until you realize A and B are _identical_. You run
>>the test again and B wins this time.
>
>Give me a match result, and the approximate rate at which the program draws
>games with a particular opponent, and I'll tell you how likely it is that the
>weaker program won the match.
>
>>I ran a _bunch_ of 100 game matches to convince me that no-recapture was better
>>than using recapture. In my program. But the first two 100 game matches had
>>me convinced that using recapture was better. But the next 50 matches had
>>no-recapture winning almost all, although never by more than 10-20 points
>>total (out of 100 total).
>>
>>Pretty interesting stuff. If I take that first 100 game match, and extract
>>any consecutive N games I choose, I can produce any outcome I would want.
>>
>>IE N-0-0, 0-N-0, 0-0-N, etc. Which means that a small number of games is
>>hardly more than a crap-shoot. Which makes you wonder what any tournament
>>winner means at all. :)
>
>No, this is not true.
>Here is an example. Let's take two equally strong
>programs and discount the possibility of draw results (all games are won or
>lost). The odds that a particular side will win 10 times in a row are 1 / 2^10.
>That's a relatively small number of games in the match, and the odds of getting
>that result with equal programs are quite low. But if you do a long enough
>series, you will eventually find 10 wins in a row, even if the programs are the
>same strength.
>
>This does not mean that 10 wins in a row means nothing. It means that it's
>improbable that the outcome is due to chance. Obviously that doesn't mean that
>the outcome is impossible, and if you do a long enough run, it's going to happen
>*somewhere in the run*.
>
>If you *start* your run with two unknown programs, and encounter 10 wins in a
>row right at the start, what does that tell you? It's a very unlikely outcome.
>Not impossible, but unlikely.
>
>You have to make a conclusion. Does it make sense to say, "This could have been
>chance, I'll throw it out"? It doesn't make sense. Of course it could have
>been chance, but if you aren't willing to accept that it is much more likely
>that one of the programs is stronger than the other one, you may as well not
>play matches.
>
>Of course, if you *know* that they are the same strength, because they are the
>same engine, all you are doing is a probability exercise. You are just flipping
>a coin a bunch of times and going "ooh" if you get a weird outcome. And if you
>get a lot of insane outcomes your test is probably broken.
>
>The most important point I'd like to make with this, is that any match result
>can be achieved due to chance, assuming some hypothetical Elo delta between the
>two programs.

That is exactly why a large number of runs is needed. What they will give you is a window of probability. In other words, there is a 2/3 chance that the strength of A is x +/- windowA and the strength of B is y +/- windowB.
There is a 95% chance that the strength of A is x +/- windowA2 and the strength of B is y +/- windowB2. As you run more and more tests, the windows get smaller and smaller. If you run a huge number of tests, the windows will be pinholes. If you flip a coin one trillion times and get 47% heads, then you can show statistically that it is NOT a fair coin.

>Once again discounting the possibility of draws, a result of +40 -60 or worse
>for you will come up by chance almost 3% of the time, assuming the two programs
>are of equal strength.
>
>Why does this inspire more confidence than a result of +0 -10, which will come
>up by chance only about 0.1% of the time?
>
>You have two outcomes. One is very unlikely, one is ridiculously unlikely. Why
>are people willing to believe that there is a higher probability that the second
>one is due to chance?

For two series of ten measurements, you will clearly get different results. Keeping it simple with the fair coin, we will have a red penny and a black penny and flip each twice. Here are the possible outcomes for each penny:

f1,f2
-- --
RH,RH
RH,RT
RT,RH
RT,RT
BH,BH
BH,BT
BT,BH
BT,BT

Each penny has 4 possible results, so together there are 4 x 4 = 16 equally likely combinations. We have one way to get 4 heads, one way to get 4 tails, and 14 ways to get something else. We therefore have the astonishing odds of 1/8 for a whitewash (for one side or the other) with a completely even match. Now, expand the experiment to include one trillion flips and we still have only two different ways to achieve a whitewash, but as you can see, the probability [just by counting] of a whitewash has already become vanishingly small by this time.

>The only real gotcha with the +0 -10 result is that some fool might do a bunch
>of 10-game runs until they get that result, or that they'll pick a +0 -10 run
>out of a larger run and claim that it means something, which it doesn't.
>
>If you start two programs playing, and achieve a +0 -10, you've all but proven
>that the second is better than the first.
The SSDF has had results that started out like this and the "worst" program won!

>If you do a fixed 100-game match and
>get +40 -60, the odds of an erroneous conclusion are *much higher*, and so
>should be looked upon with more skepticism.

Back to class. ;-)

>>>The figure *should* [obviously] hover around 50% points scored for each side.
>>>It is very unlikely that the ten game match will be close to 50%. The 100 game
>>>match will probably be fairly close. It is rather unlikely that the 1000 game
>>>match will be far from 50%, but it is very unlikely it will be exactly 50%. In
>>>fact, if it should be exactly 50%, the Chi-Squared Test will reject it! It
>>>throws out both things that don't seem to fit the model and also things that fit
>>>so perfectly something looks fishy.
>>
>>
>>I think that 1000 game match might well be way off from 50%. IE it is not
>>unlikely that one side will pull ahead a significant amount, and then they
>>start playing equally for the rest of the match. But there is nothing in
>>statistics that says after you flip a coin and get 100 consecutive heads,
>>that sometime later you will get 100 consecutive tails to offset them. It
>>is more likely that this series will simply end with heads being ahead 100
>>counts...
>
>In a 1000 game match, the odds that your program will score between 462
>and 538 are over 98%, assuming that it's exactly as strong as its opponent, and
>assuming that draws are not possible.
>
>With a draw percentage of 35%, you have a 98% chance of being between 470 and
>530. About half the time you'll be between 491 and 509.
>
>With 100 games and 35% draws, about 98% of the time you'll be somewhere between
>41 and 59, and half the time you'll be between 47 and 53.
>
>So you have quite a bit less variation with 1000 than you do with 100.

Now here we agree. I don't understand why you think fewer trials give better data, since you seem to understand what is going on here.
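
For anyone who wants to check the figures quoted in this exchange (the "+40 -60 or worse" and "+0 -10" chances, and the score ranges for 100 and 1000 games), here is a minimal Python sketch. It rests on assumptions that are not spelled out in the thread: every game is independent, the two programs are exactly equal so wins and losses are equally likely, and the draw rate is the same in every game. The function names are hypothetical, chosen only for illustration.

```python
from math import comb


def binom_tail_at_most(n, k):
    """Chance that an equal program wins at most k of n decisive games.

    This is P(X <= k) for X ~ Binomial(n, 1/2), i.e. no draws allowed.
    """
    return sum(comb(n, i) for i in range(k + 1)) / 2 ** n


def score_distribution(games, draw_rate):
    """Exact distribution of one side's total score, counted in half-points.

    Assumes equal programs: win and loss each have probability
    (1 - draw_rate) / 2. Entry i of the result is P(score == i / 2).
    """
    p_win = p_loss = (1.0 - draw_rate) / 2.0
    dist = [1.0]  # before any game: score 0 with certainty
    for _ in range(games):
        nxt = [0.0] * (len(dist) + 2)
        for half_points, p in enumerate(dist):
            nxt[half_points] += p * p_loss          # loss: +0
            nxt[half_points + 1] += p * draw_rate   # draw: +1/2
            nxt[half_points + 2] += p * p_win       # win:  +1
        dist = nxt
    return dist


def prob_score_between(games, draw_rate, low, high):
    """Chance that the score lands in [low, high] points, inclusive."""
    dist = score_distribution(games, draw_rate)
    return sum(dist[2 * low: 2 * high + 1])


if __name__ == "__main__":
    print("+0 -10:", binom_tail_at_most(10, 0))
    print("+40 -60 or worse:", binom_tail_at_most(100, 40))
    print("1000 games, no draws, 462..538:",
          prob_score_between(1000, 0.0, 462, 538))
    print("1000 games, 35% draws, 470..530:",
          prob_score_between(1000, 0.35, 470, 530))
    print("100 games, 35% draws, 41..59:",
          prob_score_between(100, 0.35, 41, 59))
```

The score distribution is built one game at a time instead of using a normal approximation, so the results are exact under the stated assumptions. The 1000-game cases take a couple of million additions, which is still quick.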
Last modified: Thu, 15 Apr 21 08:11:13 -0700