Computer Chess Club Archives

Search

Terms

Messages

Subject: Re: If 75 Games are not considered a Statistical proof, neither is the SSDF.

Author: Uri Blass

Date: 22:25:33 01/30/01

On January 30, 2001 at 22:18:15, Bruce Moreland wrote:

>On January 30, 2001 at 21:24:13, Robert Hyatt wrote:
>
>>On January 30, 2001 at 17:42:59, Dann Corbit wrote:
>
>>>Additional measurements will not (in general) make the answer less accurate
>>>(unless something is wrong with the measurements).  However, if two programs are
>>>about equal, you will [basically] never determine which is stronger by playing
>>>them against each other.  For anyone who would like to prove this to themselves,
>>>just play a program against itself 10 times, 50 times, 100 times and 1000 times.
>>
>>This is what most miss, statistics-wise.  For someone with the time, it would
>>be interesting to have them use xboard/winboard and play a bunch of 100 game
>>matches between the _same_ version of an engine.  The outcome can be absolutely
>>astounding.  IE out of 100 games, you might get 30 wins, 10 losses, 60 draws
>>by A.  The next time you get 60 wins, 20 losses, 20 draws by A.  You begin
>>to conlcude A is better until you realize A and B are _identical_.  You run
>>the test again and B wins this time.
>
>Give me a match result, and the approximate rate at which the program draws
>games with a particular opponent, and I'll tell you how likely it is that the
>weaker program won the match.
>
>>I ran a _bunch_ of 100 game matches to convince me that no-recapture was better
>>than using recapture.  In my program.  But the first two 100 game matches had
>>me convinced that using recapture was better.  But the next 50 matches had
>>no-recapture winning almost all, although never by more than 10-20 points
>>total (out of 100 total).
>>
>>Pretty interesting stuff.  If I take that first 100 game match, and extract
>>any consecutive N games I choose, I can produce any outcome I would want.
>>
>>IE N-0-0, 0-N-0, 0-0-N, etc.  Which means that a small number of games is
>>hardly more than a crap-shoot.  Which makes you wonder what any tournament
>>winner means at all.  :)
>
>No, this is not true.  Here is an example.  Let's take two equally strong
>programs and discount the possibility of draw results (all games are won or
>lost).  The odds that a particular side will win 10 times in a row are 1 / 2^10.
> That's a relatively small number of games in the match, and the odds of getting
>that result with equal programs are quite low.  But if you do a long enough
>series, you will eventually find 10 wins in a row, even if the programs are the
>same strength.
>
>This does not mean that 10 wins in a row means nothing.  It means that it's
>improbable that the outcome is due to chance.  Obviously that doesn't mean that
>the outcome is impossible, and if you do a long enough run, it's going to happen
>*somewhere in the run*.
>
>If you *start* your run with two unknown programs, and encounter 10 wins in a
>row right at the start, what does that tell you?  It's a very unlikely outcome.
>Not impossible, but unlikely.
>
>You have to make a conclusion.  Does it make sense to say, "This could have been
>chance, I'll throw it out"?  It doesn't make sense.  Of course it could have
>been chance, but if you aren't willing to accept that it is much more likely
>that one of the programs is stronger than the other one, you may as well not
>play matches.
>
>Of course, if you *know* that they are the same strength, because they are the
>same engine, all you are doing is a probability exercise.  You are just flipping
>a coin a bunch of times and going "ooh" if you get a weird outcome.  And if you
>get a lot of insane outcomes your test is probably broken.
>
>The most important point I'd like to make with this, is that any match result
>can be achieved due to chance, assuming some hypothetical Elo delta between the
>two programs.
>
>Once again discounting the possibility of draws, a result of +40 -60 or worse
>for you will come up by chance almost 3% of the time, assuming the two programs
>are of equal strength.
>
>Why does this inspire more confidence than a result of +0 -10, which will come
>up by chance only about 0.1% of the time?
>
>You have two outcomes.  One is very unlikely, one is ridiculously unlikely.  Why
>are people willing to believe that there is a higher probability that the second
>one is due to chance?
>
>The only real gotcha with the +0 -10 result is that some fool might do a bunch
>of 10-game runs until they get that result, or that they'll pick a +0 -10 run
>out of a larger run and claim that it means something, which it doesn't.
>
>If you start two programs playing, and achieve a +0 -10, you've all but proven
>that the second is better than the first.  If you do a fixed 100-game match and
>get +40 -60, the odds of an erroneous conclusion are *much higher*, and so
>should be looked upon with more skepticism.

I disagree that the probability of error in the conclusion is higher in the
60-40 case.

I think that +60 -40 is more significant for programmers than 10-0 result
because the probability that the weaker side wins 60-40 after you know that
60-40 happened is smaller than the probability that the weaker side won 10-0
after you know that 10-0 happened.

assume that the weaker side has probability of p to win(p<1/2)

The probability of 10-0 result is p^10+(1-p)^10
The probability of the weaker side to win 10-0 is p^10

Conclusion:the probability of the weaker side to win 10-0 when 10-0 happened
is p^10/(p^10+(1-p)^10)

The probability of 60-40 result is (p^60*(1-p)^40+P^40*(1-p)^60)*C
When C=100!/(40!*60!)
The probability of the weaker side to win 60-40 is p^60*(1-p)^40*C

Conclusion the probability of the weaker side to win 60-40 after you know that
60-40 happened is p^60*(1-p)^40/(p^60*(1-p)^40+p^40*(1-p)^60)=
p^20/(p^20+(1-p)^20)

I got the last equality by dividing both sides of the equation by p^40*(1-p)^40

In general we can say that the probability of the weaker side to win by a
difference of n after you know that the difference is n is
p^n/(p^n+(1-p)^n)

It means that if you want to know only which program is stronger then the most
logical test is to play until the difference is n games.

I think that the level of confidence here is not important because the word
level of confidence is misleading.

the % of the cases that you want to get the right decision from the cases that
you make a decision is not the level of confidence and it is the important
number.

This number is a function of p and n.

The only case when 10-0 may be more significant is a case when you do not know p
so you suspect that p is bigger when you see 10-0 result but we need to know the
apriori distibution of p in order to decide that 10-0 is more significant.

Uri

Re: If 75 Games are not considered a Statistical proof, neither is the SSDF. Bruce Moreland 11:17:48 01/31/01
- Re: If 75 Games are not considered a Statistical proof, neither is the SSDF. Uri Blass 13:14:25 01/31/01
- Re: If 75 Games are not considered a Statistical proof, neither is the SSDF. Dann Corbit 12:37:21 01/31/01
  - Re: If 75 Games are not considered a Statistical proof, neither is the SSDF. Bruce Moreland 16:29:55 01/31/01
    - Re: If 75 Games are not considered a Statistical proof, neither is the SSDF. Uri Blass 22:43:17 01/31/01
    - Re: If 75 Games are not considered a Statistical proof, neither is the SSDF. Dann Corbit 17:08:28 01/31/01

This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.