Computer Chess Club Archives

Subject: Re: If 75 Games are not considered a Statistical proof, neither is the SSDF.

Author: Bruce Moreland

Date: 11:17:48 01/31/01

On January 31, 2001 at 01:25:33, Uri Blass wrote:

>I disagree that the probability of error in the conclusion is higher in the
>60-40 case.

I don't think that you are allowed to disagree, since my argument has a correct
mathematical basis.

The probability of 40-60 or worse coming up in a fair coin flipping contest is
2.84%.  The probability of 0-10 coming up is 0.098% (1 in 1024).
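
As a quick check of these two tail probabilities, here is a minimal Python sketch. It assumes every game is decisive (no draws) and that each game is an independent fair coin flip; the function name `binom_tail_at_most` is made up for illustration.

```python
from math import comb

def binom_tail_at_most(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p): at most k wins in n decisive games."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Chance that a given side scores 40 or fewer wins out of 100 fair games.
p_40_or_worse = binom_tail_at_most(40, 100)

# Chance that a given side scores 0 wins out of 10 fair games (1/1024).
p_0_of_10 = binom_tail_at_most(0, 10)

print(f"40-60 or worse: {p_40_or_worse:.2%}")   # about 2.84%
print(f"0-10:           {p_0_of_10:.3%}")       # about 0.098%
```

Both figures are one-sided: they are the chance that one named program does this badly, not the chance that either program does.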

>I think that +60 -40 is more significant for programmers than 10-0 result
>because the probability that the weaker side wins 60-40 after you know that
>60-40 happened is smaller than the probability that the weaker side won 10-0
>after you know that 10-0 happened.

This is mathematically false, see above.

>assume that the weaker side has probability of p to win(p<1/2)
>
>The probability of 10-0 result is p^10+(1-p)^10
>The probability of the weaker side to win 10-0 is p^10
>
>Conclusion:the probability of the weaker side to win 10-0 when 10-0 happened
>is p^10/(p^10+(1-p)^10)
>
>The probability of 60-40 result is (p^60*(1-p)^40+p^40*(1-p)^60)*C
>where C=100!/(40!*60!)
>The probability of the weaker side to win 60-40 is p^60*(1-p)^40*C

I believe that what you are doing is defining the probability of a result that's
exactly 40-60.  But you can't do that, you have to account for the possibility
that the result will be worse, too.

If we were to do a test where the result was a real number, we'd have fractional
results, so the choice of a result window one unit wide is really arbitrary, so
I believe you have to include results that are worse than 40-60.

Even so, if two equal programs play, the odds that a particular one will lose
exactly 40-60 are a little over 1%, according to your own formula
(p^60*(1-p)^40*C with p = 1/2).  That agrees with my own result, and it is still
about ten times more likely than an 0-10 result.
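
The exact-score comparison can be reproduced with a few lines of Python. This is only a sketch under the same assumptions as the formula (equal programs, p = 1/2, no draws); the variable names are illustrative.

```python
from math import comb

# Probability that one named program loses exactly 40-60 when p = 1/2:
# C * p^60 * (1-p)^40 with C = 100!/(40!*60!) collapses to C / 2^100.
p_exact_40_60 = comb(100, 40) / 2**100

# Probability that one named program loses 0-10 when p = 1/2.
p_0_10 = 1 / 2**10

ratio = p_exact_40_60 / p_0_10

print(f"exactly 40-60: {p_exact_40_60:.3%}")   # a little over 1%
print(f"0-10:          {p_0_10:.3%}")
print(f"ratio:         {ratio:.1f}")           # roughly eleven to one
```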

>Conclusion the probability of the weaker side to win 60-40 after you know that
>60-40 happened is p^60*(1-p)^40/(p^60*(1-p)^40+p^40*(1-p)^60)=
>p^20/(p^20+(1-p)^20)
>
>I got the last equality by dividing both sides of the equation by p^40*(1-p)^40
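
The quoted simplification can be checked numerically. The sketch below evaluates the formula exactly as written in the quote, which treats "the weaker side won" and "the stronger side won" as the only two explanations and weights them only by the likelihood of the score. The function names and the sample value p = 0.45 are assumptions chosen for illustration.

```python
def share_full(p, wins, losses):
    """Quoted formula before simplification: the weaker side (win prob p)
    is the one that scored `wins` out of wins + losses games."""
    q = 1 - p
    weaker = p**wins * q**losses
    stronger = p**losses * q**wins
    return weaker / (weaker + stronger)

def share_by_margin(p, margin):
    """Simplified form p^n / (p^n + (1-p)^n), where n is the winning margin."""
    q = 1 - p
    return p**margin / (p**margin + q**margin)

p = 0.45  # hypothetical per-game win probability of the weaker side
print(share_full(p, 60, 40), share_by_margin(p, 20))   # the two forms agree
print(share_full(p, 10, 0), share_by_margin(p, 10))    # the 10-0 case
```

Under this formula only the margin matters, so a 20-game margin gives a smaller value than a 10-game margin for any fixed p below 1/2. At p = 1/2 both expressions equal 1/2, since there is no weaker side to identify.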

Assuming the programs are equal, you should just be able to square both
percentages.  That gives a lower chance for 0-10 twice than for 40-60 or worse
twice, which is how I'd do it, as well as a lower chance for 0-10 than for
exactly 40-60, which is how you did it.

Not that any of this matters, since the idea that you have to have a duplicate
sub-run with an opposite result is silly.

>In general we can say that the probability of the weaker side to win by a
>difference of n after you know that the difference is n is
>p^n/(p^n+(1-p)^n)
>
>It means that if you want to know only which program is stronger then the most
>logical test is to play until the difference is n games.

I don't know where you are getting this math but this doesn't make any sense to
me.
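
For what it is worth, the quoted stopping rule (play until one side leads by n) can be simulated and compared with the quoted expression, which has the same form as the standard gambler's-ruin result for a race to a lead of n. The sketch below assumes decisive, independent games and a fixed per-game win probability p for the weaker side; the function names, the seed, and the values p = 0.45 and n = 5 are arbitrary choices for illustration.

```python
import random

def weaker_wins_race(p, lead, rng):
    """Play decisive games until one side leads by `lead`.
    Returns True if the weaker side (win prob p per game) gets there first."""
    diff = 0  # weaker side's wins minus stronger side's wins
    while abs(diff) < lead:
        diff += 1 if rng.random() < p else -1
    return diff == lead

def simulated_share(p, lead, trials=20000, seed=1):
    """Fraction of simulated races that the weaker side wins."""
    rng = random.Random(seed)
    wins = sum(weaker_wins_race(p, lead, rng) for _ in range(trials))
    return wins / trials

def closed_form(p, lead):
    """p^n / (p^n + (1-p)^n), the expression from the quoted text."""
    return p**lead / (p**lead + (1 - p)**lead)

p, lead = 0.45, 5
print(simulated_share(p, lead), closed_form(p, lead))
```

The simulated fraction should land close to the closed-form value (about 0.27 for these inputs), within ordinary sampling noise.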

>I think that the level of confidence here is not important because the word
>level of confidence is misleading.
>
>the % of the cases that you want to get the right decision from the cases that
>you make a decision is not the level of confidence and it is the important
>number.
>
>This number is a function of p and n.
>
>The only case when 10-0 may be more significant is a case when you do not know p
>so you suspect that p is bigger when you see 10-0 result but we need to know the
>a priori distribution of p in order to decide that 10-0 is more significant.

If you start a match and get 10-0 right away, it proves that p is bigger, by any
reasonable standard of proof.

Of course, for those of you who are going to take two identical programs, and
play 10-game matches until you get a 10-0 result, and declare that I'm an idiot
because of course neither program is better than the other one, all you are doing
is rolling dice until you get a rare result, which proves nothing.
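
To put a number on "rolling dice until you get a rare result", here is a small sketch of how quickly a 10-0 sweep turns up if you keep replaying 10-game matches between identical programs. It assumes no draws, independent fair games, and counts a sweep by either side; the function name is made up for illustration.

```python
def p_some_sweep(num_matches, games=10):
    """Chance of seeing at least one clean sweep (by either side) in
    `num_matches` independent matches between two equal programs."""
    p_sweep = 2 * 0.5**games          # either program wins every game
    return 1 - (1 - p_sweep)**num_matches

for k in (1, 100, 400, 1000):
    print(k, f"{p_some_sweep(k):.1%}")
```

A single match produces a sweep about once in 512 tries, but after a few hundred repeated matches a sweep somewhere in the series is more likely than not.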

I haven't done any investigation of the following notion, but I think that doing
this with two programs that are too similar is also nonsense (for example,
trying to figure out if a minor change to a version makes the version better).
I don't know how to quantify that, but clearly it is an issue.

bruce

>Uri
