Computer Chess Club Archives


Subject: Re: I'm wrong about 10-0 vs 60-40

Author: Amir Ban

Date: 14:20:33 02/04/01


On February 03, 2001 at 17:49:02, Bruce Moreland wrote:

>On February 01, 2001 at 17:08:36, Amir Ban wrote:
>
>>On January 31, 2001 at 20:17:17, Bruce Moreland wrote:
>>
>>>I expressed very forcefully that a 10-0 result was more valid than a 60-40
>>>result.
>>>
>>>I've done some experimental tests and it appears that I'm wrong.
>>>
>>
>>No, you were right the first time. Check again.
>>
>>10-0 gets better than 99.9% confidence for the winner to be better.
>>
>>60-40 has about 95% confidence.
>>
>>To calculate confidence, you assume the null hypothesis, which is that the
>>result is NOT significant and is a random occurrence between equals. You
>>calculate the probability for that, and subtract from 1 to get confidence.
>>
>>Amir
>
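The confidence calculation quoted above (assume the null hypothesis of equal opponents, compute the probability of the result, subtract from 1) can be sketched as follows. This is only an illustration, not necessarily the exact calculation behind the quoted figures: it assumes no draws, an exact binomial model, and a one-sided tail, and the function names are made up for the example. Under those assumptions 10-0 comes out just above 99.9%, and 60-40 comes out at roughly 97% one-sided (roughly 94% if both tails are counted), which brackets the quoted "about 95%".

```python
from math import comb

def tail_probability(wins: int, games: int, p: float = 0.5) -> float:
    """P(X >= wins) for X ~ Binomial(games, p). Draws are ignored (assumption)."""
    return sum(comb(games, k) * p**k * (1 - p)**(games - k)
               for k in range(wins, games + 1))

def confidence(wins: int, games: int) -> float:
    """One-sided confidence: 1 minus the chance that two equal opponents
    produce a score at least this lopsided in favour of the named winner."""
    return 1.0 - tail_probability(wins, games, 0.5)

def confidence_two_sided(wins: int, games: int) -> float:
    """Same idea, but counting an equally lopsided result for either side."""
    return 1.0 - 2.0 * tail_probability(wins, games, 0.5)

print(f"10-0  one-sided: {confidence(10, 10):.4%}")
print(f"60-40 one-sided: {confidence(60, 100):.4%}")
print(f"60-40 two-sided: {confidence_two_sided(60, 100):.4%}")
```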
>I've been dealing with a fever for the past two days so I haven't come back to
>this.
>
>I think this stuff is all very important.  I have seen endless conclusions about
>computer chess strength, which are based upon intuition and common sense, which
>I think means that they are often wrong.
>
>We're clearly working in the realm of statistics here, but I think that most
>people aren't interested in doing proper statistical analysis.
>
>I want to try to change this, but I admit that I am not qualified. I have some
>math ability, but I haven't taken a statistics course, and I don't know any
>experts on the subject.
>
>I am apt to make a lot of mistakes, but I am happy to continue to try to work
>this out, especially if others think that it is worth figuring out, too.
>
>Here is what prompted me to write the base post.  If you have a minuscule Elo
>difference between the two opponents, you'll get 0-10 about 0.10% of the time,
>and you'll get 10-0 about 0.10% of the time.  You'll get 40-60 1.08% of the
>time, and 60-40 1.08% of the time.
>
>So if you know that the two programs are the same strength, or very close to the
>same strength, either of these results proves nothing, since the odds of getting
>either purely by chance are the same.
>
>Example:
>
>You change one eval term in your engine, and then run a test.  You happen to get
>a score of 10-0, which should be a very rare score.  You didn't change the
>strength of the engine by very much, if at all, but you get this weird result.
>Let's assume an Elo delta that's one point.  The odds of 0-10 are 0.09%, the
>odds of 10-0 are 0.10%.  So even though these are very rare results, we expect
>to get the two extreme results at about the same rate.  We happened to get one
>of them.  We stood just about the same chance of getting the other.
>
>If you make the programs a little visibly different in strength, things change,
>but they don't change the way I expected.  For my first test I arbitrarily
>picked 67/128 = 52% = 16 Elo points.
>
>In this case, you'll get 0-10 0.06% of the time, and 10-0 0.15% of the time.
>You'll get 40-60 0.39% of the time, but you'll get 60-40 2.45% of the time.
>These numbers assume a 0% draw rate, and are derived by math, not by
>simulation, but the simulation I did backs these numbers up.
>
>What does this mean?
>
>If you know the two programs are 16 Elo points apart, and you do a match and
>happen to get a 10-0 score, and you declare the winner of the match to be the
>stronger one, you will be correct about 75% of the time.
>
>If you do a match and get a 60-40 score, and you declare the winner of the match
>to be the stronger one, you will be correct about 85% of the time.
>
>If the Elo delta is higher, 60-40 is even more likely to indicate which is
>really stronger.
>
>10-0 is harder to get than 60-40, but if there is a difference between two
>engines, and you get 60-40, it seems that it's more likely to mean the right
>thing than if you get 10-0.
>
>Was I still right the first time?
>
>bruce
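The figures in the quoted post can be reproduced with a short exact-binomial sketch. It assumes no draws (as the post states), a per-game win probability of 67/128 for the stronger engine, and, for the "how often is the match winner really the stronger one" question, that either engine is equally likely to be the stronger one beforehand. The function names are illustrative only. Under these assumptions the last two printed values come out near 72% for 10-0 and 87% for 60-40, close to the "about 75%" and "about 85%" quoted.

```python
from math import comb

def score_probability(wins: int, losses: int, p: float) -> float:
    """Chance of exactly this score for an engine that wins each game
    with probability p (no draws)."""
    return comb(wins + losses, wins) * p**wins * (1 - p)**losses

def chance_winner_is_stronger(wins: int, losses: int, p_strong: float) -> float:
    """Probability that the match winner is the stronger engine, given that
    the stronger engine wins each game with probability p_strong and that we
    had no prior idea which engine that was (50/50 prior, an assumption)."""
    right = score_probability(wins, losses, p_strong)   # stronger engine won the match
    wrong = score_probability(losses, wins, p_strong)   # stronger engine lost the match
    return right / (right + wrong)

p = 67 / 128  # the post's "52% = 16 Elo points" case
for wins, losses in ((0, 10), (10, 0), (40, 60), (60, 40)):
    print(f"{wins}-{losses}: {score_probability(wins, losses, p):.2%}")

print(f"10-0 winner is stronger:  {chance_winner_is_stronger(10, 0, p):.1%}")
print(f"60-40 winner is stronger: {chance_winner_is_stronger(60, 40, p):.1%}")
```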

Like Uri, you introduce a further assumption, that the Elo difference is bounded
by some (small) value. First, this assumption has no basis when running some
program A vs. some program B.

Second, if you make what you think is a small change in a program, and get 10-0,
then the question is whether you believe the results. Because, as you say, if you
assume the Elo difference is 1 point, you have a 0.09% probability for the first
version being better and a 0.10% probability for the second version being better,
but this still leaves a 99.81% probability that your 1-point assumption was wrong.

Given these numbers, what do you choose to believe?

Amir
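One way to look at the closing question is to ask how likely a 10-0 sweep is under different assumed strength gaps. The sketch below is only an illustration: it assumes the standard logistic Elo expectation formula, no draws, and independent games, and the function names are invented for the example. It prints a likelihood for each assumed gap, not a probability that the gap is correct (that would also need a prior over gaps).

```python
def expected_score(elo_delta: float) -> float:
    """Per-game win probability for the stronger side under the standard
    logistic Elo formula (assumed here; draws are ignored)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_delta / 400.0))

def sweep_probability(elo_delta: float, games: int = 10) -> float:
    """Chance that the stronger side wins every one of `games` games."""
    return expected_score(elo_delta) ** games

# How well does each assumed Elo gap explain an observed 10-0?
for delta in (1, 16, 50, 100, 200, 400):
    print(f"Elo delta {delta:>3}: P(10-0) = {sweep_probability(delta):.2%}")
```

With a 1-point gap the sweep is about a 0.10% event, matching the figure in the post, and it becomes steadily less surprising as the assumed gap grows.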




Last modified: Thu, 15 Apr 21 08:11:13 -0700
