Computer Chess Club Archives


Subject: Re: I'm wrong about 10-0 vs 60-40

Author: Bruce Moreland

Date: 14:49:02 02/03/01


On February 01, 2001 at 17:08:36, Amir Ban wrote:

>On January 31, 2001 at 20:17:17, Bruce Moreland wrote:
>
>>I expressed very forcefully that a 10-0 result was more valid than a 60-40
>>result.
>>
>>I've done some experimental tests and it appears that I'm wrong.
>>
>
>No, you were right the first time. Check again.
>
>10-0 gets better than 99.9% confidence for the winner to be better.
>
>60-40 has about 95% confidence.
>
>To calculate confidence, you assume the null hypothesis, which is that the
>result is NOT significant and is a random occurrence between equals. You
>calculate the probability for that, and subtract from 1 to get confidence.
>
>Amir
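
A minimal sketch of the recipe quoted above: assume the null hypothesis (equal
strength, every game a coin flip, no draws), compute how likely a result at
least this lopsided is, and subtract from 1. The quote does not say whether the
tail is one-sided or two-sided, so printing both is an assumption of this
sketch, and the name tail_at_least is made up for illustration.

```python
from math import comb

def tail_at_least(wins, games, p_win=0.5):
    """P(X >= wins) for X ~ Binomial(games, p_win)."""
    return sum(
        comb(games, k) * p_win**k * (1.0 - p_win) ** (games - k)
        for k in range(wins, games + 1)
    )

for wins, games in [(10, 10), (60, 100)]:
    one_sided = tail_at_least(wins, games)
    # With p_win = 0.5 the distribution is symmetric, so the two-sided
    # tail is just twice the one-sided tail (capped at 1).
    two_sided = min(1.0, 2.0 * one_sided)
    print(f"{wins}-{games - wins}: "
          f"one-sided confidence {1.0 - one_sided:.4%}, "
          f"two-sided confidence {1.0 - two_sided:.4%}")
```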

I've been dealing with a fever for the past two days so I haven't come back to
this.

I think this stuff is all very important.  I have seen endless conclusions about
computer chess strength, which are based upon intuition and common sense, which
I think means that they are often wrong.

We're clearly working in the realm of statistics here, but I think that most
people aren't interested in doing proper statistical analysis.

I want to try to change this, but I admit that I am not qualified. I have some
math ability, but I haven't taken a statistics course, and I don't know any
experts on the subject.

I am apt to make a lot of mistakes, but I am happy to continue to try to work
this out, especially if others think that it is worth figuring out, too.

Here is what prompted me to write the base post.  If you have a minuscule Elo
difference between the two opponents, you'll get 0-10 about 0.10% of the time,
and you'll get 10-0 about 0.10% of the time.  You'll get 40-60 1.08% of the
time, and 60-40 1.08% of the time.

So if you know that the two programs are the same strength, or very close to the
same strength, either of these results proves nothing, since the odds of getting
either purely by chance are the same.
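
The equal-strength percentages above are plain binomial probabilities. Here is
a small sketch that reproduces them, assuming independent games, no draws, and
a 50% win chance per game; score_probability is a name invented for this
example.

```python
from math import comb

def score_probability(wins, losses, p_win):
    """Probability of exactly this score in a match of wins + losses games."""
    games = wins + losses
    return comb(games, wins) * p_win**wins * (1.0 - p_win) ** losses

# Equal opponents: each game is a fair coin flip.
for wins, losses in [(10, 0), (0, 10), (60, 40), (40, 60)]:
    print(f"{wins}-{losses}: {score_probability(wins, losses, 0.5):.2%}")
```

Because p_win is 0.5, a score and its mirror image always come out equally
likely, which is the point of the paragraph above.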

Example:

You change one eval term in your engine, and then run a test.  You happen to get
a score of 10-0, which should be a very rare score.  You didn't change the
strength of the engine by very much, if at all, but you get this weird result.
Let's assume an Elo delta that's one point.  The odds of 0-10 are 0.09%, the
odds of 10-0 are 0.10%.  So even though these are very rare results, we expect
to get the two extreme results at about the same rate.  We happened to get one
of them.  We stood just about the same chance of getting the other.
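
To turn an Elo delta into per-game odds you need a conversion formula. The
sketch below assumes the usual logistic Elo expectation, 1 / (1 + 10^(-d/400)),
which the post does not spell out, and again assumes no draws. The function
name expected_score is made up for this example.

```python
def expected_score(elo_delta):
    """Per-game win chance for the side that is elo_delta points stronger.

    Assumes the standard logistic Elo curve and no draws.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_delta / 400.0))

p_win = expected_score(1.0)          # one Elo point stronger
sweep_for = p_win**10                # stronger side wins 10-0
sweep_against = (1.0 - p_win) ** 10  # stronger side loses 0-10

print(f"per-game win chance: {p_win:.4%}")
print(f"10-0: {sweep_for:.2%}   0-10: {sweep_against:.2%}")
```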

If you make the programs a little visibly different in strength, things change,
but they don't change the way I expected.  For my first test I arbitrarily
picked 67/128 = 52% = 16 Elo points.

In this case, you'll get 0-10 0.06% of the time, and 10-0 0.15% of the time.
You'll get 40-60 0.39% of the time, but you'll get 60-40 2.45% of the time.
These numbers assume a 0% draw rate, and are derived by math, not by
simulation, but the simulation I did backs these numbers up.
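
For anyone who wants to repeat the check, here is a rough sketch that computes
the exact figures and then compares them with a Monte Carlo run. It is not the
original simulation: the logistic Elo formula, the 16-point delta, the match
counts, and the seed are all assumptions chosen for illustration.

```python
import random
from collections import Counter
from math import comb

def expected_score(elo_delta):
    """Logistic Elo expectation, assuming no draws."""
    return 1.0 / (1.0 + 10.0 ** (-elo_delta / 400.0))

def exact_probability(wins, losses, p_win):
    """Binomial probability of exactly this score."""
    return comb(wins + losses, wins) * p_win**wins * (1.0 - p_win) ** losses

def simulate_win_counts(games, p_win, matches, rng):
    """Play `matches` matches of `games` games; tally wins per match."""
    counts = Counter()
    for _ in range(matches):
        counts[sum(rng.random() < p_win for _ in range(games))] += 1
    return counts

P_WIN = expected_score(16)   # stronger side, 16 Elo points ahead
SHORT_MATCHES = 100_000      # number of simulated 10-game matches
LONG_MATCHES = 30_000        # number of simulated 100-game matches
rng = random.Random(2001)    # fixed seed so the run is repeatable

short_counts = simulate_win_counts(10, P_WIN, SHORT_MATCHES, rng)
long_counts = simulate_win_counts(100, P_WIN, LONG_MATCHES, rng)

for wins, losses, counts, matches in [
    (10, 0, short_counts, SHORT_MATCHES),
    (0, 10, short_counts, SHORT_MATCHES),
    (60, 40, long_counts, LONG_MATCHES),
    (40, 60, long_counts, LONG_MATCHES),
]:
    exact = exact_probability(wins, losses, P_WIN)
    simulated = counts[wins] / matches
    print(f"{wins}-{losses}: exact {exact:.2%}, simulated {simulated:.2%}")
```

The rare scores need a lot of simulated matches before the frequencies settle
down, so expect the simulated column to wobble around the exact one.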

What does this mean?

If you know the two programs are 16 Elo points apart but not which one is
stronger, and you do a match and happen to get a 10-0 score, and you declare the
winner of the match to be the stronger one, you will be correct about 71% of the
time (0.15% out of the 0.15% + 0.06% of matches that end in a sweep either way).

If you do a match and get a 60-40 score, and you declare the winner of the match
to be the stronger one, you will be correct about 86% of the time (2.45% out of
2.45% + 0.39%).

If the Elo delta is higher, 60-40 is even more likely to indicate which is
really stronger.

10-0 is harder to get than 60-40, but if there is a difference between two
engines, and you get 60-40, it seems that it's more likely to mean the right
thing than if you get 10-0.
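
The comparison can be sketched directly: given a score, how often is the match
winner really the stronger program? This assumes the size of the Elo gap is
known, that before the match either program is equally likely to be the
stronger one, the logistic Elo formula, and no draws. The function names are
hypothetical.

```python
def expected_score(elo_delta):
    """Logistic Elo expectation, assuming no draws."""
    return 1.0 / (1.0 + 10.0 ** (-elo_delta / 400.0))

def chance_winner_is_stronger(wins, losses, elo_delta):
    """Chance that the side scoring wins-losses is the stronger one.

    Assumes a 50/50 prior on which side is stronger. The binomial
    coefficient is the same in both cases, so it cancels out.
    """
    p = expected_score(elo_delta)
    q = 1.0 - p
    winner_is_stronger = p**wins * q**losses
    winner_is_weaker = q**wins * p**losses
    return winner_is_stronger / (winner_is_stronger + winner_is_weaker)

for delta in (1, 16, 50):
    print(f"Elo delta {delta}: "
          f"10-0 -> {chance_winner_is_stronger(10, 0, delta):.1%}, "
          f"60-40 -> {chance_winner_is_stronger(60, 40, delta):.1%}")
```

Under these assumptions only the margin (wins minus losses) enters the result,
so a 20-game margin carries more evidence about which side is stronger than a
10-game margin, however rare the 10-0 score itself is.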

Was I still right the first time?

bruce




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.