Computer Chess Club Archives


Subject: Re: I'm wrong about 10-0 vs 60-40

Author: Bruce Moreland

Date: 16:23:33 02/04/01



On February 04, 2001 at 17:20:33, Amir Ban wrote:

>On February 03, 2001 at 17:49:02, Bruce Moreland wrote:
>
>>On February 01, 2001 at 17:08:36, Amir Ban wrote:
>>
>>>On January 31, 2001 at 20:17:17, Bruce Moreland wrote:
>>>
>>>>I expressed very forcefully that a 10-0 result was more valid than a 60-40
>>>>result.
>>>>
>>>>I've done some experimental tests and it appears that I'm wrong.
>>>>
>>>
>>>No, you were right the first time. Check again.
>>>
>>>10-0 gets better than 99.9% confidence for the winner to be better.
>>>
>>>60-40 has about 95% confidence.
>>>
>>>To calculate confidence, you assume the null hypothesis, which is that the
>>>result is NOT significant and is a random occurrence between equals. You
>>>calculate the probability for that, and subtract from 1 to get confidence.
>>>
>>>Amir
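
Amir's confidence figures can be reproduced with a binomial tail sum. This is a sketch assuming no draws and a fixed null hypothesis of p = 0.5 (`tail_prob` is just a name chosen here):

```python
from math import comb

def tail_prob(wins, games, p=0.5):
    """Probability of at least `wins` wins in `games` games (binomial, no draws)."""
    return sum(comb(games, k) * p**k * (1 - p)**(games - k)
               for k in range(wins, games + 1))

# Null hypothesis: the two programs are equal (p = 0.5).
luck_10_0  = tail_prob(10, 10)    # ~0.001, so ~99.9% confidence
luck_60_40 = tail_prob(60, 100)   # ~0.028 one-sided
```

1 - tail_prob(10, 10) gives about 99.9% for 10-0; for 60-40 the one-sided tail is about 2.8%, and doubling it for a two-sided test gives roughly the 95% confidence quoted above.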
>>
>>I've been dealing with a fever for the past two days so I haven't come back to
>>this.
>>
>>I think this stuff is all very important.  I have seen endless conclusions about
>>computer chess strength that are based upon intuition and common sense, which
>>I think means they are often wrong.
>>
>>We're clearly working in the realm of statistics here, but I think that most
>>people aren't interested in doing proper statistical analysis.
>>
>>I want to try to change this, but I admit that I am not qualified. I have some
>>math ability, but I haven't taken a statistics course, and I don't know any
>>experts on the subject.
>>
>>I am apt to make a lot of mistakes, but I am happy to continue to try to work
>>this out, especially if others think that it is worth figuring out, too.
>>
>>Here is what prompted me to write the base post.  If you have a minuscule Elo
>>difference between the two opponents, you'll get 0-10 about 0.10% of the time,
>>and you'll get 10-0 about 0.10% of the time.  You'll get 40-60 1.08% of the
>>time, and 60-40 1.08% of the time.
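
The 0.10% and 1.08% figures above are binomial point probabilities; a minimal check, assuming p = 0.5 and no draws (`exact_score` is just a name chosen here):

```python
from math import comb

def exact_score(wins, games, p=0.5):
    """Probability of exactly `wins` wins out of `games` (binomial, no draws)."""
    return comb(games, wins) * p**wins * (1 - p)**(games - wins)

# With p = 0.5 the distribution is symmetric, so 0-10 and 10-0 are equally
# likely, as are 40-60 and 60-40.
p_0_10  = exact_score(0, 10)      # ~0.0010  (0.10%)
p_60_40 = exact_score(60, 100)    # ~0.0108  (1.08%)
```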
>>
>>So if you know that the two programs are the same strength, or very close to the
>>same strength, either of these results proves nothing, since the odds of getting
>>either purely by chance are the same.
>>
>>Example:
>>
>>You change one eval term in your engine, and then run a test.  You happen to get
>>a score of 10-0, which should be a very rare score.  You didn't change the
>>strength of the engine by very much, if at all, but you get this weird result.
>>Let's assume an Elo delta of one point.  The odds of 0-10 are 0.09%, the
>>odds of 10-0 are 0.10%.  So even though these are very rare results, we expect
>>to get the two extreme results at about the same rate.  We happened to get one
>>of them.  We stood just about the same chance of getting the other.
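
The 0.09% and 0.10% figures follow from converting the 1-point Elo delta to an expected score with the standard logistic formula; a quick sketch (`elo_to_p` is just a name chosen here, draws ignored):

```python
def elo_to_p(delta):
    """Expected score for the stronger side under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** (-delta / 400.0))

p = elo_to_p(1)              # ~0.5014 for a 1-point delta
ten_zero = p ** 10           # ~0.0010  (0.10%)
zero_ten = (1 - p) ** 10     # ~0.0009  (0.09%)
```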
>>
>>If you make the programs a little visibly different in strength, things change,
>>but they don't change the way I expected.  For my first test I arbitrarily
>>picked 67/128 = 52% = 16 Elo points.
>>
>>In this case, you'll get 0-10 0.06% of the time, and 10-0 0.15% of the time.
>>You'll get 40-60 0.39% of the time, but you'll get 60-40 2.45% of the time.
>>These numbers include 0% draw percentage, and are derived by math, not by
>>simulation, but the simulation I did backs these numbers up.
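
These figures can be reproduced directly from p = 67/128; a sketch assuming no draws (exact values land within rounding of the percentages above):

```python
from math import comb

p = 67 / 128          # expected score for the stronger side: ~52.3%, ~16 Elo
q = 1 - p

ten_zero    = p ** 10                           # ~0.15%
zero_ten    = q ** 10                           # ~0.06%
sixty_forty = comb(100, 60) * p**60 * q**40     # ~2.5%
forty_sixty = comb(100, 60) * p**40 * q**60     # ~0.4%
```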
>>
>>What does this mean?
>>
>>If you know the two programs are 16 Elo points apart, and you do a match and
>>happen to get a 10-0 score, and you declare the winner of the match to be the
>>stronger one, you will be correct about 75% of the time.
>>
>>If you do a match and get a 60-40 score, and you declare the winner of the match
>>to be the stronger one, you will be correct about 85% of the time.
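
These percentages follow from Bayes' rule with a 50/50 prior on which program is the 16-point-stronger one. A sketch (the exact outputs land near, though not exactly on, the rounded figures above):

```python
from math import comb

p = 67 / 128   # expected score of the stronger program (~16 Elo)
q = 1 - p

# If we know one program is 16 Elo stronger, but not which, a 50/50 prior
# plus Bayes' rule gives the chance that the match winner is the stronger one.
p10_strong = p ** 10                    # P(winner sweeps 10-0 | winner is stronger)
p10_weak   = q ** 10                    # P(winner sweeps 10-0 | winner is weaker)
posterior_10_0 = p10_strong / (p10_strong + p10_weak)     # ~0.72

p60_strong = comb(100, 60) * p**60 * q**40    # P(60-40 | winner is stronger)
p60_weak   = comb(100, 60) * p**40 * q**60    # P(60-40 | winner is weaker)
posterior_60_40 = p60_strong / (p60_strong + p60_weak)    # ~0.87
```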
>>
>>If the Elo delta is higher, 60-40 is even more likely to indicate which is
>>really stronger.
>>
>>10-0 is harder to get than 60-40, but if there is a difference between two
>>engines, and you get 60-40, it seems that it's more likely to mean the right
>>thing than if you get 10-0.
>>
>>Was I still right the first time?
>>
>>bruce
>
>Like Uri, you introduce a further assumption, that the ELO difference is bound
>by some (small) value. First, this assumption has no basis when running some
>program A vs. some program B.
>
>Second, if you make what you think is a small change in a program, and get 10-0,
>then the question is whether you believe in results. Because, as you say, if you
>assume the ELO difference is 1 point, you have 0.09% probability for first
>version being better, 0.10% probability for second version to be better, but
>this still leaves 99.81% probability that your 1-point assumption was wrong.
>
>Given these numbers, what do you choose to believe ?
>
>Amir

If that really happened?  I would suspect my test setup first.  There's no way
I'd simply accept the result and declare that this proves that the one is at
least a little better than the other.  If I got this result between dissimilar
programs I'd believe it.  If I knew they were close I don't see how I could
believe it, but I don't have much mathematical basis for this statement.

I wouldn't try to test a one-line eval change, of course.

bruce






Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.