Computer Chess Club Archives


Subject: Re: I will continue the match until there is a difference of 7 games

Author: Bruce Moreland

Date: 00:03:43 12/21/00


On December 20, 2000 at 20:05:39, Uri Blass wrote:

>On December 20, 2000 at 19:06:18, Bruce Moreland wrote:
>
>>On December 20, 2000 at 12:17:19, Uri Blass wrote:
>>
>>>I think that 25 out of 32 is more significant than 107 out of 200.
>>
>>I don't think it is a matter of opinion.
>>
>>You have two programs, A and B.  They play 32 games.  Each game is either won or
>>lost.  If one side doesn't score 25 or more, you repeat.  If one side scores 25
>>or more, you stop and call that program stronger.
>>
>>You do the same thing with 200 games and use 107 as your stop score.
>>
>>My experiments showed that for many different rating differences, the odds of
>>making a mistake were about the same.  For instance, if there is a rating
>>difference of 25 Elo points, in the 200 case the weaker side will score at least
>>107 out of 200 about 7% of the time that someone does it, which will lead you to
>>a wrong conclusion.  In the 32 case, the weaker side will score 25 about 8% of
>>the time that someone does it, likewise leading you to a wrong conclusion.
>
>You are right that if you know before testing that the difference is small, then
>25-7 is not very convincing evidence of which program is better, and that
>seems to be the case when programmers make an upgrade.
>
>In this case 25-7 for the new version is not convincing but 25-7 for the old
>version seems to be more convincing because if you see this kind of result you
>can suspect that the new version has a bug.

I'm not sure what this means.  If you know that the distance is big, why test
anything?  You already know the answer.

If the new version is 25 Elo points weaker than the old one, there is still a
10% chance that it will be the one returning this result, if either returns this
result.
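
A rough way to check a figure like this is to compute it directly rather than
by experiment.  The sketch below assumes the standard logistic Elo expectation,
independent games that are each won or lost (no draws, as in the setup quoted
above), and a stop score above half the games so that only one side can reach
it.  The function names are made up for illustration, and since the original
experiments aren't described in detail, the numbers it prints need not match
the ones quoted in this thread.

```python
from math import comb


def expected_score(elo_diff):
    """Per-game win chance for the side that is elo_diff points stronger.

    Assumption: standard logistic Elo curve, with draws ignored.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def tail_prob(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k wins in n games."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def wrong_conclusion_prob(n, stop_score, elo_diff):
    """Chance that the weaker side is the one reaching stop_score,
    given that one of the two sides reaches it.

    Requires stop_score > n / 2, so both sides cannot reach it in one match.
    Repeating the match until someone reaches stop_score gives the same ratio.
    """
    p_strong = expected_score(elo_diff)
    strong_hits = tail_prob(n, stop_score, p_strong)
    weak_hits = tail_prob(n, stop_score, 1.0 - p_strong)
    return weak_hits / (weak_hits + strong_hits)


if __name__ == "__main__":
    # The two cases discussed in the thread, at a 25 Elo difference.
    for n, stop_score in [(32, 25), (200, 107)]:
        print(n, stop_score, round(wrong_conclusion_prob(n, stop_score, 25), 3))
```

Changing the third argument shows how quickly the error rate falls as the
assumed rating gap grows, which is the point about needing to know roughly how
far apart the programs are before reading much into a score.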

So what is indicated by this?  It's not indicated that the new one is 200 points
stronger than the old one.  But you can say that the new version probably
isn't wrecked.

That's why I have a hard time swallowing match results.  Someone gets a big
score in a match and says, "I've proven that this one is way better!"  No.  The
best you've done in many cases is show that the one that won the match is
probably not much worse than the other one.  Maybe.

bruce

>In practice, if I see a 25-7 result between different programs, I suspect that the
>difference is clearly bigger than 200 Elo, so the result seems very convincing to
>me, because I have no prior opinion that the difference is small.
>
>Uri


