Computer Chess Club Archives


Subject: Re: I would call this SPORT christophe !

Author: Bruce Moreland

Date: 18:00:27 08/09/98



On August 09, 1998 at 09:59:47, Christophe Theron wrote:

>On August 07, 1998 at 14:30:32, Thorsten Czub wrote:

>>The best thing would be: 30 games against a, 30 games against b, 30 games
>>against c etc. But often there is not much time.
>>
>>So you maybe only play 10 games.
>
>NEVER. 10 games is NOT ENOUGH.
>
>I have found that a 30 games match has a +/- 5% error margin. That is a 50%
>result can mean 45% to 55%.
>
>So a 10 games match gives NO USEFUL INFORMATION (unless you are playing Chess
>Challenger 3 against Rebel 10).
>
>Of course if it is 10 games against A, 10 against B and 10 against C, it is a 30
>games match against an average opposition and it is useful. I suppose it is what
>you meant?

You guys are both not quite right, I think, although Christophe's comment about
Rebel vs Challenger 3 is pretty close.  If you are trying to prove that program
A is stronger than program B (but not by how much it is stronger), then
sometimes 30 games is nowhere near enough, and sometimes 10 games is way too
many.

If you are trying to test to see if a coin is more apt to come up heads than
tails, you can never know for absolute sure by flipping it, since any result is
possible.

But you can determine what the chance was that you would have a particular
result with a fair coin, and if the chance was low enough, you can say that the
coin probably isn't fair.

You can start flipping the coin, and if at any point you get a result that
indicates that the coin is unfair, you can stop.

For instance, you may decide to consider the coin to be unfair if you get a
result that will happen less than 5% of the time if the coin is fair.

If you flip the coin five times and it comes up heads all five times, the odds
are 1/32 that this will happen with a fair coin, and this is less than 5%, so
you can conclude that the coin isn't fair.
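
The arithmetic behind that 1/32 can be checked with a short sketch. The function name `all_heads_probability` is made up for illustration, and the test is one-sided (it only asks whether heads is favored).

```python
from fractions import Fraction

def all_heads_probability(flips):
    """Chance that a fair coin comes up heads on every one of `flips` flips."""
    return Fraction(1, 2) ** flips

p = all_heads_probability(5)
print(p, float(p))            # 1/32 0.03125
print(p < Fraction(5, 100))   # True: rarer than the 5% cutoff
```

With only four flips the same calculation gives 1/16, which is above 5%, so five in a row is the shortest all-heads run that passes this cutoff.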

You can say, "this coin is at least slightly more likely to come up heads than
tails."

If you don't get such a dramatic result, and there are tails mixed in with the
heads, it will take you a much longer time to find an unfair coin.  But when you
finally do get a 95% confidence that the coin is unfair, you are not one bit
more certain that the coin is unfair than if you flipped it five times and it
came up heads each time.

Chess isn't coin flipping.  White has a better chance of winning than Black, and
there are draws.  I don't know exactly what effect this has, but I think the
existence of draws decreases the number of trials you need to get significance if
you have a wipeout result.

So I think 4-0 actually turns out to be a significant result.  If you score 4-0,
you can say that there is a very good chance that the one with the wins is
better than the one with the losses.

You can't say this if you pick out a string of 4 wins in a row in the midst of a
longer match, since you might be selecting a fluke case, but if you just start
from scratch and get 4-0, you should be able to stop.  In fact I think you
might be able to stop if you get 3.5 - 0.5, but I am less certain of this case.
Someone who knows more statistics than I do may be willing to comment on this.
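
Here is a rough sketch of how the 4-0 and 3.5 - 0.5 cases could be checked. It assumes two equally strong programs, a fixed draw rate (the 30% used below is a made-up figure, not a measured one), and it ignores the advantage of the white pieces. The name `tail_probability` is hypothetical.

```python
from itertools import product

def tail_probability(games, min_score, draw_rate):
    """P(score >= min_score) over `games` games between equal opponents.

    Assumption: each game is independent, a draw happens with probability
    `draw_rate`, and the rest is split evenly between a win and a loss.
    """
    win = loss = (1.0 - draw_rate) / 2.0
    outcomes = [(1.0, win), (0.5, draw_rate), (0.0, loss)]
    total = 0.0
    # Enumerate every win/draw/loss sequence (3**games of them).
    for combo in product(outcomes, repeat=games):
        score = sum(points for points, _ in combo)
        if score >= min_score:
            prob = 1.0
            for _, p in combo:
                prob *= p
            total += prob
    return total

print(tail_probability(4, 4.0, 0.0))   # no draws: 0.0625, just over 5%
print(tail_probability(4, 4.0, 0.3))   # 30% draws: about 0.015
print(tail_probability(4, 3.5, 0.3))   # 30% draws: about 0.066
```

Under these assumptions the draws do help the 4-0 case: without draws a 4-0 sweep has a 1/16 chance between equals, which misses the 5% cutoff, while with 30% draws it drops to about 1.5%. The 3.5 - 0.5 result stays above 5% at that draw rate, so whether it counts depends on how drawish the two programs are.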

Now, if you do 30 games, you might think you are safe, but you are probably not.
I'm sure you can get some results where one side wins by a few games, perhaps
even quite a few games, and you still have not proven with reasonable
confidence that one program is stronger than the other.
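
As a rough illustration, treat a 30-game match as 30 coin flips (a simplification: no draws and no color advantage) and ask how often equal programs would produce a given margin. The helper name `upper_tail` is made up for this sketch.

```python
from math import comb

def upper_tail(wins, games):
    """P(X >= wins) when X counts heads in `games` flips of a fair coin."""
    return sum(comb(games, k) for k in range(wins, games + 1)) / 2 ** games

print(upper_tail(19, 30))   # about 0.100: 19-11 does not reach the 5% cutoff
print(upper_tail(20, 30))   # about 0.049: 20-10 only just does
```

So in this simplified model a 19-11 win, eight games clear, still happens about one time in ten between equal opponents.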

And remember that what I'm talking about is "stronger", not "markedly
stronger".  If you get 4-0 you don't prove that one program is hundreds of
points stronger than the other one, just that it is at least slightly stronger.

I'm sure there is a way you can say, "I have shown that there is a 95% chance
that program A is at least 30 Elo points stronger than program B", but I'm not
sure exactly how to do it with confidence.  And I think that in practice, in
order to show this, program A has to beat program B pretty badly, although less
badly if you do more trials.
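
One possible way to get at a statement like that is sketched below. It is not a definitive method: it uses the normal approximation to a one-sided 95% confidence bound on the score fraction, treats every game as a win or a loss (draws would shrink the variance), and converts the bound to Elo with the usual logistic formula. Strictly it gives a confidence bound rather than "a 95% chance". The function names are hypothetical.

```python
from math import log10, sqrt

def elo_from_score(score):
    """Elo difference implied by an expected score strictly between 0 and 1."""
    return -400.0 * log10(1.0 / score - 1.0)

def elo_lower_bound(points, games, z=1.645):
    """One-sided lower bound on the Elo difference (z=1.645 is about 95%).

    Assumption: normal approximation with binomial variance, which is
    crude for short matches and ignores draws.
    """
    score = points / games
    std_err = sqrt(score * (1.0 - score) / games)
    lower = score - z * std_err
    if lower <= 0.0:
        return float("-inf")
    if lower >= 1.0:
        return float("inf")
    return elo_from_score(lower)

print(elo_lower_bound(21, 30))   # 21/30: lower bound is above 30 Elo
print(elo_lower_bound(18, 30))   # 18/30: lower bound is below 0 Elo
```

In this sketch a 21-9 result over 30 games is needed before the lower bound clears 30 Elo, while 18-12 does not even show that the winner is stronger, which fits the point that the margin has to be large unless you play more games.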

Some simple statistical concepts were regarded as top secrets during World War
II, because they allowed researchers to prove that one drug was effective with
sometimes very few trials.  Not everyone had figured this out by then.

bruce




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.