Subject: Small number statistics and small differences

Author: Dan Homan

Date: 03:08:36 08/14/98

On August 12, 1998 at 09:15:19, Bruce Moreland wrote:

>On August 11, 1998 at 06:52:02, Tony Hedlund wrote:
>>>So I think 4-0 actually turns out to be a significant result.  If you score 4-0,
>>>you can say that there is a very good chance that the one with the wins is
>>>better than the ones with the losses.
>>>You can't say this if you pick out a string of 4 wins in a row in the midst of a
>>>longer match, since you might be selecting a fluke case, but if you just start
>>>from scratch, and get 4-0, you should be able to stop.  In fact I think you
>>>might be able to stop if you get 3.5 - 0.5, but I am less certain of this case.
>>>Someone who has more statistics than I may be willing to comment on this.
>>Recently I played the match Shredder2 P200 MMX 64MB - Rebel8 P90 16MB.
>>Rebel won the first four games but Shredder won the match with 11-9.
>That shouldn't happen very often.

The problem with small number statistics is that they can be very
misleading.  A 4-0 result in a 4-game match between nearly equal
programs (with 20% draw chances) happens about 1/40th of the time.
A 3.5-0.5 (or better) result happens about 1/13th of the time.
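
Those two fractions can be checked with a short calculation.  What follows
is only a sketch under stated assumptions: games are treated as independent,
and "nearly equal with 20% draw chances" is read as win 40%, draw 20%,
loss 40% per game.  The post gives only the draw figure, so the 40/40 split
and the function name score_distribution are illustrative choices.

```python
from fractions import Fraction
from itertools import product


def score_distribution(p_win, p_draw, games=4):
    """Return {score in half-points: probability} for one side of a match.

    Assumes independent games with fixed per-game probabilities.
    A win counts 2 half-points, a draw 1, a loss 0.
    """
    p_loss = 1 - p_win - p_draw
    per_game = {2: p_win, 1: p_draw, 0: p_loss}
    dist = {}
    # Enumerate every possible sequence of results (3**games of them).
    for seq in product(per_game, repeat=games):
        prob = 1
        for result in seq:
            prob *= per_game[result]
        total = sum(seq)
        dist[total] = dist.get(total, 0) + prob
    return dist


# Assumed model: equal programs, 20% draws, so 40% wins for each side.
dist = score_distribution(Fraction(2, 5), Fraction(1, 5))

p_sweep = dist[8]                  # 8 half-points = a 4-0 result
p_near_sweep = dist[8] + dist[7]   # 3.5-0.5 or better

print(f"P(4-0)             = {float(p_sweep):.4f}")
print(f"P(3.5-0.5 or more) = {float(p_near_sweep):.4f}")
```

Under these assumptions the 4-0 probability is 0.4^4 = 0.0256 (about 1 in 39)
and the 3.5-0.5-or-better probability is 0.0768 (about 1 in 13), in line with
the figures quoted above.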

If program A beats program B by a score of 4-0, this means that A has
a 97% (roughly) chance of being stronger than B.  So it seems like a
pretty good bet that A is better than B, but consider the following:
Say that you use this 4-game match technique to test new versions of
your program versus older versions.  Whenever you make a change you
run one of these matches and decide to keep the change only if you get
a 4-0 result.  Because you have a very well developed program, most
changes will have almost no effect on playing strength.  Even changes
that do increase the playing strength slightly will not affect the
1/40 odds of getting a 4-0 result very much.  So, you will get a
4-0 result about 1/40th of the time - regardless of whether the change
you make is good or bad.  So using these 4-game matches to decide
on playing strength increases will cause you to randomly select
which versions to keep and which to discard.
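
To put rough numbers on "will not affect the 1/40 odds very much", here is
a small sketch.  The per-game win probabilities 0.39, 0.40 and 0.41 are
hypothetical stand-ins for a slightly worse, unchanged, and slightly better
new version; they are not measurements, and games are again assumed to be
independent.  Draws do not enter, because a 4-0 requires four wins.

```python
def p_sweep(p_win, games=4):
    """Probability of winning every game, assuming independent games."""
    return p_win ** games


# Hypothetical per-game win probabilities for the new version
# against the old one (assumed values, for illustration only).
candidates = [
    ("slightly worse", 0.39),
    ("no change", 0.40),
    ("slightly better", 0.41),
]

for label, p_win in candidates:
    p = p_sweep(p_win)
    print(f"{label:16s} P(4-0) = {p:.4f}  (about 1 in {1 / p:.0f})")
```

With these assumed values the chance of a 4-0 ranges only from about 1 in 43
to about 1 in 35, so a slightly harmful change passes the test nearly as
often as a slightly helpful one.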

So a 97% confidence isn't that helpful after all - at least not for
what we chess programmers do.  The problem is that we are trying to
discriminate small differences in playing strength, and a 4-game match
just can't do that with any reliability.

 - Dan

