Computer Chess Club Archives


Subject: Re: Mchess Pro 7.1

Author: Bruce Moreland

Date: 15:22:48 12/12/97


On December 12, 1997 at 14:29:47, Willie Wood wrote:

>That's quite a turnaround for a bit of "fiddling."  I saw a couple of
>those games last night, and it didn't look to be any contest (assuming
>Klamath is mcp7).
>
>I guess those results represent too small a sample to be significant.
>How many games do you think are required to get a good sample?  I'm
>interested because, in developing my own program, I want to know how
>much testing is usually done to test program changes.  Seems like 10
>games is not enough.

The fiddling didn't do it.  My program lost another one this morning for
the same reason it lost the other two -- it let a passer get too far
advanced.  Perhaps it lost one at the WMCCC (against Junior) for the
same reason.  Obviously something I can do better.  But to say that I
did some magic thing to Ferret that made it win a few in a row would be
wrong.  I just got sick of that passer problem and tried to fix it in
the middle of the day.  I failed, apparently, since it happened again
today.

Note please, in the interests of fairness, that my computer is
significantly faster than the one running MChess.  I have a 533 MHz
Alpha, and Klamath has a 300 MHz P2.  My Alpha is something like 30%
faster; I don't remember the exact figure.  Mine is also automatic so I
pick up a few seconds now and then, and I don't try funky RxB
experiments :-)

The number of games you need to prove a point depends upon the point you
are trying to prove, and the definition of "prove" that you are using.

If you are trying to prove that program A plays more interesting chess
than program B, you have to rely on your own eyeball.

If you want to prove that program A will beat program B at least 51% of
the time on equal hardware, the number of games you have to play depends
upon the result you get in the games.

You can figure out the probability that two equal programs would
generate a given result.  If this probability falls below some
threshold, like 5% or 2% or 1%, depending upon how picky you are, you
can say that program A beats program B most of the time, with pretty
good confidence.
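
A rough sketch of that calculation in Python, under assumptions that are
mine and not measured: every game is independent, the draw rate is a
guess, and colors are ignored, so each program wins a decisive game half
the time.  The function names are made up for illustration.  It builds
the distribution of (wins minus losses) for two equal programs and adds
up the chance of a margin at least as large as the one observed.

```python
def score_margin_distribution(games, draw_rate):
    """Return {margin: probability} for wins minus losses between equal programs."""
    p_decisive = (1.0 - draw_rate) / 2.0  # chance of a win; a loss is equally likely
    dist = {0: 1.0}
    for _ in range(games):
        nxt = {}
        for margin, prob in dist.items():
            # Each game moves the margin by +1 (win), 0 (draw) or -1 (loss).
            for step, p in ((1, p_decisive), (0, draw_rate), (-1, p_decisive)):
                nxt[margin + step] = nxt.get(margin + step, 0.0) + prob * p
        dist = nxt
    return dist


def chance_of_margin_or_better(wins, losses, draws, draw_rate):
    """One-sided chance that equal programs produce at least this winning margin."""
    dist = score_margin_distribution(wins + losses + draws, draw_rate)
    observed = wins - losses
    return sum(p for margin, p in dist.items() if margin >= observed)


# Assumed draw rate of 40%; the 5% threshold is just one of the choices above.
p = chance_of_margin_or_better(wins=4, losses=0, draws=0, draw_rate=0.40)
print(p, p < 0.05)
```

With a 40% draw rate a 4-0 sweep comes out at 0.3^4 = 0.0081, which is
under any of the thresholds mentioned.  This is one-sided: it is the
chance that a particular program does this well, not that either one
does.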

If the first few games produce an extremely lop-sided result, for
instance 4-0, you can figure out the odds of this happening by chance.

For instance, assume that 40% of games are draws and 35% are won by
white and 25% are won by black.  Assuming you get a 4-0 result, and play
white twice and black twice, the odds of this happening are 0.35 * 0.35
* 0.25 * 0.25, which is 0.00765625, which is a surprisingly small
number.

Assuming my estimate of winning and drawing percentages above is
accurate (I have no idea if it is), this means that you will get a 4-0
result for a particular program less than 1% of the time, if the
programs are in fact equal in strength.
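
The arithmetic above, written out as a small Python sketch.  The win
rates are the same guesses as in the text, not measured figures, and
`sweep_probability` is just a name picked for illustration.

```python
# Guessed rates from the text: 35% white wins, 25% black wins, 40% draws.
P_WHITE_WIN = 0.35
P_BLACK_WIN = 0.25


def sweep_probability(games_as_white, games_as_black,
                      p_white_win=P_WHITE_WIN, p_black_win=P_BLACK_WIN):
    """Chance that one particular program wins every game, if the two are equal."""
    return p_white_win ** games_as_white * p_black_win ** games_as_black


p = sweep_probability(games_as_white=2, games_as_black=2)
print(f"{p:.8f}")  # 0.00765625
```

Changing the guessed rates changes the answer quite a bit, so the result
is only as good as the guess.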

Note that this isn't showing that A is a lot better than B, it is just
showing that A is at least a little better than B.  If you want to show
that A is a lot better than B, you'd need to do some different math.

Sometimes your conclusion will be wrong, but it should be fairly rare
that this is the case.

If you go through a longer match, and look for a string of four wins in
a row, you haven't proven that one program is better if you find it,
because you might just be picking the nicest cherries out of the basket.
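
To put a number on the cherry-picking problem, here is a sketch that
computes the chance of finding at least one run of four straight wins
somewhere in a longer match between equal programs.  It assumes
independent games and a flat 30% win chance per game (the average of the
35% and 25% guesses above); both are assumptions, and the function name
is hypothetical.

```python
def chance_of_win_streak(games, streak, p_win):
    """Chance that one particular program wins `streak` games in a row
    at least once during a match of `games` games."""
    # state[k] = probability that the current run of wins has length k
    # and no run of length `streak` has happened yet.
    state = [1.0] + [0.0] * (streak - 1)
    hit = 0.0
    for _ in range(games):
        nxt = [0.0] * streak
        for run, prob in enumerate(state):
            nxt[0] += prob * (1.0 - p_win)   # a draw or loss resets the run
            if run + 1 == streak:
                hit += prob * p_win          # the run just reached the target
            else:
                nxt[run + 1] += prob * p_win
        state = nxt
    return hit


P_WIN = 0.30  # assumed, see above
print(chance_of_win_streak(4, 4, P_WIN))    # a match of exactly four games
print(chance_of_win_streak(100, 4, P_WIN))  # the same streak hunted for in 100 games
```

The second number is far larger than the first, which is why a streak
picked out of a long match doesn't carry the weight of a four-game match
that ended 4-0.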

If your result isn't as lop-sided as 4-0, you can find that you need
*tremendously* long matches to prove that one program is better than
another one.  I have done very long matches (hundreds of games) between
programs, and even though one side wins distinctly more games, I can't
prove that one program is better than the other.  If you see an apparent
edge for one program, but you determine that it is 25% likely that the
program that scored worse is actually the better program, how can you
feel great about the result?  I can't.
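
A back-of-the-envelope sketch of why the matches get so long.  It uses a
normal approximation and an assumed 40% draw rate, neither of which comes
from measured data, and asks: if the match ends with a given score
fraction, how many games are needed before that score clears a one-sided
5% threshold against the idea that the programs are equal?  The function
name and parameters are made up for illustration.

```python
import math
from statistics import NormalDist


def games_needed(score_fraction, draw_rate, alpha=0.05):
    """Approximate match length at which `score_fraction` becomes significant.

    Each game scores 1, 0.5 or 0.  For equal programs the per-game
    variance of that score is (1 - draw_rate) / 4.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)  # one-sided critical value
    variance = (1.0 - draw_rate) / 4.0
    edge = score_fraction - 0.5
    return math.ceil(z * z * variance / (edge * edge))


for score in (0.60, 0.55, 0.51):
    print(score, games_needed(score, draw_rate=0.40))
```

The required length grows with the inverse square of the edge, so a
score of 51% needs on the order of thousands of games where 55% needs
well over a hundred.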

I haven't done the math today, but intuitively I would be very
mistrustful of results like 55-45, for instance.  It would be easy to
say, oh, the one program must be better than the other, but a score like
that doesn't show it.

Some people think that they can tell which of two programs is stronger,
by eyeball, but I'm mistrustful of this.  If one program beats another
7-3 (a statistically insignificant result), which one do you think they
will pick?  How often do you see computer vs computer losses in which
the loser looks good?
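
For the 7-3 and 55-45 scores mentioned above, a quick check is possible
if you assume every game was decisive (no draws) and that equal programs
each win half of those games.  That is a simplifying assumption, not
something known about those matches.  The sketch sums the binomial tail
exactly.

```python
from math import comb


def chance_of_at_least(wins, games):
    """Chance that one of two equal programs wins at least `wins`
    of `games` decisive games (a coin flip per game)."""
    favourable = sum(comb(games, k) for k in range(wins, games + 1))
    return favourable / 2 ** games


print(chance_of_at_least(7, 10))    # 176/1024 = 0.171875
print(chance_of_at_least(55, 100))
```

Both come out well above a 5% threshold (the second is roughly 18%), so
under these assumptions neither score says much about which program is
better.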

bruce



Last modified: Thu, 15 Apr 21 08:11:13 -0700
