Computer Chess Club Archives


Subject: Re: How many games are needed to find out which program is stronger?

Author: Bruce Moreland

Date: 22:41:02 09/03/99

On September 03, 1999 at 03:51:05, Harald Faber wrote:

>On September 03, 1999 at 03:16:50, Bruce Moreland wrote:
>
>>On September 02, 1999 at 20:02:55, Heiko Mikala wrote:
>>
>>>And I say what you're saying is clearly wrong. Believe me, I learned this the
>>>hard way during the last ten years of work on my own chess program. I often had
>>>the case that my program convincingly won a first test match of about 30-40
>>>games, then let it play another, longer match overnight and during the next
>>>day, which it then lost. You always need the same number of games, no matter
>>>what the score is after a first, short match. My experience after hundreds of
>>>test matches shows that you need at least 70-80 games to be able to come to a
>>>conclusion. And you need some hundred games to be sure. Even if the first 15
>>>games end in a 15-0 score. Because the next 15 games may end 0-15. This is a
>>>frustrating fact, but it is *a fact*. It's frustrating because, for us as
>>>programmers, it means that we have to do much more time-consuming testing than
>>>we would like to do.
>>
>>It shouldn't work like this.  You can't take a selection from somewhere in the
>>middle of a long run of games and use that to prove anything, but if you start
>>out and play some games, and one program wins several games in a row, you
>>should be able to draw a safe conclusion.
>>
>>I would really like to understand the 15-0 and 0-15 situation.  That should
>>*not* happen.  That's not how math should work.  If you flip a coin 15 times and
>>it comes up heads each time, the odds of this happening completely by chance are
>>extremely small.  The odds that it would then come up tails 15 times in a row
>>are also extremely small, and combined they should be vanishingly small.
>
>Maybe Heiko exaggerated a bit, but many people have seen results like 9-1 or
>8.5-1.5 at first, where the eventual outcome was much different, more equal...
>
>That was the point.

There needs to be a reason for this.  Between any two programs there is some
real rating point delta.  Let's say that one program "should" score 75% against
another program, which is roughly another way of saying that it is about 200
points stronger.
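
For reference, the standard logistic Elo expectancy formula pins that
correspondence down.  A minimal sketch in Python (my own illustration; the
function name is just for this example):

def expected_score(delta: float) -> float:
    # Standard logistic Elo expectancy: expected score of the stronger side
    # as a function of the rating difference in Elo points.
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

print(round(expected_score(200), 3))  # ~0.76
print(round(expected_score(191), 3))  # ~0.75

So a 200-point edge corresponds to roughly a 76% score, and an exact 75% score
to about 191 points; close enough for this discussion.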

But there is a problem right away: there may not be very many sufficiently
unique games the programs can play.  If my book is 1. e4 and yours is 1. ... c5,
and we are both out of book at that point, perhaps the weaker program will win,
and it will win the same repeated game again and again, forever.

With books longer than one or two moves each, this should be less of a problem,
but it will still exist.  If you talk about playing a thousand games, you are
going to get some duplicates.

But then we have the concept of book learning, and what that does to the sample.

OK.  I ask you to discard all of these issues and assume that every time the
programs play, they play a unique game.  Perhaps we need to look into those
issues, but I'm not the one with the PGN.

If the programs play 100% unique games, this notion of a "real" rating point
delta should kick in.  You should see wins, losses, and draws in the proper
proportions.  If you play some huge number of games, you should get an almost
exact correspondence between the real rating point delta and the result, if you
take this huge sample *as a whole*.
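
As a quick illustration (my own sketch, assuming a fixed 75% expected score,
independent games, and no draws), a short simulation shows how much small
matches scatter while a huge match hugs the true proportion:

import random

random.seed(1)
P_TRUE = 0.75  # assumed "real" expected score per game

def match_score(games: int) -> float:
    # Fraction of games won in one simulated match.
    wins = sum(random.random() < P_TRUE for _ in range(games))
    return wins / games

print([round(match_score(20), 2) for _ in range(5)])  # five 20-game matches: noticeable scatter
print(round(match_score(10000), 3))                   # one huge match: lands very close to 0.75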

Of course, in this huge game sample, you will get sub-samples that can be used
to prove anything that you wish to prove.  But you can't do that; you can't pull
a section out of the middle and then conclude that all small samples are somehow
suspect, I think.  The laws of probability dictate that such sub-samples will
exist, and nothing can be done about this.

There is a chance, always, that you'll encounter one of these crazy samples
right at the start.  But the odds that you get an extreme result due to engine
superiority should also be calculable.  For any sample, you should be able to
determine two numbers:  1) the number of rating points by which one engine is
superior to the other, 2) the percent chance that the engine is really superior
to the other one by at least that amount.  It is not necessarily true that a
larger sample allows you to state that one engine is better than another with
more confidence than a smaller sample does.
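
One common way to attach those two numbers to a finished match (my own sketch,
not necessarily the calculation Bruce has in mind) is to invert the expectancy
formula for the implied rating delta, and to use the usual normal-approximation
"likelihood of superiority" for the confidence, ignoring draws:

import math

def implied_elo(score: float) -> float:
    # Rating delta implied by a match score (inverse of the logistic formula).
    return -400.0 * math.log10(1.0 / score - 1.0)

def likelihood_of_superiority(wins: int, losses: int) -> float:
    # Normal approximation to the chance that the winner is genuinely the
    # stronger program, given the decisive-game counts.
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

print(round(implied_elo(0.75), 1))                  # ~190.8 Elo for a 75% score
print(round(likelihood_of_superiority(15, 5), 3))   # ~0.987 after a 15-5 start
print(round(likelihood_of_superiority(55, 45), 3))  # ~0.841 after a long, close 55-45

The last two lines already hint at the point made further down: the short,
lopsided result carries more confidence than the long, close one.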

This whole discussion should map onto coin testing, and that is why we're having
this argument.  We are all doing these coin flipping matches, perhaps ten flips
per match.  The coins are all approximately fair, so you should see a 10-0
result maybe twice in a thousand matches (2 * (1/2)^10 = 2/1024, about 0.2%).
But it seems like people have the impression that these results are being
reported a lot more often than they should be.

I don't understand why this is, but I'd like to.  Perhaps it is because people
are running 100-flip matches and reporting on these extreme 10-flip sub-matches,
of which there are a *lot*.  You get 91 of those per 100-flip match.  Treating
those 91 overlapping sub-matches as roughly independent, the odds of at least
one 10-0 or 0-10 sub-match somewhere in a 100-flip match work out to about 16%.
That is a lot more than 0.2%.
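
A quick simulation (my own sketch) puts a number on that.  Because the 91
sub-matches overlap heavily, the true figure comes out somewhat below the
independent-windows estimate, but it is still far above 0.2%:

import random

random.seed(2)

def has_extreme_submatch(n_flips: int = 100, window: int = 10) -> bool:
    # One 100-flip match of a fair coin: does any of its 91 overlapping
    # 10-flip sub-matches come out 10-0 or 0-10?
    flips = [random.randint(0, 1) for _ in range(n_flips)]
    for i in range(n_flips - window + 1):
        total = sum(flips[i:i + window])
        if total == 0 or total == window:
            return True
    return False

trials = 50_000
hits = sum(has_extreme_submatch() for _ in range(trials))
print(hits / trials)  # well above the ~0.2% chance for a single 10-flip match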

Long matches can allow you to make stronger assertions with more confidence, but
sometimes the reverse is true.  The odds that the weaker program won a match
should always be calculable, and with a long match that has a close score, those
odds end up being *higher* than in the case of a short match that is a wipeout.

I haven't done the math, but it is probably true that you actually have less
chance of being wrong when you say that A is stronger than B if A wins a match
6-0 than if A wins by 55-45.
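
A back-of-envelope check (my own sketch, ignoring draws and assuming independent
games) suggests the hunch is right: if the two programs were actually equal, a
6-0 sweep would happen only about 1.6% of the time, while scoring 55-45 or
better would happen roughly 18% of the time:

from math import comb

def prob_at_least(wins: int, games: int) -> float:
    # Chance of scoring at least `wins` out of `games` when each game is a
    # fair 50/50 toss-up (draws ignored).
    return sum(comb(games, k) for k in range(wins, games + 1)) / 2 ** games

print(round(prob_at_least(6, 6), 4))     # about 0.016: a 6-0 sweep is rare between equals
print(round(prob_at_least(55, 100), 4))  # about 0.18: 55-45 or better is fairly common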

I think it is important to recognize that a larger sample may give a false sense
of security in the case of a close match.  All a long, close match proves is that
one program is not massively stronger than the other; it does not prove which of
them is stronger.

bruce


