Computer Chess Club Archives


Subject: Re: How many games are needed to find out which program is stronger?

Author: James T. Walker

Date: 13:05:24 09/04/99



On September 04, 1999 at 01:41:02, Bruce Moreland wrote:

>On September 03, 1999 at 03:51:05, Harald Faber wrote:
>
>>On September 03, 1999 at 03:16:50, Bruce Moreland wrote:
>>
>>>On September 02, 1999 at 20:02:55, Heiko Mikala wrote:
>>>
>>>>And I say what you're saying is clearly wrong. Believe me, I learned this the
>>>>hard way during the last ten years of work on my own chess program. I often had
>>>>the case that in a first test match of about 30-40 games my program convincingly
>>>>won a match, then let it play another, longer match overnight and during the
>>>>next day, which it then lost. You always need the same number of games, no
>>>>matter how the score is after a first, short match. My experience after hundreds
>>>>of test matches shows that you need at least 70-80 games to be able to come to
>>>>a conclusion. And you need some hundred games to be sure. Even if the first 15
>>>>games end in a 15-0 score. Because the next 15 games may end 0-15. This is a
>>>>frustrating fact, but it is *a fact*. It's frustrating because for us as
>>>>programmers it means that we have to do much more time-consuming testing than
>>>>we would like to do.
>>>
>>>It shouldn't work like this.  You can't take a selection from somewhere in the
>>>middle of a long run of games and use that to prove anything, but if you start
>>>out and play some games, and one program wins several games in a row, you
>>>should be able to draw a safe conclusion.
>>>
>>>I would really like to understand the 15-0 and 0-15 situation.  That should
>>>*not* happen.  That's not how math should work.  If you flip a coin 15 times and
>>>it comes up heads each time, the odds of this happening completely by chance are
>>>extremely small.  The odds that it would then come up tails 15 times in a row
>>>are also extremely small, and combined they should be vanishingly small.
>>
>>Maybe Heiko exaggerated a bit, but many people have seen 9-1, 8.5-1.5 or
>>similar results in a row, where the eventual outcome was much different, more
>>equal...
>>
>>That was the point.
>
>There needs to be a reason for this.  There does exist a rating point delta
>between two programs.  Let's say that one program "should" score 75% against
>another program, which is another way of saying that it is 200 points stronger.
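The 75%-for-200-points correspondence can be checked against the Elo expected-score formula (assuming the now-standard logistic form, which the post itself doesn't spell out); a minimal sketch:

```python
def expected_score(delta):
    # Standard Elo logistic curve: expected score for a player rated
    # `delta` points above the opponent.
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

print(round(expected_score(200), 3))  # roughly 0.76, close to the 75% above
```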
>
>But there is a problem right away: there may not be very many sufficiently
>unique games the programs can play.  If my book is 1. e4 and yours is 1. ... c5,
>and we are both out of book at that point, perhaps the weaker program will win,
>and it will win the repeated game, etc., forever.
>
>With larger books than one or two moves each this should be less of a problem,
>but it will still exist.  If you talk about playing a thousand games, you are
>going to get some duplicates.
>
>But then we have the concept of book learning, and what that does to the sample.
>
>OK.  I ask you to discard all of these issues and assume that every time the
>programs play, they play a unique game.  Perhaps we need to look into those
>issues, but I'm not the one with the PGN.
>
>If the programs play 100% unique games, this notion of "real" rating point delta
>should kick in.  You should see wins, losses, and draws, in the proper
>proportion.  If you play some huge number of games, you should get an almost
>exact correspondence between the real rating point delta and the result, if you
>take this huge sample *as a whole*.
>
>Of course, in this huge game sample, you will get sub-samples that can be used
>to prove anything that you wish to prove.  But I think you can't take such a
>section out and conclude that somehow all small samples are suspect.  The
>laws of probability dictate that such sub-samples will exist, and nothing can be
>done about this.
>
>There is a chance, always, that you'll encounter one of these crazy samples
>right at the start.  But the odds that you get an extreme result due to engine
>superiority should also be calculable.  For any sample, you should be able to
>determine two numbers:  1) the number of rating points by which one engine is
>superior to the other, 2) the percent chance that the engine is really superior
>to the other one by at least that amount.  It is not necessarily true that a
>larger sample allows you to state that one engine is better than another with
>more confidence than a smaller sample does.
>
>This whole discussion should map onto coin testing, and that is why we're having
>this argument.  We are all doing these coin flipping matches, perhaps ten flips
>per match.  The coins are all approximately fair, so you should see a 10-0
>result maybe twice out of a thousand.  But it seems like people have the
>impression that these results are being reported a lot more than they should.
>
>I don't understand why this is, but I'd like to.  Perhaps it is because people
>are running 100-flip matches and reporting on these extreme 10-flip sub-matches,
>of which there are a *lot*.  You get 91 of those per 100-flip match.  The odds
>of at least one 10-0 or 0-10 sub-match in a 100-flip match are about 16%.  That
>is a lot more than 0.2%.
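Both figures here can be checked. A single 10-flip match ends 10-0 or 0-10 with probability 2 / 2**10, about 0.2%, as stated. The 16% figure treats the 91 overlapping windows as independent (1 - (1 - 2/1024)**91 is about 0.163); since neighboring windows share flips, the true chance of at least one extreme window is lower, around 9% by simulation, which is still vastly more than 0.2%, so the argument stands. A quick Monte Carlo sketch:

```python
import random

random.seed(1)

def has_uniform_window(flips, k=10):
    # True if any k consecutive flips are identical, i.e. the match
    # contains a 10-0 or 0-10 sub-match among its len(flips) - k + 1 windows.
    return any(len(set(flips[i:i + k])) == 1 for i in range(len(flips) - k + 1))

trials = 20000
# A single 10-flip match ending 10-0 either way: theory says 2 / 2**10.
single = sum(len(set(random.choices('HT', k=10))) == 1 for _ in range(trials))
# At least one extreme 10-flip window inside a 100-flip match.
windowed = sum(has_uniform_window(random.choices('HT', k=100)) for _ in range(trials))

print(single / trials)    # about 0.002
print(windowed / trials)  # about 0.09, below the 16% independence estimate
```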
>
>Long matches can allow you to make stronger assertions with more confidence, but
>sometimes the reverse is true.  The odds that the weaker program will win a
>match should always be calculable, and with a long match that has a close score,
>the odds end up being *higher* that the weaker program won, than in the case of
>a short match that is a wipeout.
>
>I haven't done the math, but it is probably true that you actually have less
>chance of making a mistake when you say that A is stronger than B, if A wins a
>match 6-0, than if A wins by 55-45.
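Bruce's hunch here checks out under a simple sign-test model (treating each decisive game as a fair coin flip and ignoring draws, which is an assumption): if the programs were really equal, a 6-0 sweep has probability 1/64, while scoring 55 or more out of 100 has probability around 18%. A sketch:

```python
from math import comb

def p_value(wins, games):
    # Chance that the winner scores `wins` or more out of `games`
    # decisive games if the two programs are actually equal.
    return sum(comb(games, k) for k in range(wins, games + 1)) / 2 ** games

print(p_value(6, 6))     # 1/64 = 0.015625
print(p_value(55, 100))  # about 0.18
```

So the 6-0 sweep really is stronger evidence of superiority than the 55-45 result, exactly as the paragraph argues.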
>
>I think it is important to consider that a larger sample may give a
>false sense of security in the case of a close match.  All a long, close match
>proves is that one program is not massively stronger than the other; it does not
>prove which of them is stronger.
>
>bruce

Hello Bruce,
I think I understand what you are saying, but I also think your last couple of
paragraphs are contrary to statistical probability.  I believe that statistics
show that the more data you have, the more probable it is that your conclusion
is correct.
Jim Walker




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.