Computer Chess Club Archives


Subject: Re: How many games are needed to find out which program is stronger?

Author: Terry Ripple

Date: 00:25:16 09/04/99


On September 04, 1999 at 01:41:02, Bruce Moreland wrote:

>On September 03, 1999 at 03:51:05, Harald Faber wrote:
>
>>On September 03, 1999 at 03:16:50, Bruce Moreland wrote:
>>
>>>On September 02, 1999 at 20:02:55, Heiko Mikala wrote:
>>>
>>>>And I say what you're saying is clearly wrong. Believe me, I learned this the
>>>>hard way during the last ten years of work on my own chess program. I often
>>>>had the case that my program convincingly won a first test match of about
>>>>30-40 games, then lost another, longer match that I let it play overnight and
>>>>during the next day. You always need the same number of games, no matter what
>>>>the score is after a first, short match. My experience after hundreds of test
>>>>matches shows that you need at least 70-80 games to be able to come to a
>>>>conclusion. And you need some hundred games to be sure. Even if the first 15
>>>>games end in a 15-0 score. Because the next 15 games may end 0-15. This is a
>>>>frustrating fact, but it is *a fact*. It's frustrating, because for us as
>>>>programmers it means that we have to do much more time-consuming testing than
>>>>we would like to do.
>>>
>>>It shouldn't work like this.  You can't take a selection from somewhere in the
>>>middle of a long run of games and use that to prove anything, but if you start
>>>out and play some games, and one program wins several games in a row, you
>>>should be able to make a safe conclusion.
>>>
>>>I would really like to understand the 15-0 and 0-15 situation.  That should
>>>*not* happen.  That's not how math should work.  If you flip a coin 15 times and
>>>it comes up heads each time, the odds of this happening completely by chance are
>>>extremely small.  The odds that it would then come up tails 15 times in a row
>>>are also extremely small, and combined they should be vanishingly small.
>>
>>Maybe Heiko exaggerated a bit, but many people have seen 9-1, 8.5-1.5 or
>>something like these results in a run of games where the overall outcome was
>>much different, more equal...
>>
>>That was the point.
>
>There needs to be a reason for this.  There does exist a rating point delta
>between two programs.  Let's say that one program "should" score 75% against
>another program, which is another way of saying that it is 200 points stronger.
>
>But there is a problem right away: there may not be very many sufficiently
>unique games the programs can play.  If my book is 1. e4 and yours is 1. ... c5,
>and we are both out of book at that point, perhaps the weaker program will win,
>and it will win the repeated game, etc., forever.
>
>With larger books than one or two moves each this should be less of a problem,
>but it will still exist.  If you talk about playing a thousand games, you are
>going to get some duplicates.
>
>But then we have the concept of book learning, and what that does to the sample.
>
>OK.  I ask you to discard all of these issues and assume that every time the
>programs play, they play a unique game.  Perhaps we need to look into those
>issues, but I'm not the one with the PGN.
>
>If the programs play 100% unique games, this notion of "real" rating point delta
>should kick in.  You should see wins, losses, and draws, in the proper
>proportion.  If you play some huge number of games, you should get an almost
>exact correspondence between the real rating point delta and the result, if you
>take this huge sample *as a whole*.
>
>Of course, in this huge game sample, you will get sub-samples that can be used
>to prove anything that you wish to prove.  You can't do this; you can't take a
>section out and say that somehow all small samples are suspect, I think.  The
>laws of probability dictate that such samples will exist, and nothing can be
>done about this.
>
>There is a chance, always, that you'll encounter one of these crazy samples
>right at the start.  But the odds that you get an extreme result due to engine
>superiority should also be calculable.  For any sample, you should be able to
>determine two numbers:  1) the number of rating points by which one engine is
>superior to the other, 2) the percent chance that the engine is really superior
>to the other one by at least that amount.  It is not necessarily true that a
>larger sample allows you to state that one engine is better than another with
>more confidence than a smaller sample does.
>
>This whole discussion should map onto coin testing, and that is why we're having
>this argument.  We are all doing these coin flipping matches, perhaps ten flips
>per match.  The coins are all approximately fair, so you should see a 10-0
>result maybe twice out of a thousand.  But it seems like people have the
>impression that these results are being reported a lot more than they should.
>
>I don't understand why this is, but I'd like to.  Perhaps it is because people
>are running 100-flip matches and reporting on these extreme 10-flip sub-matches,
>of which there are a *lot*.  You get 91 of those per 100-flip match.  The odds
>of at least one 10-0 or 0-10 sub-match in a 100-flip match are about 16%.  That
>is a lot more than 0.2%.
>
>Long matches can allow you to make stronger assertions with more confidence, but
>sometimes the reverse is true.  The odds that the weaker program will win a
>match should always be calculable, and with a long match that has a close score,
>the odds end up being *higher* that the weaker program won, than in the case of
>a short match that is a wipe out.
>
>I haven't done the math, but it is probably true that you actually have less
>chance of making a mistake when you say that A is stronger than B, if A wins a
>match 6-0, than if A wins by 55-45.
>
>I think it is important to consider that a larger sample may give a false sense
>of security in the case of a close match.  All a long close match proves is
>that one program is not massively stronger than the other; it does not prove
>which of them is stronger.
>
>bruce
--------

Hi Bruce and Heiko,
  That theory of short matches can be compared throughout chess history with
human matches! All human chess players get their ratings from years of
tournaments, which are just a few rounds at a time in each event, and the sum
of all those small events adds up to one's rating! So why should engine matches
be any different when it comes to determining their strength relative to
another engine's strength and to determining their rating?

Regards, Terry
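
As a footnote to the 15-0 coin-flip example in Bruce's post above, here is a
minimal Python sketch of the arithmetic, assuming a perfectly fair coin stands
in for two engines of exactly equal strength, with draws ignored:

# Chance of a fair coin showing the same face 15 times running, and of it
# then immediately showing the other face 15 times running.
p_15_0 = 0.5 ** 15            # one side scores 15-0 purely by chance
p_15_0_then_0_15 = 0.5 ** 30  # a 15-0 run immediately followed by a 0-15 run

print(p_15_0)             # about 0.00003, i.e. roughly 0.003%
print(p_15_0_then_0_15)   # about 0.0000000009, vanishingly small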

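The quoted rule of thumb that a 75% score means roughly 200 rating points comes
from the standard Elo expected-score curve. A small sketch of that relation and
its inverse (the exact figure for 75% is a bit under 200; the quoted number is
the usual rounding):

import math

def expected_score(delta):
    # Expected score for the stronger side under the Elo model,
    # given a rating difference of `delta` points
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

def rating_delta(score):
    # Inverse: rating difference implied by an observed score fraction
    return 400.0 * math.log10(score / (1.0 - score))

print(expected_score(200))   # about 0.76, i.e. roughly the quoted 75%
print(rating_delta(0.75))    # about 191 points for an exact 75% score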

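On the point that a 100-flip match contains 91 overlapping 10-flip sub-matches
and has roughly a 16% chance of containing a 10-0 or 0-10 stretch: the sketch
below reproduces that back-of-envelope figure and adds a quick Monte Carlo
check. Since the windows overlap, the back-of-envelope number is only an
approximation, but either way the chance is far above the 0.2% quoted for a
standalone 10-flip match.

import random

# Back-of-envelope: 91 ten-flip windows, each all-heads or all-tails with
# probability 2 / 2**10 = 1/512, treated as if they were independent.
p_window = 2 / 2 ** 10
print(1 - (1 - p_window) ** 91)    # about 0.16

# Monte Carlo check: how often does a 100-flip sequence of a fair coin
# contain at least one window of 10 identical results?
def has_one_sided_window(flips, width=10):
    return any(len(set(flips[i:i + width])) == 1
               for i in range(len(flips) - width + 1))

trials = 100_000
hits = sum(has_one_sided_window([random.randint(0, 1) for _ in range(100)])
           for _ in range(trials))
print(hits / trials)               # well above 0.002, the chance for a single 10-flip match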

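Finally, the 6-0 versus 55-45 comparison can be put in numbers with a one-sided
binomial tail: the chance of the observed score (or better) arising if the two
programs were in fact exactly equal. This is only a sketch and ignores draws.

from math import comb

def luck_probability(wins, games):
    # Chance of scoring at least `wins` out of `games` purely by luck,
    # if the two programs were exactly equal in strength (draws ignored)
    return sum(comb(games, k) for k in range(wins, games + 1)) / 2 ** games

print(luck_probability(6, 6))     # 6-0 wipeout: about 1.6%
print(luck_probability(55, 100))  # 55-45 match: roughly 18%

So the short wipeout leaves less room for the weaker program to have won by
accident than the long, close match does, which is the point made above.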