Computer Chess Club Archives


Subject: Re: I will continue the match until there is a difference of 7 games

Author: Uri Blass

Date: 09:17:19 12/20/00


On December 19, 2000 at 18:26:21, Bruce Moreland wrote:

>On December 18, 2000 at 21:04:37, Christophe Theron wrote:
>
>>On December 18, 2000 at 17:43:43, Severi Salminen wrote:
>>
>>>On December 18, 2000 at 10:48:49, Jorge Pichard wrote:
>>>
>>>>On December 18, 2000 at 09:55:42, Severi Salminen wrote:
>>>>
>>>>>>I agree with you that 24 games isn't enough, but 200 games is not really
>>>>>>necessary if one of the two programs reaches a difference of over 7 games,
>>>>>>at which point I will stop the match. More likely this won't happen, since
>>>>>>these two programs are too evenly matched so far.
>>>>>
>>>>>I don't understand. Where do you get that 7? Are you saying that the result
>>>>>104-96 is significant? Or, even worse, 16-8 (this means nothing in practice)?
>>>>>Why not 8, 25 or 10056? I think there is no point to stop when difference is
>>>>>something. There _is_ a point to run a match with many games (500+). The closer
>>>>>the two programs are the more games you need to show the true difference. Also
>>>>>the learning abilities of both programs have to be taken into account. The chess
>>>>>community still seems to lack the knowledge on how to measure the strength
>>>>>difference between two programs...
>>>>>
>>>>>Severi
>>>>
>>>>Okay I will run this tourney up to 200 games, and will post the result as soon
>>>>as the tourney is over, or will Email the PGN games to anybody interested.
>>>
>>>That begins to sound interesting. A 200-game match still has some margin of
>>>error, but we'll see a lot from that result. I'm looking forward to the results -
>>>not too often someone runs a 200+ game match here in CCC, thanks!
>>>
>>>Severi
>>
>>
>>
>>On 200 games, the margin of error for 80% reliability is +/-3.5%.
>>For 70% reliability it's +/-3.0%.
>>
>>If a program wins the 200-game match with 53.5% (107-93) or more, you can say
>>with 80% reliability that it is stronger than its opponent.
>>
>>If it wins by only 53% (106-94) you can say it is better, but only with 70%
>>reliability.
>
>I don't believe this.
>
>If you know that one of them is 200 Elo points better than the other one, you
>could figure out which one very accurately based upon this 107 wins thing,
>because the better one would almost always win at least 107 out of 200.  But
>additionally you know that it would rarely lose 107 out of 200, which allows you
>to make even stronger assertions.
>
>If they are very close together, small fractions of an Elo point, if one wins
>107 times it tells you nothing.  If you have A and B, and they are the same
>strength, and you don't have the possibility of a draw, A will win 107 or more
>about 18% of the time, and so will B.
>
>As you increase the known strength difference between A and B, you can with more
>accuracy determine the strong one.
>
>For example, my experiments show that if there are about 11 Elo points between
>them (one wins 66/128 of the games), and you get a score of 107 or more from a
>200-game match, you'll mis-identify the stronger one only about 20% of the time.
>
>This corresponds with what you say, but if you decrease the difference to 5 Elo
>points (65/128), you'd misidentify the stronger one about 1/3 of the time.
>
>I'm not a big statistics guy, but I can do experiments.  I can't figure out how
>people can come up with these very exact comments that don't seem to correspond
>with reality.
>
>A big problem that I've never seen accounted for is draw percentage.  When I've
>done simulations, the draw percentage seems to make a big difference in the
>probability that a given outcome is due to chance.
>
>My experiments with a coin flipping simulation indicate that the following
>statements are more or less true (there is some probability of rounding error,
>since I'm using integer math), based upon a single 200-game match, which returns
>a score of at least 107 wins, at most 93 losses:
>
>"The apparently stronger side (the side that scored at least 107 wins) is 98%
>likely to be no worse than 40 Elo points weaker than the side that scored 93
>wins.  The apparently stronger side is 80% likely to be no worse than 15 Elo
>points weaker than the apparently weaker side."
>
>These are much weaker statements than you make, but I think they are all you can
>make.
>
>I think the title of this thread is very interesting.  Essentially what it's
>saying is that he will try for significance, but if he can't get it he is going
>to guess.  That seems like a fair way to find a winner in a match, but if you
>are trying to figure out which is stronger, a coin shouldn't be involved.
>
>When people do these matches, they want to find the winner, and they want to
>determine the best, but what always happens is:
>
>1) A short match is lopsided (and perhaps statistically significant!), so people
>claim that the result was due to luck and discard it.
>
>2) A longer match is very close (and statistically insignificant), so they take
>it as significant, because they think that more trials must mean more
>significance.
>
>What people want is to be able to make a strong statement with high confidence.
>People think that by doing more trials, they are automatically able to do this.
>But this is not true.  All running more trials allows you to do is make *some
>kind* of statement with more confidence.  It might be a weak statement.
>
>A match with fewer games *may* allow you to make the same (weak or strong)
>statement with the same degree of confidence, but people will never believe
>that.
>
>For example:
>
>1) Play 200 times.  If one wins 107 times or more, call that significant.
>
>2) Play 32 times.  If one wins 25 times or more, call that significant.
>
>My experiments indicate that these are about the same.  You'll be correct or
>incorrect about the same percentage of the time in each case, and this works
>regardless of which player is actually stronger and by how much.  The curves
>look very similar.
>
>The difference is that you are less likely to achieve 25 wins if you run 32
>trials, than you are to achieve 107 wins if you run 200 trials.  But if you do,
>it's no less significant.

I think that 25 out of 32 is more significant than 107 out of 200.
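
The comparison can be checked with an exact binomial tail. This is a minimal sketch, assuming the two programs are equal and there are no draws, so each game is a fair coin flip. The function name tail_at_least is made up for illustration.

```python
from math import comb


def tail_at_least(wins: int, games: int) -> float:
    """P(X >= wins) for X ~ Binomial(games, 1/2): equal programs, no draws."""
    favourable = sum(comb(games, k) for k in range(wins, games + 1))
    return favourable / 2 ** games


# Chance that one named program reaches each score purely by luck.
for wins, games in [(25, 32), (107, 200)]:
    print(f"{wins}/{games}: {tail_at_least(wins, games):.4f}")
```

Under these assumptions the 25/32 tail is about 0.001, while the 107/200 tail is about 0.18, the same figure quoted above for equal programs winning 107 or more.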

It is reasonable to run an experiment without a fixed number of games to decide
which program is stronger, but I think that stopping when the difference reaches
7 games is not a good rule.

A better rule is to stop if you get one of the following results, not counting
draws:

5-0,7-1,9-2,10-3,12-4,13-5,15-6,16-7,18-8,19-9,20-10,22-11,23-12,24-13,26-14,
27-15,28-16,30-17,31-18,32-19,33-20,35-21,36-22,37-23

and to stop when it is clear that none of these results is still possible (for
example, if the score is 28-24).

Each of these results (5-0, 7-1, ...) is chosen to identify the better program
with 95% confidence.
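
The list can be examined one score at a time. This sketch assumes that "95% confidence" refers to a one-sided binomial tail for a fixed match of that length between equal programs, which is an interpretation rather than something stated in the post. The names RESULTS and one_sided_tail are made up for illustration.

```python
from math import comb

# Stopping scores from the list above: (leader's wins, trailer's wins), draws ignored.
RESULTS = [
    (5, 0), (7, 1), (9, 2), (10, 3), (12, 4), (13, 5), (15, 6), (16, 7),
    (18, 8), (19, 9), (20, 10), (22, 11), (23, 12), (24, 13), (26, 14),
    (27, 15), (28, 16), (30, 17), (31, 18), (32, 19), (33, 20), (35, 21),
    (36, 22), (37, 23),
]


def one_sided_tail(wins: int, losses: int) -> float:
    """P(a named program scores at least `wins` in wins+losses fair games)."""
    games = wins + losses
    return sum(comb(games, k) for k in range(wins, games + 1)) / 2 ** games


for wins, losses in RESULTS:
    print(f"{wins}-{losses}: {one_sided_tail(wins, losses):.4f}")
```

For 5-0 this gives 1/32 and for 7-1 it gives 9/256. These are tails for a single fixed-length match, not for a match that is checked after every game.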

The practical confidence is smaller, and I do not know of a good way to calculate
it except by simulation.

The probability that a given program wins 5-0 is 1/32, so the probability of a
5-0 result between equal programs is 2/32, because either program can be the
winner.

The probability of a 7-1 result between equal programs is also less than 1/10,
but the probability of getting either 5-0 or 7-1 is bigger, and I did not
calculate it.
Calculating the probability of getting any one of the results 5-0, 7-1, ... is a
problem that I do not know how to solve except by simulation.
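
Such a simulation could look like the sketch below. It assumes equal programs, no draws, a check after every decisive game, and that "no result is possible" means both sides already have 24 or more wins (as in the 28-24 example). The names NEEDED, play_match and false_positive_rate, and the trial count and seed, are arbitrary choices for illustration.

```python
import random

# Wins the leader needs to stop the match, indexed by the trailer's wins (0..23).
NEEDED = [5, 7, 9, 10, 12, 13, 15, 16, 18, 19, 20, 22,
          23, 24, 26, 27, 28, 30, 31, 32, 33, 35, 36, 37]


def play_match(p_a: float, rng: random.Random):
    """Play decisive games until a listed score is hit; return 'A', 'B' or None."""
    a = b = 0
    # Once both sides have 24+ wins, no listed score can be reached any more.
    while min(a, b) < len(NEEDED):
        if rng.random() < p_a:
            a += 1
        else:
            b += 1
        if b < len(NEEDED) and a >= NEEDED[b]:
            return "A"
        if a < len(NEEDED) and b >= NEEDED[a]:
            return "B"
    return None


def false_positive_rate(trials: int, seed: int) -> float:
    """Fraction of matches between equal programs that still declare a winner."""
    rng = random.Random(seed)
    decided = sum(play_match(0.5, rng) is not None for _ in range(trials))
    return decided / trials


print(false_positive_rate(20000, seed=1))
```

The printed fraction estimates how often the whole stopping rule names a "better" program when the two are in fact equal. It cannot be lower than the 2/32 contributed by 5-0 alone, apart from sampling noise.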

Uri



Last modified: Thu, 15 Apr 21 08:11:13 -0700
