Computer Chess Club Archives



Subject: Re: I will continue the match until there is a difference of 7 games

Author: Peter Fendrich

Date: 14:22:51 12/20/00



On December 20, 2000 at 12:45:39, Christophe Theron wrote:

>On December 20, 2000 at 12:17:19, Uri Blass wrote:
>
>>On December 19, 2000 at 18:26:21, Bruce Moreland wrote:
>>
>>>On December 18, 2000 at 21:04:37, Christophe Theron wrote:
>>>
>>>>On December 18, 2000 at 17:43:43, Severi Salminen wrote:
>>>>
>>>>>On December 18, 2000 at 10:48:49, Jorge Pichard wrote:
>>>>>
>>>>>>On December 18, 2000 at 09:55:42, Severi Salminen wrote:
>>>>>>
>>>>>>>>I agree with you that 24 games isn't enough, but 200 games is not really
>>>>>>>>necessary if one of the two programs reaches a difference of over 7 games,
>>>>>>>>at which point I will stop the match. More likely this won't happen, since
>>>>>>>>these two programs are too evenly matched so far.
>>>>>>>
>>>>>>>I don't understand. Where do you get that 7? Are you saying that the result
>>>>>>>104-96 is significant? Or, even worse, 16-8 (this means nothing in practice)?
>>>>>>>Why not 8, 25 or 10056? I think there is no point in stopping when the
>>>>>>>difference reaches some fixed number. There _is_ a point in running a match
>>>>>>>with many games (500+). The closer the two programs are, the more games you
>>>>>>>need to show the true difference. Also, the learning abilities of both
>>>>>>>programs have to be taken into account. The chess community still seems to
>>>>>>>lack the knowledge of how to measure the strength difference between two
>>>>>>>programs...
>>>>>>>
>>>>>>>Severi
>>>>>>
>>>>>>Okay I will run this tourney up to 200 games, and will post the result as soon
>>>>>>as the tourney is over, or will Email the PGN games to anybody interested.
>>>>>
>>>>>That begins to sound interesting. A 200-game match still has some error margin,
>>>>>but we'll see a lot from that result. I'm looking forward to the results - not
>>>>>too often does someone run a 200+ game match here in CCC, thanks!
>>>>>
>>>>>Severi
>>>>
>>>>
>>>>
>>>>On 200 games, the margin of error for 80% reliability is +/-3.5%.
>>>>For 70% reliability it's +/-3.0%.
>>>>
>>>>If a program wins the 200-game match by 53.5% (107-93) or more, you can say
>>>>with 80% reliability that it is stronger than its opponent.
>>>>
>>>>If it wins by only 53% (106-94) you can say it is better, but only with 70%
>>>>reliability.
>>>
>>>I don't believe this.
>>>
>>>If you know that one of them is 200 Elo points better than the other one, you
>>>could figure out which one very accurately based upon this 107 wins thing,
>>>because the better one would almost always win at least 107 out of 200.  But
>>>additionally you know that it would rarely lose 107 out of 200, which allows you
>>>to make even stronger assertions.
>>>
>>>If they are very close together, small fractions of an Elo point, if one wins
>>>107 times it tells you nothing.  If you have A and B, and they are the same
>>>strength, and you don't have the possibility of a draw, A will win 107 or more
>>>about 18% of the time, and so will B.
>>>
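The 18% figure can be checked exactly instead of by simulation. A minimal sketch, assuming no draws and two equal players, so that each game is a fair coin flip; the function name tail_probability is only illustrative:

```python
from math import comb

def tail_probability(n, k, p=0.5):
    """Probability of at least k wins in n independent games, each won with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance that one particular side of two equal programs wins 107 or more of 200 games.
p_107 = tail_probability(200, 107)
print(f"P(at least 107 wins out of 200) = {p_107:.3f}")
```

The same chance applies to the other side, so under these assumptions a 107-93 score in either direction is not a rare event between equal programs.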
>>>As you increase the known strength difference between A and B, you can with more
>>>accuracy determine the strong one.
>>>
>>>For example, my experiments show that if there are about 11 Elo points between
>>>them (one wins 66/128 of the games), and you get a score of 107 or more from a
>>>200-game match, you'll mis-identify the stronger one only about 20% of the time.
>>>
>>>This corresponds with what you say, but if you decrease the difference to 5 Elo
>>>points (65/128), you'd misidentify the stronger one about 1/3 of the time.
>>>
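The link between a win fraction and an Elo difference can be sketched with the usual logistic Elo formula, where an expected score p corresponds to a difference of 400*log10(p/(1-p)). This is an assumption: the post does not say which conversion was used, and the function names are illustrative.

```python
from math import log10

def elo_diff(score):
    """Elo difference implied by an expected score strictly between 0 and 1."""
    return 400 * log10(score / (1 - score))

def expected_score(diff):
    """Expected score per game for a player rated `diff` Elo points above the opponent."""
    return 1 / (1 + 10 ** (-diff / 400))

# Win fractions near 1/2, expressed in 128ths as in the post.
for wins in (65, 66, 67):
    print(f"{wins}/128 -> {elo_diff(wins / 128):.1f} Elo")
```

Near an even score, each extra 1/128 of win fraction is worth roughly 5.4 Elo points under this formula.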
>>>I'm not a big statistics guy, but I can do experiments.  I can't figure out how
>>>people can come up with these very exact comments that don't seem to correspond
>>>with reality.
>>>
>>>A big problem that I've never seen accounted for is draw percentage.  When I've
>>>done simulations, the draw percentage seems to make a big difference in the
>>>probability that a given outcome is due to chance.
>>>
>>>My experiments with a coin flipping simulation indicate that the following
>>>statements are more or less true (there is some probability of rounding error,
>>>since I'm using integer math), based upon a single 200-game match, which returns
>>>a score of at least 107 wins, at most 93 losses:
>>>
>>>"The apparently stronger side (the side that scored at least 107 wins) is 98%
>>>likely to be no worse than 40 Elo points weaker than the side that scored 93
>>>wins.  The apparently stronger side is 80% likely to be no worse than 15 Elo
>>>points weaker than the apparently weaker side."
>>>
>>>These are much weaker statements than you make, but I think they are all you can
>>>make.
>>>
>>>I think the title of this thread is very interesting.  Essentially what it's
>>>saying is that he will try for significance, but if he can't get it he is going
>>>to guess.  That seems like a fair way to find a winner in a match, but if you
>>>are trying to figure out which is stronger, a coin shouldn't be involved.
>>>
>>>When people do these matches, they want to find the winner, and they want to
>>>determine the best, but what always happens is:
>>>
>>>1) A short match is lopsided (and perhaps statistically significant!), so people
>>>claim that the result was due to luck and discard it.
>>>
>>>2) A longer match is very close (and statistically insignificant), so they take
>>>it as significant, because they think that more trials must mean more
>>>significance.
>>>
>>>What people want is to be able to make a strong statement with high confidence.
>>>People think that by doing more trials, they are automatically able to do this.
>>>But this is not true.  All running more trials allows you to do is make *some
>>>kind* of statement with more confidence.  It might be a weak statement.
>>>
>>>A match with fewer games *may* allow you to make the same (weak or strong)
>>>statement with the same degree of confidence, but people will never believe
>>>that.
>>>
>>>For example:
>>>
>>>1) Play 200 times.  If one wins 107 times or more, call that significant.
>>>
>>>2) Play 32 times.  If one wins 25 times or more, call that significant.
>>>
>>>My experiments indicate that these are about the same.  You'll be correct or
>>>incorrect about the same percentage of the time in each case, and this works
>>>regardless of which player is actually stronger and by how much.  The curves
>>>look very similar.
>>>
>>>The difference is that you are less likely to achieve 25 wins if you run 32
>>>trials, than you are to achieve 107 wins if you run 200 trials.  But if you do,
>>>it's no less significant.
>>
>>I think that 25 out of 32 is more significant than 107 out of 200.
>>
>>It is logical to do an experiment without a fixed number of games to decide
>>which program is stronger but I think that the rule to stop when the difference
>>is 7 games is not a good rule.
>>
>>A better rule is to stop if you get one of the following results, not counting
>>draws:
>>
>>5-0,7-1,9-2,10-3,12-4,13-5,15-6,16-7,18-8,19-9,20-10,22-11,23-12,24-13,26-14
>>27-15,28-16,30-17,31-18,32-19,33-20,35-21,36-22,37-23
>>
>>and stop when it is clear that none of these results is still possible (for
>>example if the result is 28-24).
>>
>>Each of these results (5-0, 7-1, ...) on its own shows with 95% confidence that
>>the leading program is better.
>>
>>The practical confidence is smaller and I do not know of a good way to calculate
>>it except simulation.
>>
>>The probability of a 5-0 result for one particular program is 1/32, so the
>>probability of a 5-0 result between equal programs is 2/32, because either
>>program can be the winner.
>>
>>The probability of getting 7-1 between equal programs is also less than 1/10,
>>but the probability of getting one of the results 5-0 or 7-1 is bigger, and I
>>did not calculate it.
>>Calculating the probability of getting one of the results 5-0, 7-1, ... is a
>>problem that I do not know how to solve except by simulation.
>>
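The combined probability can also be computed exactly, by following every possible score through the match and removing the probability mass that hits a stopping result. A sketch under stated assumptions: draws are ignored, every decisive game is independent, and program A wins each one with probability p_a. The names STOP_SCORES, MAX_GAMES and stop_probabilities are made up for this example.

```python
# Stopping results from the list above, written as (winner's wins, loser's wins).
STOP_SCORES = {
    (5, 0), (7, 1), (9, 2), (10, 3), (12, 4), (13, 5), (15, 6), (16, 7),
    (18, 8), (19, 9), (20, 10), (22, 11), (23, 12), (24, 13), (26, 14),
    (27, 15), (28, 16), (30, 17), (31, 18), (32, 19), (33, 20), (35, 21),
    (36, 22), (37, 23),
}
MAX_GAMES = 60  # 37 + 23, the longest result in the list

def stop_probabilities(p_a=0.5):
    """Return (P(stop with A ahead), P(stop with B ahead)) for the rule above."""
    live = {(0, 0): 1.0}  # score -> probability of reaching it without having stopped
    a_stops = b_stops = 0.0
    for _ in range(MAX_GAMES):
        nxt = {}
        for (a, b), prob in live.items():
            # Play one more decisive game: either A wins it or B wins it.
            for score, q in (((a + 1, b), p_a), ((a, b + 1), 1 - p_a)):
                if score in STOP_SCORES:
                    a_stops += prob * q
                elif (score[1], score[0]) in STOP_SCORES:
                    b_stops += prob * q
                else:
                    nxt[score] = nxt.get(score, 0.0) + prob * q
        live = nxt
    return a_stops, b_stops

a_stops, b_stops = stop_probabilities(0.5)
print(f"equal programs: P(some program is declared better) = {a_stops + b_stops:.3f}")
```

With p_a = 0.5 the sum of the two values is the chance that the rule declares a winner between equal programs. Whatever mass is still live after 60 decisive games corresponds to matches where none of the listed results was reached.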
>>Uri
>
>
>Your rule of stopping when you get one of the "significant" results you have
>listed says approximately the same thing as my "reliability of matches" table.
>
>The main point is that, for a given confidence level, you can compute a table
>giving, for each number of games played, the smallest winning percentage that
>is enough to call the match significant (once again: within the chosen
>confidence level).
>
>This table should definitely be published in the CCC resource center.
>
>The problem is that computing this table is not easy, at least for me. You have
>to know the relevant formulas, and I actually do not know them.

In fact it's quite easy...
The trick is to approximate the distribution of game results by the bell curve
(the normal distribution), and to do that you need "enough" games. I would say at
least 20, but that is a very low figure and will introduce errors beyond those
presented below.

First: it is better to use the actual results (wins, draws and losses) instead of
the win% only. That gives some more information. For example, the result 50-50
could be 100 draws or 50 wins + 50 losses. The 100 draws is a more stable
(confident) result. Now to the formulas:
c is chosen to get the right confidence level (more on that later)
n is the number of games
m is the "win percentage" (between 0.0 and 1.0), i.e. m = (W + 0.5*D)/n
W, D, L are the numbers of wins, draws and losses respectively, for one of the
two programs
I will use SQRT() for "square root" and ** for "to the power of"

Start by computing the standard deviation s:
s = SQRT( ( W*(1-m)**2 + D*(0.5-m)**2 + L*(0-m)**2 )/(n-1) )
Now use s to compute A:
A = c*s/SQRT(n)
By choosing different values of c we get different confidence levels:
1.28 gives 80%, 1.96 gives 95%, 2.58 gives 99%, etc.
These c-values can be found everywhere in the literature and on the web
where formulas for the bell curve are presented.

Once we have computed A, we can say with 95% confidence (if we used c=1.96)
that the selected program will score between m-A and m+A points per game
against the other program in the long run.
So, if Tiger wins 28 games, draws 4 and loses 18 in a 50 game match against
the famous program Y we would get:
W=28, D=4, L=18 and n=50
m = (28 + 0.5*4)/50 = 0.6
s ~ 0.47
A ~ 0.13 if we used c=1.96
The interval is between 0.6-0.13=0.47 and 0.6+0.13=0.73
With 95% confidence we "know" that Tiger will get
between 0.47 and 0.73 points against Y in each game in the long run.
Since this interval includes 0.5, the result isn't enough to claim that Tiger is
the better program. (I'm sorry - play more games!)
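The formulas above translate directly into a few lines of code. A minimal sketch in Python; the function name score_interval is made up for this example, and m is taken as (W + 0.5*D)/n, which matches m = 0.6 in the example above:

```python
from math import sqrt

def score_interval(wins, draws, losses, c=1.96):
    """Return (m, s, A) for a match result, using the bell-curve approximation."""
    n = wins + draws + losses
    m = (wins + 0.5 * draws) / n  # average points per game
    # Sample standard deviation of the per-game scores 1, 0.5 and 0.
    s = sqrt((wins * (1 - m) ** 2 + draws * (0.5 - m) ** 2 + losses * (0 - m) ** 2) / (n - 1))
    a = c * s / sqrt(n)           # half-width of the confidence interval
    return m, s, a

# The Tiger example: 28 wins, 4 draws, 18 losses.
m, s, a = score_interval(28, 4, 18)
print(f"m = {m:.2f}, s = {s:.2f}, A = {a:.2f}")    # m = 0.60, s = 0.47, A = 0.13
print(f"interval: {m - a:.2f} to {m + a:.2f}")     # interval: 0.47 to 0.73

# Same 50% score, very different certainty: no draws versus all draws.
print(score_interval(50, 0, 50)[2], score_interval(0, 100, 0)[2])
```

The last line illustrates why the draws should be counted separately: with 100 draws s is 0 and so is A, while 50 wins and 50 losses give a clearly wider interval.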
//Peter
PS. I will be on vacation for a week from now and can't answer, during that week,
whatever comes up from this message.
I really hope that no error has sneaked into my text here...




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.