Computer Chess Club Archives


Subject: Re: I will continue the match until there is a difference of 7 games

Author: Uri Blass

Date: 16:34:34 12/20/00



On December 20, 2000 at 17:22:51, Peter Fendrich wrote:

>On December 20, 2000 at 12:45:39, Christophe Theron wrote:
>
>>On December 20, 2000 at 12:17:19, Uri Blass wrote:
>>
>>>On December 19, 2000 at 18:26:21, Bruce Moreland wrote:
>>>
>>>>On December 18, 2000 at 21:04:37, Christophe Theron wrote:
>>>>
>>>>>On December 18, 2000 at 17:43:43, Severi Salminen wrote:
>>>>>
>>>>>>On December 18, 2000 at 10:48:49, Jorge Pichard wrote:
>>>>>>
>>>>>>>On December 18, 2000 at 09:55:42, Severi Salminen wrote:
>>>>>>>
>>>>>>>>>I agree with you that 24 games isn't enough, but 200 games is not really
>>>>>>>>>necessary if one of the two programs reaches a difference of over 7 games, at
>>>>>>>>>which point I will stop the match. More likely this won't happen, since these
>>>>>>>>>two programs are too evenly matched so far.
>>>>>>>>
>>>>>>>>I don't understand. Where do you get that 7? Are you saying that the result
>>>>>>>>104-96 is significant? Or, even worse, 16-8 (which means nothing in practice)?
>>>>>>>>Why not 8, 25 or 10056? I think there is no point in stopping just because the
>>>>>>>>difference reaches some particular number. There _is_ a point in running a
>>>>>>>>match with many games (500+). The closer the two programs are, the more games
>>>>>>>>you need to show the true difference. The learning abilities of both programs
>>>>>>>>also have to be taken into account. The chess community still seems to lack
>>>>>>>>the knowledge of how to measure the strength difference between two programs...
>>>>>>>>
>>>>>>>>Severi
>>>>>>>
>>>>>>>Okay I will run this tourney up to 200 games, and will post the result as soon
>>>>>>>as the tourney is over, or will Email the PGN games to anybody interested.
>>>>>>
>>>>>>That begins to sound interesting. A 200-game match still has some error margin,
>>>>>>but we'll see a lot from that result. I'm looking forward to the results; it's
>>>>>>not often that someone runs a 200+ game match here in CCC, thanks!
>>>>>>
>>>>>>Severi
>>>>>
>>>>>
>>>>>
>>>>>On 200 games, the margin of error for 80% reliability is +/-3.5%.
>>>>>For 70% reliability it's +/-3.0%.
>>>>>
>>>>>If a program wins the 200-game match by 53.5% (107-93) or more, you can say
>>>>>with 80% reliability that it is stronger than its opponent.
>>>>>
>>>>>If it wins by only 53% (106-94) you can say it is better, but only with 70%
>>>>>reliability.
>>>>
>>>>I don't believe this.
>>>>
>>>>If you know that one of them is 200 Elo points better than the other one, you
>>>>could figure out which one very accurately based upon this 107 wins thing,
>>>>because the better one would almost always win at least 107 out of 200.  But
>>>>additionally you know that it would rarely lose 107 out of 200, which allows you
>>>>to make even stronger assertions.
>>>>
>>>>If they are very close together, within small fractions of an Elo point, one of
>>>>them winning 107 times tells you nothing.  If you have A and B, and they are the
>>>>same strength, and you don't have the possibility of a draw, A will win 107 or
>>>>more about 18% of the time, and so will B.
>>>>
>>>>As you increase the known strength difference between A and B, you can with more
>>>>accuracy determine the strong one.
>>>>
>>>>For example, my experiments show that if there are about 11 Elo points between
>>>>them (one wins 66/128 of the games), and you get a score of 107 or more from a
>>>>200-game match, you'll mis-identify the stronger one only about 20% of the time.
>>>>
>>>>This corresponds with what you say, but if you decrease the difference to 5 Elo
>>>>points (65/128), you'd misidentify the stronger one about 1/3 of the time.
>>>>
>>>>I'm not a big statistics guy, but I can do experiments.  I can't figure out how
>>>>people can come up with these very exact comments that don't seem to correspond
>>>>with reality.
>>>>
>>>>A big problem that I've never seen accounted for is draw percentage.  When I've
>>>>done simulations, the draw percentage seems to make a big difference in the
>>>>probability that a given outcome is due to chance.
>>>>
>>>>My experiments with a coin flipping simulation indicate that the following
>>>>statements are more or less true (there is some probability of rounding error,
>>>>since I'm using integer math), based upon a single 200-game match, which returns
>>>>a score of at least 107 wins, at most 93 losses:
>>>>
>>>>"The apparently stronger side (the side that scored at least 107 wins) is 98%
>>>>likely to be no worse than 40 Elo points weaker than the side that scored 93
>>>>wins.  The apparently stronger side is 80% likely to be no worse than 15 Elo
>>>>points weaker than the apparently weaker side."
>>>>
>>>>These are much weaker statements than you make, but I think they are all you can
>>>>make.
>>>>
>>>>I think the title of this thread is very interesting.  Essentially what it's
>>>>saying is that he will try for significance, but if he can't get it he is going
>>>>to guess.  That seems like a fair way to find a winner in a match, but if you
>>>>are trying to figure out which is stronger, a coin shouldn't be involved.
>>>>
>>>>When people do these matches, they want to find the winner, and they want to
>>>>determine the best, but what always happens is:
>>>>
>>>>1) A short match is lopsided (and perhaps statistically significant!), so people
>>>>claim that the result was due to luck and discard it.
>>>>
>>>>2) A longer match is very close (and statistically insignificant), so they take
>>>>it as significant, because they think that more trials must mean more
>>>>significance.
>>>>
>>>>What people want is to be able to make a strong statement with high confidence.
>>>>People think that by doing more trials, they are automatically able to do this.
>>>>But this is not true.  All running more trials allows you to do is make *some
>>>>kind* of statement with more confidence.  It might be a weak statement.
>>>>
>>>>A match with fewer games *may* allow you to make the same (weak or strong)
>>>>statement with the same degree of confidence, but people will never believe
>>>>that.
>>>>
>>>>For example:
>>>>
>>>>1) Play 200 times.  If one wins 107 times or more, call that significant.
>>>>
>>>>2) Play 32 times.  If one wins 25 times or more, call that significant.
>>>>
>>>>My experiments indicate that these are about the same.  You'll be correct or
>>>>incorrect about the same percentage of the time in each case, and this works
>>>>regardless of which player is actually stronger and by how much.  The curves
>>>>look very similar.
>>>>
>>>>The difference is that you are less likely to achieve 25 wins if you run 32
>>>>trials, than you are to achieve 107 wins if you run 200 trials.  But if you do,
>>>>it's no less significant.
>>>
>>>I think that 25 out of 32 is more significant than 107 out of 200.
>>>
>>>It is logical to do an experiment without a fixed number of games to decide
>>>which program is stronger, but I think that the rule of stopping when the
>>>difference reaches 7 games is not a good one.
>>>
>>>A better rule is to stop if you get one of the following results, not counting
>>>draws:
>>>
>>>5-0, 7-1, 9-2, 10-3, 12-4, 13-5, 15-6, 16-7, 18-8, 19-9, 20-10, 22-11, 23-12,
>>>24-13, 26-14, 27-15, 28-16, 30-17, 31-18, 32-19, 33-20, 35-21, 36-22, 37-23
>>>
>>>and to stop when it is clear that none of these results can still be reached
>>>(for example, if the score is 28-24).
>>>
>>>The results (5-0, 7-1, ...) are the scores at which you can say which program
>>>is better with 95% confidence.
>>>
>>>The practical confidence of the whole rule is smaller, and I do not know of a
>>>good way to calculate it except by simulation.
>>>
>>>The probability of a 5-0 score for a given program is 1/32, which means that
>>>the probability of some 5-0 score between equal programs is 2/32, because
>>>either program can be the winner.
>>>
>>>The probability of a 7-1 score between equal programs is also less than 1/10,
>>>but the probability of getting either 5-0 or 7-1 is bigger, and I did not
>>>calculate it. Calculating the probability of reaching any one of the results
>>>5-0, 7-1, ... is a problem that I do not know how to solve except by
>>>simulation.
>>>
>>>Uri
>>
>>
>>Your rule of stopping when you get one of the "significant" results you have
>>listed says approximately the same thing as my "reliability of matches" table.
>>
>>The main point is that, for a given confidence, you can compute a table giving
>>the smallest winning percentage, depending on the number of games played, that
>>is enough to say that the match is significant (once again: within the chosen
>>confidence level).
>>
>>This table should definitely be published in the CCC resource center.
>>
>>The problem is that computing this table is not easy, at least for me. You have
>>to know the relevant formulas, and I actually do not know them.
>
>In fact it's quite easy...
>The trick is to approximate the game results with the bell curve (the normal
>distribution), and in order to do that you need "enough" games. I would say at
>least 20, but that is a very low figure and will generate other errors besides
>those presented below.

1) I think it is better to use the binomial distribution to calculate the
tables, because the normal distribution is only an approximation, and it is a
good approximation only when you have enough games (the approximation gets
better as the number of games grows).
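
To make this concrete, here is a small Python sketch (just an illustration; the
function names are made up for this example). It computes the chance that one
particular program scores at least 107 out of 200 decisive games against an
equally strong opponent, exactly from the binomial distribution and again from
the normal approximation:

from math import comb, erf, sqrt

def binom_tail(n, w):
    # Exact probability that a given program wins at least w of n decisive
    # games against an equal opponent (draws ignored): X ~ Binomial(n, 1/2).
    return sum(comb(n, k) for k in range(w, n + 1)) / 2 ** n

def normal_tail(n, w):
    # The same tail probability from the normal approximation
    # (mean n/2, standard deviation sqrt(n)/2, with continuity correction).
    z = (w - 0.5 - n / 2) / (sqrt(n) / 2)
    return 0.5 * (1 - erf(z / sqrt(2)))

# 107 or more out of 200: both give roughly 0.18, the "about 18% of the
# time" figure quoted above.
print(binom_tail(200, 107), normal_tail(200, 107))

# 7-1, one of the results in the list quoted above: the exact value is
# 9/256 (about 0.035); the approximation is noticeably less accurate here.
print(binom_tail(8, 7), normal_tail(8, 7))

The exact calculation costs nothing extra at these match lengths, which is why
I see no reason to prefer the approximation.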

2) The tables based on the normal distribution are only correct if you use a
fixed number of games, and I think it is more logical not to use a fixed number
of games.

In practice you will prefer to stop a match after fewer than 20 games if you
see a significant result (if you see 15-0, it seems a waste of time to continue,
and it is more logical to stop even before the score reaches 15-0), so it is
more logical not to use a fixed number of games.
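
As a rough illustration of why a fixed difference of 7 games (as in the title
of this thread) does not mean much by itself, a small sketch along the same
lines (only an illustration; the scores below are made up to show the effect)
shows how often equal programs would produce a lead at least this big among the
decisive games:

from math import comb

def lead_p_value(wins, losses):
    # One-sided probability that a program of EQUAL strength to its opponent
    # scores at least `wins` of the wins + losses decisive games (draws ignored).
    d = wins + losses
    return sum(comb(d, k) for k in range(wins, d + 1)) / 2 ** d

# The same 7-game lead at different stages of a match:
print(lead_p_value(7, 0))     # 7-0:   1/128, very convincing
print(lead_p_value(15, 8))    # 15-8:  about 0.10, not significant at 95%
print(lead_p_value(50, 43))   # 50-43: well above 0.05, nothing unusual
                              #        between equal programs

The number of decisive games matters, not the lead alone, which is why I prefer
a list of full scores to a single "difference of 7" rule.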

I think that tables giving a list of results at which to stop the match,
together with the level of confidence for that whole list, are better.

The level of confidence for the list (assuming the list contains results that
each have the same individual confidence) is clearly lower than the level of
confidence of any single result inside the list, and you have almost no
confidence left if the list is big enough; but "big enough" can mean something
like 1,000,000 games, so it is not a practical problem.

You can calculate the confidence that you have for a list of results with a
simulation program, and I do not know of a better way to do it.
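
For example, a simulation along these lines (only a sketch; the cap of 60
decisive games and the trial count are arbitrary choices for illustration)
plays out matches between two equal programs, stops as soon as the score
reaches one of the individually "95% significant" results (5-0, 7-1, 9-2, ...),
and reports how often such a stop happens even though the programs are equal:

import random
from math import comb

def min_wins_for_95(d):
    # Smallest w such that P(X >= w) <= 0.05 for X ~ Binomial(d, 1/2),
    # i.e. the first score after d decisive games that is individually
    # "95% significant"; this reproduces the thresholds behind the list
    # 5-0, 7-1, 9-2, 10-3, ...
    for w in range(d // 2 + 1, d + 1):
        if sum(comb(d, k) for k in range(w, d + 1)) / 2 ** d <= 0.05:
            return w
    return d + 1  # no score of this length is significant on its own

def false_stop_rate(max_decisive=60, trials=100_000, seed=1):
    # Fraction of matches between EQUAL programs (draws ignored) that hit
    # one of the stopping scores within max_decisive decisive games.
    rng = random.Random(seed)
    need = [min_wins_for_95(d) for d in range(max_decisive + 1)]
    stops = 0
    for _ in range(trials):
        wins = 0
        for d in range(1, max_decisive + 1):
            if rng.random() < 0.5:
                wins += 1
            if wins >= need[d] or (d - wins) >= need[d]:
                stops += 1
                break
    return stops / trials

print(false_stop_rate())

Already the very first stopping chance (5-0 or 0-5) has probability 2/32, so
the rate that comes out is clearly above the 5% that each single result
suggests; that is exactly the sense in which the confidence of the whole list
is lower.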

Uri


