Author: Bruce Moreland
Date: 15:26:21 12/19/00
On December 18, 2000 at 21:04:37, Christophe Theron wrote:

>On December 18, 2000 at 17:43:43, Severi Salminen wrote:
>
>>On December 18, 2000 at 10:48:49, Jorge Pichard wrote:
>>
>>>On December 18, 2000 at 09:55:42, Severi Salminen wrote:
>>>
>>>>>I agree with you that 24 games isn't enough, but 200 games is not really
>>>>>necessary if one of the two programs reach a difference of over 7 games, in
>>>>>which at that point I will stop the match. More likely this won't happen since
>>>>>these two programs are too evenly match so far.
>>>>
>>>>I don't understand. Where do you get that 7? Are you saying that the result
>>>>104-96 is significant? Or, even worse, 16-8 (this means nothing in practice)?
>>>>Why not 8, 25 or 10056? I think there is no point to stop when difference is
>>>>something. There _is_ a point to run a match with many games (500+). The closer
>>>>the two programs are the more games you need to show the true difference. Also
>>>>the learning abilities of both programs have to be taken in account. The chess
>>>>community still seems to lack the knowledge on how to measure the strenght
>>>>difference between two programs...
>>>>
>>>>Severi
>>>
>>>Okay I will run this tourney up to 200 games, and will post the result as soon
>>>as the tourney is over, or will Email the PGN games to anybody interested.
>>
>>That begins to sound interesting. 200 games match still has some error margins
>>but we'll see a lot from that result. I'm looking forward for the results - not
>>too often someone runs a 200+ match here in CCC, thanks!
>>
>>Severi
>
>
>
>On 200 games, the margin of error for 80% reliability is +/-3.5%.
>For 70% reliability it's +/-3.0%.
>
>If a program wins the 200 games match by 53.5% (107-93) or more, you can say
>with 80% relability that it is stronger than its opponent.
>
>If it wins by only 53% (106-94) you can say it is better, but only with 70%
>reliability.

I don't believe this.
If you know that one of them is 200 Elo points better than the other one, you could figure out which one very accurately based upon this 107 wins thing, because the better one would almost always win at least 107 out of 200. But additionally you know that it would rarely lose 107 out of 200, which allows you to make even stronger assertions.

If they are very close together, small fractions of an Elo point, then one of them winning 107 times tells you nothing.

If you have A and B, and they are the same strength, and you don't have the possibility of a draw, A will win 107 or more about 18% of the time, and so will B.

As you increase the known strength difference between A and B, you can determine the stronger one with more accuracy. For example, my experiments show that if there are about 11 Elo points between them (one wins 66/128 of the games), and you get a score of 107 or more from a 200-game match, you'll misidentify the stronger one only about 20% of the time. This corresponds with what you say, but if you decrease the difference to 5 Elo points (65/128), you'd misidentify the stronger one about 1/3 of the time.

I'm not a big statistics guy, but I can do experiments. I can't figure out how people can come up with these very exact comments that don't seem to correspond with reality.

A big problem that I've never seen accounted for is draw percentage. When I've done simulations, the draw percentage seems to make a big difference in the probability that a given outcome is due to chance.

My experiments with a coin flipping simulation indicate that the following statements are more or less true (there is some probability of rounding error, since I'm using integer math), based upon a single 200-game match, which returns a score of at least 107 wins, at most 93 losses:

"The apparently stronger side (the side that scored at least 107 wins) is 98% likely to be no worse than 40 Elo points weaker than the side that scored 93 wins.

The apparently stronger side is 80% likely to be no worse than 15 Elo points weaker than the apparently weaker side."

These are much weaker statements than you make, but I think they are all you can make.

I think the title of this thread is very interesting. Essentially what it's saying is that he will try for significance, but if he can't get it he is going to guess. That seems like a fair way to find a winner in a match, but if you are trying to figure out which is stronger, a coin shouldn't be involved.

When people do these matches, they want to find the winner, and they want to determine the best, but what always happens is:

1) A short match is lopsided (and perhaps statistically significant!), so people claim that the result was due to luck and discard it.

2) A longer match is very close (and statistically insignificant), so they take it as significant, because they think that more trials must mean more significance.

What people want is to be able to make a strong statement with high confidence. People think that by doing more trials, they are automatically able to do this. But this is not true. All running more trials allows you to do is make *some kind* of statement with more confidence. It might be a weak statement. A match with fewer games *may* allow you to make the same (weak or strong) statement with the same degree of confidence, but people will never believe that.

For example:

1) Play 200 times. If one wins 107 times or more, call that significant.

2) Play 32 times. If one wins 25 times or more, call that significant.

My experiments indicate that these are about the same. You'll be correct or incorrect about the same percentage of the time in each case, and this works regardless of which player is actually stronger and by how much. The curves look very similar.

The difference is that you are less likely to achieve 25 wins if you run 32 trials than you are to achieve 107 wins if you run 200 trials. But if you do, it's no less significant.
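For anyone who wants to check the no-draw coin-flip figures above, here is a minimal sketch in Python. It is not the integer-math simulation described in this post. It swaps in the exact binomial distribution and assumes the usual logistic Elo formula, no draws, and independent games. The function names (elo_diff, tail, misidentify_rate) are made up for illustration. The 65/128 case is included because it is the win fraction that works out to roughly 5 Elo under that formula.

```python
from math import comb, log10


def elo_diff(p):
    """Elo difference implied by an expected score p (standard logistic formula)."""
    return 400 * log10(p / (1 - p))


def tail(n, k, p):
    """Probability of at least k wins in n independent games won with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def misidentify_rate(n, k, p):
    """Given that one side reached k wins in n games, the chance that it is
    actually the weaker side. p is the stronger side's per-game win
    probability. Assumes no draws and 2*k > n, so only one side can reach k."""
    strong = tail(n, k, p)      # stronger side reaches the threshold
    weak = tail(n, k, 1 - p)    # weaker side reaches the threshold
    return weak / (strong + weak)


if __name__ == "__main__":
    # Equal strength: how often does a given side reach 107 of 200?
    print(f"P(A >= 107 of 200), equal strength: {tail(200, 107, 0.5):.3f}")
    # Small real differences: how often does the 107-win side turn out weaker?
    for wins in (66, 65):
        p = wins / 128
        print(
            f"{wins}/128 = {elo_diff(p):.1f} Elo, "
            f"misidentify rate at 107/200: {misidentify_rate(200, 107, p):.3f}"
        )
```

Adding a draw probability would turn each game into a three-outcome trial, and this sketch does not attempt that, so it says nothing about the draw-percentage effect mentioned above.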
bruce

>You see that when the programs are very close you need a very large number of
>games to determine which is the best.
>
>On the other hand, if there is a significant difference before you reach 200
>games, it is possible to say which is the best without playing the 200 games.
>
>
>
>    Christophe
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.