Author: Uri Blass
Date: 09:17:19 12/20/00
On December 19, 2000 at 18:26:21, Bruce Moreland wrote:
>On December 18, 2000 at 21:04:37, Christophe Theron wrote:
>
>>On December 18, 2000 at 17:43:43, Severi Salminen wrote:
>>
>>>On December 18, 2000 at 10:48:49, Jorge Pichard wrote:
>>>
>>>>On December 18, 2000 at 09:55:42, Severi Salminen wrote:
>>>>
>>>>>>I agree with you that 24 games isn't enough, but 200 games is not really
>>>>>>necessary if one of the two programs reach a difference of over 7 games, in
>>>>>>which at that point I will stop the match. More likely this won't happen since
>>>>>>these two programs are too evenly match so far.
>>>>>
>>>>>I don't understand. Where do you get that 7? Are you saying that the result
>>>>>104-96 is significant? Or, even worse, 16-8 (this means nothing in practice)?
>>>>>Why not 8, 25 or 10056? I think there is no point to stop when difference is
>>>>>something. There _is_ a point to run a match with many games (500+). The closer
>>>>>the two programs are the more games you need to show the true difference. Also
>>>>>the learning abilities of both programs have to be taken in account. The chess
>>>>>community still seems to lack the knowledge on how to measure the strength
>>>>>difference between two programs...
>>>>>
>>>>>Severi
>>>>
>>>>Okay I will run this tourney up to 200 games, and will post the result as soon
>>>>as the tourney is over, or will Email the PGN games to anybody interested.
>>>
>>>That begins to sound interesting. 200 games match still has some error margins
>>>but we'll see a lot from that result. I'm looking forward for the results - not
>>>too often someone runs a 200+ match here in CCC, thanks!
>>>
>>>Severi
>>
>>
>>
>>On 200 games, the margin of error for 80% reliability is +/-3.5%.
>>For 70% reliability it's +/-3.0%.
>>
>>If a program wins the 200 games match by 53.5% (107-93) or more, you can say
>>with 80% reliability that it is stronger than its opponent.
>>
>>If it wins by only 53% (106-94) you can say it is better, but only with 70%
>>reliability.
>
>I don't believe this.
>
>If you know that one of them is 200 Elo points better than the other one, you
>could figure out which one very accurately based upon this 107 wins thing,
>because the better one would almost always win at least 107 out of 200. But
>additionally you know that it would rarely lose 107 out of 200, which allows you
>to make even stronger assertions.
>
>If they are very close together, small fractions of an Elo point, if one wins
>107 times it tells you nothing. If you have A and B, and they are the same
>strength, and you don't have the possibility of a draw, A will win 107 or more
>about 18% of the time, and so will B.
>
>As you increase the known strength difference between A and B, you can with more
>accuracy determine the strong one.
>
>For example, my experiments show that if there are about 11 Elo points between
>them (one wins 66/128 of the games), and you get a score of 107 or more from a
>200-game match, you'll mis-identify the stronger one only about 20% of the time.
>
>This corresponds with what you say, but if you decrease the difference to 5 Elo
>points (67/128), you'd misidentify the stronger one about 1/3 of the time.
>
>I'm not a big statistics guy, but I can do experiments. I can't figure out how
>people can come up with these very exact comments that don't seem to correspond
>with reality.
>
>A big problem that I've never seen accounted for is draw percentage. When I've
>done simulations, the draw percentage seems to make a big difference in the
>probability that a given outcome is due to chance.
>
>My experiments with a coin flipping simulation indicate that the following
>statements are more or less true (there is some probability of rounding error,
>since I'm using integer math), based upon a single 200-game match, which returns
>a score of at least 107 wins, at most 93 losses:
>
>"The apparently stronger side (the side that scored at least 107 wins) is 98%
>likely to be no worse than 40 Elo points weaker than the side that scored 93
>wins. The apparently stronger side is 80% likely to be no worse than 15 Elo
>points weaker than the apparently weaker side."
>
>These are much weaker statements than you make, but I think they are all you can
>make.
>
>I think the title of this thread is very interesting. Essentially what it's
>saying is that he will try for significance, but if he can't get it he is going
>to guess. That seems like a fair way to find a winner in a match, but if you
>are trying to figure out which is stronger, a coin shouldn't be involved.
>
>When people do these matches, they want to find the winner, and they want to
>determine the best, but what always happens is:
>
>1) A short match is lopsided (and perhaps statistically significant!), so people
>claim that the result was due to luck and discard it.
>
>2) A longer match is very close (and statistically insignificant), so they take
>it as significant, because they think that more trials must mean more
>significance.
>
>What people want is to be able to make a strong statement with high confidence.
>People think that by doing more trials, they are automatically able to do this.
>But this is not true. All running more trials allows you to do is make *some
>kind* of statement with more confidence. It might be a weak statement.
>
>A match with fewer games *may* allow you to make the same (weak or strong)
>statement with the same degree of confidence, but people will never believe
>that.
>
>For example:
>
>1) Play 200 times. If one wins 107 times or more, call that significant.
>
>2) Play 32 times. If one wins 25 times or more, call that significant.
>
>My experiments indicate that these are about the same. You'll be correct or
>incorrect about the same percentage of the time in each case, and this works
>regardless of which player is actually stronger and by how much. The curves
>look very similar.
>
>The difference is that you are less likely to achieve 25 wins if you run 32
>trials, than you are to achieve 107 wins if you run 200 trials. But if you do,
>it's no less significant.

I think that 25 out of 32 is more significant than 107 out of 200.

It is logical to run an experiment without a fixed number of games to decide
which program is stronger, but I think that stopping when the difference
reaches 7 games is not a good rule.

A better rule is to stop as soon as you get one of the following results
(draws are not counted):

5-0, 7-1, 9-2, 10-3, 12-4, 13-5, 15-6, 16-7, 18-8, 19-9, 20-10, 22-11, 23-12,
24-13, 26-14, 27-15, 28-16, 30-17, 31-18, 32-19, 33-20, 35-21, 36-22, 37-23

and also to stop when it is clear that none of these results can still be
reached (for example, if the score is 28-24).

Each result in the list (5-0, 7-1, ...) is chosen so that, taken on its own, it
shows that the winning program is better with 95% confidence. The practical
confidence of the whole rule is smaller, and I do not know of a good way to
calculate it except by simulation.

The probability that one particular program wins 5-0 is 1/32, which means that
the probability of a 5-0 result between equal programs is 2/32, because either
program can be the winner.

The probability of getting 7-1 between equal programs is also less than 1/10,
but the probability of getting either 5-0 or 7-1 is bigger, and I did not
calculate it.

Calculating the probability of getting any one of the results 5-0, 7-1, ... is
a problem that I do not know how to solve except by simulation.

Uri
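
The open question at the end of the post (how often two equal programs hit one
of the stop results 5-0, 7-1, ...) can also be approached without simulation.
The post itself only mentions simulation; the sketch below swaps in an exact
enumeration of every possible score sequence instead. It rests on assumptions
that are not stated in the post: decisive games are independent, each is won by
program A with a fixed probability, draws are ignored, the match stops at the
first listed result, and it stops undecided once both sides have more than 23
wins (as in the 28-24 example). The names stop_probabilities, WINS_NEEDED and
MAX_LOSSES are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction
from functools import lru_cache

# Stop results from the post, as (winner's wins, loser's wins); draws not counted.
STOP_RESULTS = [
    (5, 0), (7, 1), (9, 2), (10, 3), (12, 4), (13, 5), (15, 6), (16, 7),
    (18, 8), (19, 9), (20, 10), (22, 11), (23, 12), (24, 13), (26, 14),
    (27, 15), (28, 16), (30, 17), (31, 18), (32, 19), (33, 20), (35, 21),
    (36, 22), (37, 23),
]
# For a given number of wins by the trailing side, wins the leader needs to stop.
WINS_NEEDED = {losses: wins for wins, losses in STOP_RESULTS}
MAX_LOSSES = max(WINS_NEEDED)  # beyond this, the trailing side's score is off the list


def stop_probabilities(p_a_wins=Fraction(1, 2)):
    """Return (P(A declared stronger), P(B declared stronger), P(no decision)).

    Assumption: every decisive game is won by A with probability p_a_wins,
    independently of all other games.
    """
    p = Fraction(p_a_wins)
    q = 1 - p

    @lru_cache(maxsize=None)
    def visit(a, b):
        # a and b are the decisive wins of A and B so far.
        if b in WINS_NEEDED and a >= WINS_NEEDED[b]:
            return (Fraction(1), Fraction(0), Fraction(0))
        if a in WINS_NEEDED and b >= WINS_NEEDED[a]:
            return (Fraction(0), Fraction(1), Fraction(0))
        if min(a, b) > MAX_LOSSES:
            # No listed result can be reached any more (for example 28-24).
            return (Fraction(0), Fraction(0), Fraction(1))
        after_a = visit(a + 1, b)
        after_b = visit(a, b + 1)
        return tuple(p * x + q * y for x, y in zip(after_a, after_b))

    return visit(0, 0)


a_wins, b_wins, undecided = stop_probabilities()
print("equal programs, some result is reached:", float(a_wins + b_wins))
print("equal programs, match ends undecided:  ", float(undecided))
```

Passing a different value, for example stop_probabilities(Fraction(66, 128))
for the roughly 11 Elo difference mentioned in the quoted text, gives the
chance that the rule names the stronger or the weaker program under the same
assumptions. Exact fractions are used so that no rounding error enters the
comparison with the 2/32 figure for 5-0.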