Author: Bruce Moreland
Date: 18:00:27 08/09/98
On August 09, 1998 at 09:59:47, Christophe Theron wrote:

>On August 07, 1998 at 14:30:32, Thorsten Czub wrote:
>
>>The best thing would be: 30 games against a, 30 games against b, 30 games
>>against c etc. But often there is not much time.
>>
>>So you maybe only play 10 games.
>
>NEVER. 10 games is NOT ENOUGH.
>
>I have found that a 30 games match has a +/- 5% error margin. That is a 50%
>result can mean 45% to 55%.
>
>So a 10 games match gives NO USEFUL INFORMATION (unless you are playing Chess
>Challenger 3 against Rebel 10).
>
>Of course if it is 10 games against A, 10 against B and 10 against C, it is a 30
>games match against an average opposition and it is useful. I suppose it is what
>you meant?

You guys are both not quite right, I think, although Christophe's comment about Rebel vs Challenger 3 is pretty close.

If you are trying to prove that program A is stronger than program B (but not by how much it is stronger), then sometimes 30 games is nowhere near enough, and sometimes 10 games is way too many.

If you are trying to test whether a coin is more apt to come up heads than tails, you can never know for absolute sure by flipping it, since any result is possible. But you can determine the chance that a fair coin would have given you a particular result, and if that chance is low enough, you can say that the coin probably isn't fair.

You can start flipping the coin, and if at any point you get a result that indicates that the coin is unfair, you can stop.

For instance, you may decide to consider the coin unfair if you get a result that would happen less than 5% of the time with a fair coin. If you flip the coin five times and it comes up heads all five times, the probability of this happening with a fair coin is 1/32, which is less than 5%, so you can conclude that the coin isn't fair. You can say, "this coin is at least slightly more likely to come up heads than tails."
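The 1/32 figure can be checked with a short calculation. This is only a minimal sketch of a one-sided test against a fair coin; the function name one_sided_p_value is made up for illustration, and it assumes the number of flips was fixed before you started.

```python
from math import comb

def one_sided_p_value(heads, flips):
    """Probability of at least `heads` heads in `flips` flips of a fair coin.

    Hypothetical helper for illustration: assumes independent flips and
    a number of flips chosen in advance.
    """
    # Count the outcomes with `heads` or more heads out of 2**flips equally
    # likely sequences.
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favorable / 2 ** flips

print(one_sided_p_value(5, 5))  # 0.03125, i.e. 1/32, below the 5% threshold
print(one_sided_p_value(4, 5))  # 0.1875, not below the 5% threshold
```

So five heads out of five clears a 5% threshold, while four heads out of five does not.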
If you don't get such a dramatic result, and there are tails mixed in with the heads, it will take you much longer to find an unfair coin. But when you finally do get 95% confidence that the coin is unfair, you are not one bit more certain that the coin is unfair than if you had flipped it five times and it came up heads each time.

Chess isn't coin flipping. White has a better chance of winning than black, and there are draws. I don't know exactly what effect this has, but I think the existence of draws decreases the number of trials you need to get significance if you have a wipeout result.

So I think 4-0 actually turns out to be a significant result. If you score 4-0, you can say that there is a very good chance that the one with the wins is better than the one with the losses. You can't say this if you pick out a string of 4 wins in a row in the midst of a longer match, since you might be selecting a fluke case, but if you just start from scratch and get 4-0, you should be able to stop.

In fact I think you might be able to stop if you get 3.5 - 0.5, but I am less certain of this case. Someone who knows more statistics than I do may be willing to comment on this.

Now, if you do 30 games, you might think you are safe, but you are probably not. I'm sure you can get some results where one side wins by a few games, perhaps even quite a few games, and you still may not have proven with reasonable confidence that one program is stronger than the other.

And remember that what I'm talking about is "stronger", not "markedly stronger". If you get 4-0 you don't prove that one program is hundreds of points stronger than the other one, just that it is at least slightly stronger.

I'm sure there is a way you can say, "I have shown that there is a 95% chance that program A is at least 30 Elo points stronger than program B", but I'm not sure exactly how to do it with confidence.
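One rough way to explore the effect of draws is to compute how often two equal programs would produce a given score. This is a sketch under loudly stated assumptions, not an established method from this thread: games are independent, the white/black difference is ignored, and the draw rate is a number you have to guess. The function name score_p_value and the 30% draw rate are hypothetical.

```python
def score_p_value(points, games, draw_rate):
    """Probability that one of two EQUAL programs scores at least `points`
    in a match of `games` games.

    Assumptions (not from the original post): independent games, each a
    draw with probability `draw_rate`, otherwise a win or a loss with equal
    probability. Color is ignored.
    """
    win = loss = (1.0 - draw_rate) / 2.0
    # dist[h] is the probability of having scored h half-points so far.
    dist = [1.0]
    for _ in range(games):
        nxt = [0.0] * (len(dist) + 2)
        for h, p in enumerate(dist):
            nxt[h] += p * loss           # loss adds 0 half-points
            nxt[h + 1] += p * draw_rate  # draw adds 1 half-point
            nxt[h + 2] += p * win        # win adds 2 half-points
        dist = nxt
    target = round(points * 2)
    return sum(dist[target:])

print(score_p_value(4, 4, 0.0))    # 4-0 with no draws possible
print(score_p_value(4, 4, 0.3))    # 4-0 with an assumed 30% draw rate
print(score_p_value(3.5, 4, 0.3))  # 3.5-0.5 with the same assumed draw rate
print(score_p_value(18, 30, 0.3))  # an 18-12 result over 30 games
```

Under these assumptions a 4-0 result has probability 1/16 when draws are impossible, which is just above 5%, and about 1.5% with a 30% draw rate, so a nonzero draw rate does make a wipeout rarer between equal programs. The catch is that the answer depends on a draw rate you don't really know.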
In practice, I think that to show a specific Elo margin like that, program A has to beat program B pretty badly, although less badly if you do more trials.

Some simple statistical concepts were regarded as top secrets during World War II, because they allowed researchers to prove that a drug was effective with sometimes very few trials. Not everyone had figured this out by then.

bruce
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.