Author: James T. Walker
Date: 13:05:24 09/04/99
On September 04, 1999 at 01:41:02, Bruce Moreland wrote:

>On September 03, 1999 at 03:51:05, Harald Faber wrote:
>
>>On September 03, 1999 at 03:16:50, Bruce Moreland wrote:
>>
>>>On September 02, 1999 at 20:02:55, Heiko Mikala wrote:
>>>
>>>>And I say what you're saying is clearly wrong. Believe me, I learned this
>>>>the hard way during the last ten years of work on my own chess program. I
>>>>often had the case that in a first test match of about 30-40 games my
>>>>program convincingly won a match, then let it play another, longer match
>>>>overnight and during the next day, which it then lost. You always need the
>>>>same number of games, no matter what the score is after a first, short
>>>>match. My experience after hundreds of test matches shows that you need at
>>>>least 70-80 games to be able to come to a conclusion. And you need some
>>>>hundred games to be sure. Even if the first 15 games end in a 15-0 score,
>>>>because the next 15 games may end 0-15. This is a frustrating fact, but it
>>>>is *a fact*. It's frustrating because for us as programmers it means that
>>>>we have to do much more time-consuming testing than we would like to do.
>>>
>>>It shouldn't work like this. You can't take a selection from somewhere in
>>>the middle of a long run of games and use that to prove anything, but if
>>>you start out and play some games, and one program wins several games in a
>>>row, you should be able to draw a safe conclusion.
>>>
>>>I would really like to understand the 15-0 and 0-15 situation. That should
>>>*not* happen. That's not how the math should work. If you flip a coin 15
>>>times and it comes up heads each time, the odds of this happening
>>>completely by chance are extremely small. The odds that it would then come
>>>up tails 15 times in a row are also extremely small, and combined they
>>>should be vanishingly small.
>>
>>Maybe Heiko exaggerated a bit, but many people have seen 9-1, 8.5-1.5, or
>>something like these results in a row where the final outcome was much
>>different, more equal...
>>
>>That was the point.
>
>There needs to be a reason for this. There does exist a rating point delta
>between two programs. Let's say that one program "should" score 75% against
>another program, which is another way of saying that it is 200 points
>stronger.
>
>But there is a problem right away: there may not be very many sufficiently
>unique games the programs can play. If my book is 1. e4 and yours is
>1. ... c5, and we are both out of book at that point, perhaps the weaker
>program will win, and it will win the repeated game, etc., forever.
>
>With larger books than one or two moves each, this should be less of a
>problem, but it will still exist. If you talk about playing a thousand
>games, you are going to get some duplicates.
>
>But then we have the concept of book learning, and what that does to the
>sample.
>
>OK. I ask you to discard all of these issues and assume that every time the
>programs play, they play a unique game. Perhaps we need to look into those
>issues, but I'm not the one with the PGN.
>
>If the programs play 100% unique games, this notion of a "real" rating point
>delta should kick in. You should see wins, losses, and draws in the proper
>proportions. If you play some huge number of games, you should get an almost
>exact correspondence between the real rating point delta and the result, if
>you take this huge sample *as a whole*.
>
>Of course, in this huge game sample, you will get sub-samples that can be
>used to prove anything you wish to prove. You can't do this; you can't take
>a section out, and I think you can't say that somehow all small samples are
>suspect. The laws of probability dictate that such samples will exist, and
>nothing can be done about this.
>
>There is a chance, always, that you'll encounter one of these crazy samples
>right at the start. But the odds that you get an extreme result due to
>engine superiority should also be calculable. For any sample, you should be
>able to determine two numbers: 1) the number of rating points by which one
>engine is superior to the other, and 2) the percent chance that the engine
>is really superior to the other by at least that amount. It is not
>necessarily true that a larger sample allows you to state that one engine is
>better than another with more confidence than a smaller sample does.
>
>This whole discussion should map onto coin testing, and that is why we're
>having this argument. We are all doing these coin-flipping matches, perhaps
>ten flips per match. The coins are all approximately fair, so you should see
>a 10-0 result maybe twice out of a thousand. But it seems as if people have
>the impression that these results are being reported a lot more often than
>they should be.
>
>I don't understand why this is, but I'd like to. Perhaps it is because
>people are running 100-flip matches and reporting on these extreme 10-flip
>sub-matches, of which there are a *lot*. You get 91 of those per 100-flip
>match. The odds of at least one 10-0 or 0-10 sub-match in a 100-flip match
>are about 16%. That is a lot more than 0.2%.
>
>Long matches can allow you to make stronger assertions with more confidence,
>but sometimes the reverse is true. The odds that the weaker program won a
>match should always be calculable, and with a long match that has a close
>score, the odds end up being *higher* that the weaker program won than in
>the case of a short match that is a wipeout.
>
>I haven't done the math, but it is probably true that you actually have less
>chance of making a mistake when you say that A is stronger than B if A wins
>a match 6-0 than if A wins by 55-45.
>
>I think it is important to consider that a larger sample may give a false
>sense of security in the case of a close match. All a long, close match
>proves is that one program is not massively stronger than the other; it does
>not prove which of them is stronger.
>
>bruce

Hello Bruce,

I think I understand what you are saying, but I also think your last couple
of paragraphs are contrary to statistical probability. I believe that
statistics show that the more data you have, the more probable it is that
your conclusion is correct.

Jim Walker
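[Editor's note: Bruce's figure that a 75% score corresponds to a 200-point rating difference can be checked against the standard Elo expected-score formula. The snippet below is a sketch; the function name `expected_score` is my own, and the formula is the usual logistic Elo model.]

```python
# Standard Elo expected-score formula:
#   E = 1 / (1 + 10 ** (-delta / 400))
# A 200-point rating advantage gives an expected score of about 0.76,
# close to the 75% figure quoted in the post.

def expected_score(delta: float) -> float:
    """Expected score for a player rated `delta` points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

print(round(expected_score(200), 3))  # about 0.76
print(round(expected_score(0), 3))    # 0.5 for evenly matched players
```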
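[Editor's note: the coin-flip figures in the thread can be checked with exact arithmetic rather than intuition. The sketch below (function name and approach are my own) confirms the "twice out of a thousand" figure for a single 10-flip match, and computes the exact probability that some 10-flip window of a 100-flip sequence is 10-0 or 0-10, i.e. that the sequence contains a run of at least 10 identical flips. The exact answer comes out noticeably below 16%; the ~16% figure is what one gets by treating the 91 overlapping windows as if they were independent, which overcounts.]

```python
# 15 heads in a row, and 15 heads followed by 15 tails:
p15 = 0.5 ** 15   # about 3.1e-05 -- "extremely small"
p30 = 0.5 ** 30   # about 9.3e-10 -- "vanishingly small"

# A 10-0 or 0-10 result in a single fair 10-flip match:
p10 = 2 * 0.5 ** 10   # 1/512, i.e. "maybe twice out of a thousand"

def prob_run_at_least(k: int, n: int) -> float:
    """Exact probability that n fair flips contain a run of >= k equal flips.

    Dynamic programming over the current run length; state[l] is the
    probability that no run of length k has occurred yet and the current
    run has length l + 1.  Mass that would extend a run to length k is
    dropped, so the survivors are exactly the "no long run" sequences.
    """
    state = [0.0] * (k - 1)
    state[0] = 1.0                       # after the first flip the run length is 1
    for _ in range(n - 1):
        nxt = [0.0] * (k - 1)
        nxt[0] = sum(state) / 2.0        # next flip differs: run resets to 1
        for l in range(k - 2):
            nxt[l + 1] = state[l] / 2.0  # next flip matches: run grows by 1
        state = nxt
    return 1.0 - sum(state)

p_exact = prob_run_at_least(10, 100)
print(round(p_exact, 3))   # well below 16%

# Treating the 91 windows as independent (they are not) inflates the figure:
p_naive = 1.0 - (1.0 - 1.0 / 512.0) ** 91
print(round(p_naive, 3))   # about 0.16
```

Either way, Bruce's qualitative point stands: an "impossible-looking" 10-0 stretch inside a long match is far more likely than a 10-0 match played in isolation.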
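[Editor's note: Bruce's "I haven't done the math" hunch about 6-0 versus 55-45 can be checked with exact binomial tail probabilities. The sketch below is mine and assumes an idealized model: independent games, no draws, and a null hypothesis that the two programs are exactly equal.]

```python
from math import comb

def tail_prob(wins: int, games: int) -> float:
    """P(A scores at least `wins` out of `games`) when each game is a fair 50/50."""
    return sum(comb(games, k) for k in range(wins, games + 1)) / 2 ** games

p_wipeout = tail_prob(6, 6)     # A wins 6-0: 1/64, about 0.016
p_close = tail_prob(55, 100)    # A wins 55-45: roughly 0.18

print(p_wipeout, p_close)
```

Under this model a 6-0 wipeout is much less likely to occur between equal programs than a 55-45 score over 100 games, so the short wipeout really is the stronger evidence of superiority, supporting Bruce's last two paragraphs against Jim's objection.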
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.