Author: Sune Fischer
Date: 16:17:27 03/27/04
>>One can prove that draws are not important if one is only interested in
>>knowing which one is better.
>
>Now this is an interesting point. My statistical analyses have assumed decisive
>games, because in my testing I've come across very few draws in C to C games.
>Is your assertion "draws are not important" because (for instance) your 20-10
>result is really a 15-5-10 result (which I think would reach my binomial
>threshold), or is there some sort of "trinomial" distribution out there that I
>should be aware of?

There might be :)
Have a look at Rémi Coulom's paper; I think it is called "who is better", from
November 24, 2002.
It explains, to those who can follow the math, why draws do not count.

>>If I suspect something might be wrong I will stop the match and investigate,
>>but one can easily imagine 10-0 or similar under proper conditions.
>
>Like newbie engines with no book (repeating the same test 10 times).
>Like, perhaps, a very small or very bad book.
>Like, what else?

First of all, I have a good imagination :)
Secondly, I've seen plenty of matches where one engine leads by 10 points, then
gets overtaken and falls behind by 5 points, and then comes back to win by 5
points.
Such things happen in long matches, so I wouldn't put any trust in 10 games,
personally.
Just try flipping a coin 10 times and see how often you get a 5-5 result:
getting 6-4/4-6 has a higher probability, and even 7-3 is IIRC nearly as
likely.

>>In fact yesterday I played a match where the score after 15 games was 13.5-1.5
>>in favor of the new version.
>>It actually ended up losing the match by 49-51 :(
>
>I would argue, based on your final result, that you cannot conclude anything
>about the difference between your two programs. I certainly wouldn't throw out
>your new version on that result. That you got to a 12 game difference and then
>ended up with an inconclusive result is highly unusual, but of course not
>impossible. I most likely would have stopped testing at or before the 13.5-1.5
>game point, and at the 100 game point, I don't think you can prove I would
>have made a mistake.

You would have concluded that the new version was clearly stronger, instead of
concluding that they are probably very close.
I would call that a mistake.

>When you're playing 100 game matches, and therefore have seen 1000s of games,
>I'm not surprised you've seen 6-8 game streaks. But I would argue that you
>haven't learned anything by playing such long matches. Yes, those streaks
>might make me accept a bad new version. But after 100 games, you can only
>still say that I "might" have made a mistake.
>
>One weakness to my testing method (go until you get a "significant" result (I
>use 95% confidence) or until I get bored), is that it smacks of
>self-selection.

The problem is often that the differences are very small; we may be talking
about 5-10 Elo, in which case it is expected that they will be close.
I've seen these confidence tables; it's usually something like: if you lead by
30 points after 100 games you have 95% confidence.
However, we never get a 30-point lead in a 100-game match between almost equal
engines, so getting to 95% is usually not possible.

>If one waits long enough, chance will ensure the answer one wants. So picking
>a 30 or 100 game limit seems a reasonable safeguard against this.
>
>What to do with your inconclusive result in such cases is another matter. If
>your 49-51 result were testing your implementation of null move, I'd be
>worried. If you were only messing around with some eval weights, I'd be
>reassured that I hadn't broken anything TOO badly.
>
>Bottom line, this stuff is hard :)

Definitely.
If you want to revolutionize computer chess, don't try to invent a new
algorithm; rather, find a way to quickly test for small improvements! :)

-S.

>Pat
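The arithmetic behind the coin-flip and confidence-table remarks above can be
checked in a few lines of Python. This is a minimal sketch under the usual
assumptions: games are independent, the "both engines equal" null hypothesis is
a fair coin, and draws are simply dropped before counting (the sign-test
reading of "draws do not count"):

```python
from math import comb

def prob(n, k):
    """Probability of exactly k heads in n fair coin flips."""
    return comb(n, k) / 2**n

def tail(n, k):
    """Probability of k or more heads in n fair coin flips."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# A 10-game match between equal engines, read as 10 coin flips.
p_5_5 = prob(10, 5)                    # exactly even
p_6_4 = prob(10, 6) + prob(10, 4)      # 6-4 either way
p_7_3 = prob(10, 7) + prob(10, 3)      # 7-3 either way
print(f"5-5: {p_5_5:.4f}  6-4/4-6: {p_6_4:.4f}  7-3/3-7: {p_7_3:.4f}")

# Smallest number of wins out of 100 decisive games that rejects
# "equal strength" at one-sided 95% confidence.
wins = next(w for w in range(50, 101) if tail(100, w) <= 0.05)
print(f"{wins}-{100 - wins} in 100 decisive games gives 95% confidence")

# Pat's 20-10 example read as 15-5-10: drop the 10 draws and
# sign-test the 15 wins against 5 losses in 20 decisive games.
p_value = tail(20, 15)
print(f"one-sided p-value of 15-5 in decisive games: {p_value:.4f}")
```

The exact numbers: 5-5 comes up about 24.6% of the time, 6-4/4-6 about 41%,
and 7-3/3-7 about 23.4%, so a dead-even score is actually the less likely
outcome. Under this pure no-draw model a 59-41 score already reaches one-sided
95% confidence, and the 15-5-10 reading of the 20-10 result clears the 95%
threshold (p ≈ 0.021).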