Author: Pat King
Date: 15:21:56 03/27/04
On March 26, 2004 at 20:14:15, Sune Fischer wrote:

>On March 26, 2004 at 04:54:57, Uri Blass wrote:
>
>>On March 24, 2004 at 17:31:35, Dann Corbit wrote:
>>
>>>On March 24, 2004 at 16:53:08, Uri Blass wrote:
>>>[snip]
>>>>The difference is more important and 10-0 is clearly more telling than 19-11
>>>
>>>It is stronger, but less reliable.
>>
>>No 10-0 is clearly more reliable than 19-11
>
>The interesting question is if 10-0 is more "reliable" than 20-10, and it isn't.

From a statistical viewpoint, it is. 10-0 far exceeds 99% confidence, whereas 20-10 doesn't quite reach 95% confidence (see my table of "significant" wins elsewhere in this thread).

>
>One can prove that draws are not important if one is only interested in knowing
>which one is better.

Now this is an interesting point. My statistical analyses have assumed decisive games, because in my testing I've come across very few draws in C to C games. Is your assertion "draws are not important" because (for instance) your 20-10 result is really a 15-5-10 result (which I think would reach my binomial threshold), or is there some sort of "trinomial" distribution out there that I should be aware of?

>
>Note this is not to be confused with the question of how much difference in
>strength there is.
>It's two very different questions.

A point I granted Dann elsewhere in the thread.

>
>>It usually will not happen but it does not mean that it is less reliable when
>>it
>>happens(you may suspect that something in the conditions is wrong when you see
>>10-0 but if you see that no program was significantly slower in nps during the
>>match than you can safely stop the match after 10-0 and say that the new program
>>is better).
>
>If I suspect something might be wrong I will stop the match and investigate, but
>one can easily imagine 10-0 or similar under proper conditions.

Like newbie engines with no book (repeating the same test 10 times). Like, perhaps, a very small or very bad book. Like, what else?
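For anyone who wants to check figures like the ones above, here is a minimal sketch of an exact binomial test. It assumes the null hypothesis that the two programs are equally strong (each decisive game is a coin flip) and it simply drops draws, which is one reading of "draws are not important", not a proof of it. The names `binom_tail` and `report` are made up for this example, and this is not necessarily how my table elsewhere in the thread was built.

```python
from math import comb

def binom_tail(wins, games):
    """One-sided p-value: P(X >= wins) when X ~ Binomial(games, 0.5)."""
    return sum(comb(games, k) for k in range(wins, games + 1)) / 2 ** games

def report(wins, losses, draws=0):
    """Print one- and two-sided confidence for a result, ignoring draws (assumption)."""
    games = wins + losses            # draws are dropped, not scored as half points
    p_one = binom_tail(max(wins, losses), games)
    p_two = min(1.0, 2 * p_one)      # symmetric null, so two-sided is double one-sided
    print(f"{wins}-{losses}-{draws}: one-sided {1 - p_one:.4f}, two-sided {1 - p_two:.4f}")

for result in [(10, 0), (19, 11), (20, 10), (15, 5, 10)]:
    report(*result)
```

Under these assumptions 10-0 has a one-sided p-value of about 0.001, 20-10 about 0.049 one-sided (about 0.099 two-sided), and 15-5 with the draws removed about 0.021. So whether 20-10 "reaches 95%" depends on the one-sided versus two-sided choice and on whether an exact or approximate test is used.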
>
>In fact yesterday I played a match where the score after 15 games was 13.5-1.5
>in favor of the new version.
>It actually ended up losing the match by 49-51 :(

I would argue, based on your final result, that you cannot conclude anything about the difference between your two programs. I certainly wouldn't throw out your new version on that result. That you got to a 12 game difference and then ended up with an inconclusive result is highly unusual, but of course not impossible. I most likely would have stopped testing at or before the 13.5-1.5 point, and at the 100 game point, I don't think you can prove I would have made a mistake.

>
>Honestly I do not remember having seen such a drastic score difference before,
>but I do regularly see a sequence of 6-8 straight wins by one of the engines in
>100 game match, so it's not impossible to imagine this might occur at the
>beginning of the match.
>
>-S.

When you're playing 100 game matches, and therefore have seen 1000s of games, I'm not surprised you've seen 6-8 game streaks. But I would argue that you haven't learned anything by playing such long matches. Yes, those streaks might make me accept a bad new version. But after 100 games, you can still only say that I "might" have made a mistake.

One weakness of my testing method (go until I get a "significant" result, for which I use 95% confidence, or until I get bored) is that it smacks of self-selection. If one waits long enough, chance will ensure the answer one wants. So picking a 30 or 100 game limit seems a reasonable safeguard against this.

What to do with your inconclusive result in such cases is another matter. If your 49-51 result were testing your implementation of null move, I'd be worried. If you were only messing around with some eval weights, I'd be reassured that I hadn't broken anything TOO badly.

Bottom line, this stuff is hard :)

Pat
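The self-selection worry can be illustrated with a small simulation. This is only a sketch under stated assumptions: two exactly equal engines, decisive games only, and a check after every game for a one-sided 95% binomial result in favor of whichever side is ahead. The function names, the 100 game cap, and the seed are choices made for this example. Each simulated match is judged twice on the same games: once by "stop as soon as it looks significant" and once by "look only after the last game".

```python
import random
from math import comb

def binom_tail(wins, games):
    """One-sided p-value: P(X >= wins) when X ~ Binomial(games, 0.5)."""
    return sum(comb(games, k) for k in range(wins, games + 1)) / 2 ** games

def win_thresholds(max_games, alpha):
    """For each match length n, the fewest wins with p-value below alpha (None if unreachable)."""
    table = [None] * (max_games + 1)
    for n in range(1, max_games + 1):
        for w in range(n // 2 + 1, n + 1):
            if binom_tail(w, n) < alpha:
                table[n] = w
                break
    return table

def simulate(trials=2000, max_games=100, alpha=0.05, seed=1):
    """Return (peeking rate, fixed-length rate) of 'significant' results between equal engines."""
    rng = random.Random(seed)
    need = win_thresholds(max_games, alpha)
    peek_hits = fixed_hits = 0
    for _ in range(trials):
        a_wins = 0
        peeked = False
        for n in range(1, max_games + 1):
            a_wins += rng.random() < 0.5          # equal engines: a coin flip per game
            lead = max(a_wins, n - a_wins)        # score of whichever side is ahead
            if need[n] is not None and lead >= need[n]:
                peeked = True                     # a tester who peeks would have stopped here
        peek_hits += peeked
        final_lead = max(a_wins, max_games - a_wins)
        fixed_hits += need[max_games] is not None and final_lead >= need[max_games]
    return peek_hits / trials, fixed_hits / trials

peek_rate, fixed_rate = simulate()
print(f"false 'significant' results: peeking {peek_rate:.3f}, fixed length {fixed_rate:.3f}")
```

Because a match that is significant at the final game also triggers the peeking rule, the peeking rate can never be lower than the fixed-length rate on the same games, and in practice it comes out noticeably higher. A game limit caps how far the inflation can go, but it does not remove it.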
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.