Author: Bruce Moreland
Date: 07:42:04 08/14/98
On August 14, 1998 at 06:08:36, Dan Homan wrote:

>On August 12, 1998 at 09:15:19, Bruce Moreland wrote:
>
>>
>>On August 11, 1998 at 06:52:02, Tony Hedlund wrote:
>>
>>>
>>>>So I think 4-0 actually turns out to be a significant result. If you score 4-0,
>>>>you can say that there is a very good chance that the one with the wins is
>>>>better than the ones with the losses.
>>>>
>>>>You can't say this if you pick out a string of 4 wins in a row in the midst of a
>>>>longer match, since you might be selecting a fluke case, but if you just start
>>>>from scratch, and get 4-0, you should be able to stop. In fact I think you
>>>>might be able to stop if you get 3.5 - 0.5, but I am less certain of this case.
>>>>Someone who has more statistics than I may be willing to comment on this.
>>>
>>>Recently I played the match Shredder2 P200 MMX 64MB - Rebel8 P90 16MB.
>>>Rebel won the first four games but Shredder won the match with 11-9.
>>
>>That shouldn't happen very often.
>>
>>bruce

If this does happen more often than it should, perhaps some effort could be expended to figure out why. I don't know why it would happen more often than it should.

>The problem with small number statistics is that they can be very
>misleading. A 4-0 result in a 4 game match between nearly equal
>programs (with 20% draw chances) happens about 1/40th of the time.
>A 3.5-0.5 (or better) result happens about 1/13th of the time.

If you want to be able to say, "program A is stronger than program B, by at least a little bit", would you rather have a 4-0 result or a 105-95 result?

By your calculation, you should get a bogus result from 4-0 only 2.5% of the time. How much do you want to bet that you'd get a bogus result from 105-95 a lot more often?

So why do I think that most people would be more confident saying that A is stronger than B in the latter case?

>If program A beats program B by a score of 4-0, this means that A has
>a 97% (roughly) chance of being stronger than B. So it seems like a
>pretty good bet that A is better than B, but consider the following
>scenario.
>
>Say that you use this 4-game match technique to test new versions of
>your program versus older versions. Whenever you make a change you
>run one of these matches and decide to keep the change only if you get
>a 4-0 result. Because you have a very well developed program, most
>changes will have almost no effect on playing strength. Even changes
>that do increase the playing strength slightly will not affect the
>1/40 odds of getting a 4-0 result very much. So, you will get a
>4-0 result 1/40th of the time - regardless of whether the change
>you make is good or bad. So using these 4-game matches to decide
>on playing strength increases will cause you to randomly select
>which versions to keep and which to discard.

Interesting idea, but consider this case.

Say you make no change at all. It is the same code, compiled a second time, byte for byte equivalent, with the only difference being the date/time stamp on the executable.

If you run these two programs against each other and get a 4-0 result, do you conclude that the newer one is stronger than the older one because of its different date/time stamp?

Ok, let's say you don't get 4-0, but instead you get some other result which is also 97% significant, but which is based upon some larger number of trials. Do you have more or less confidence in this? You shouldn't, because the significance is still the same, which means that the odds that you will get this result by chance should be the same.

Maybe part of the reason people are worried about accepting the results of shorter matches is that it is a lot easier to mess up a chess game trial than it is to mess up a coin flip trial. I don't know.

>So a 97% confidence isn't that helpful after all - at least not for
>what we chess programmers do. The problem is that we are trying to
>discriminate small differences in playing strength and a 4-game match
>just can't do that with any reliability.

Neither can a larger match. If you do a huge match and get 105-95, you can't just say, "Boy, that was a long match, and one program came out with more points, it must be better." The odds in this case have to be more than 50%, but maybe not much more.

bruce
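
For anyone who wants to check the figures quoted above, here is a rough sketch of the arithmetic. It is not anything either poster ran. It assumes two exactly equal programs, independent games, and the 20% draw chance mentioned in the quote, so each side wins 40% of the time. The function names are made up for illustration. Under those assumptions a 4-0 result has probability 0.4^4 = 0.0256 (about 1/39), and 3.5 points or better has probability 0.0256 + 4 * 0.4^3 * 0.2 = 0.0768 (about 1/13). The same code can be pointed at a 200-game match to see how often one of two equal programs reaches 105 points by chance.

```python
def score_distribution(games, p_win, p_draw):
    """Return {half_points: probability} for one side's total score.

    Scores are tracked in half-points so that draws stay integral:
    a win adds 2, a draw adds 1, a loss adds 0.
    """
    p_loss = 1.0 - p_win - p_draw
    dist = {0: 1.0}
    for _ in range(games):
        nxt = {}
        for half_points, prob in dist.items():
            for gain, q in ((2, p_win), (1, p_draw), (0, p_loss)):
                key = half_points + gain
                nxt[key] = nxt.get(key, 0.0) + prob * q
        dist = nxt
    return dist


def prob_at_least(games, points, p_win=0.4, p_draw=0.2):
    """Chance that one named side scores at least `points` in `games` games.

    The defaults model equal programs with a 20% draw rate (an assumption
    taken from the quoted post, not a measured value).
    """
    target = round(points * 2)
    dist = score_distribution(games, p_win, p_draw)
    return sum(prob for half_points, prob in dist.items() if half_points >= target)


if __name__ == "__main__":
    print("4-0 in 4 games:       ", prob_at_least(4, 4))
    print("3.5+ in 4 games:      ", prob_at_least(4, 3.5))
    print("105+ in 200 games:    ", prob_at_least(200, 105))
```

Under this model the 105-95 case comes out far more likely to occur between equal programs than the 4-0 case, which is the comparison being argued in the post. Different draw rates will shift the numbers.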
Last modified: Thu, 15 Apr 21 08:11:13 -0700