Author: Peter Fendrich
Date: 10:59:52 08/15/98
On August 14, 1998 at 20:04:11, Bruce Moreland wrote:

>My reason for posting on this topic is that people seem to think that they can
>do an N-game match, with some suitably comforting value of N, and take the
>results as significant, which in this case means, I guess, truthful, regardless
>of the score.
>
>I suspect that in matches that are fairly close, which most of them will be (I
>think), that you will end up having, for lack of a better term (yet), a range of
>not incredibly unlikely error which exceeds the Elo delta that can be computed
>from the score of the match.

No objections... There are a couple of ways to deal with match results without
using Elo at all. Look at the end of this post for an example.

>I think that most close matches are likely to produce an inconclusive result,
>rather than a hard-fought and exhausting match where "the best program won".
>
>I think the amount by which you might be mistaken would decrease if you ran more
>trials, but the score of a match between two approximately equal programs would
>tend to tighten up, as well. You might be better able to determine that "A and
>B aren't too much different", but it still could be a stretch to say "A is
>better than B".
>
>I don't have any problem with the matches themselves, only with the conclusions.

That depends on the conclusion. In fact, a close 200-game match gives good
confidence that A and B are close in strength!

>A 4-0 blowout *should* be a rare thing, and even though the error margin is
>large, it is still a massive blowout. It might be interesting to find out how
>often it happens between roughly equal programs, it should happen just a few
>percent of the time, depending upon draw percentage (less draws means it should
>happen more often).
>
>I would love to hear from anyone who is competent in this area, who could tell
>me with authority where I am messing up.
>
>I freely admit I might be wrong, and I've heard from several people who think
>that 4-0 is pretty common and means nothing, but I really would like to figure
>out *why*, since this should be rare between equal programs.

A 4-0 result is common enough to make me at least suspicious, and to make me
look more closely at the games themselves to see if something there contradicts
the result. Furthermore, the 4-0 result is a very special case where the zero
part leaves you an open end when drawing conclusions. See my answer to Dan Homan
further down in this thread.

WARNING! Don't read the following if you hate formulas! :)

For the case you had before, with the result 105-95 between A and B, I would
proceed like this:

1. Suppose the result is W=95 won, D=20 drawn and L=85 lost games for A.
   (The split between wins, draws and losses really makes a difference.)

2. Compute m = 105/200 = 0.525 (n=200 is the number of games).
   m is the expected result of each game, so the expected result from the
   next 100 games is 52.5 - 47.5.

3. Compute the 'standard deviation' s of the game results around m:
   s = SQRT( (W*(1 - m)**2 + D*(0.5 - m)**2 + L*(0 - m)**2)/(n-1) )
   where SQRT is the square root and **2 means raised to the power of 2.
   With our numbers above this gives:
   s = SQRT( (95*(0.475)**2 + 20*(-0.025)**2 + 85*(-0.525)**2)/199 ) = 0.475
   You could say that s is a measure of how "spread out" the results are
   around the mean value m. A close match with mostly draws is more "stable"
   than one with no draws at all, and gives more confidence.

4. Compute the confidence interval:
   Half-width: 1.96 * s / SQRT(n), where 1.96 gives 95% confidence.
   (1.64 would give 90% and 2.58 gives 99% confidence. You can find the
   values in tables for the normal distribution.)
   With our numbers: 1.96 * 0.475/SQRT(200) = 0.066
   So the lower limit will be 0.525 - 0.066 = 0.459
   and the upper limit will be 0.525 + 0.066 = 0.591

5. Conclusion: With 95% confidence I say that "A will in the long run get
   between 0.459 and 0.591 points per game against B".
   (I know that a single game can't end like this! These are mean values.)
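For anyone who would rather let the computer do the arithmetic, steps 1 to 5 can be sketched in a few lines of Python. This is only a minimal sketch under the same assumptions as above (normal approximation, results scored 1 / 0.5 / 0); the function name score_interval and its parameters are made up for this illustration.

```python
import math

def score_interval(wins, draws, losses, z=1.96):
    """Mean score per game, spread s, and confidence limits for the mean.

    z = 1.96 corresponds to 95% confidence (normal approximation).
    Hypothetical helper, named for this example only.
    """
    n = wins + draws + losses
    m = (wins + 0.5 * draws) / n              # step 2: mean score per game
    ss = (wins * (1.0 - m) ** 2               # step 3: squared deviations
          + draws * (0.5 - m) ** 2
          + losses * (0.0 - m) ** 2)
    s = math.sqrt(ss / (n - 1))
    half_width = z * s / math.sqrt(n)         # step 4: half-width of interval
    return m, s, m - half_width, m + half_width

m, s, low, high = score_interval(95, 20, 85)
print(f"m = {m:.3f}, s = {s:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
# prints: m = 0.525, s = 0.475, 95% interval = [0.459, 0.591]
```

Changing z to 1.64 or 2.58 gives the 90% and 99% intervals mentioned in step 4.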

Is that a fairly good chance that A is better than B in matches between them?
Only if the whole interval lies above 0.5; if 0.5 falls inside the interval,
the match has not settled the question at that confidence level.

If you would rather have a hypothesis like "A is better than B" with X%
confidence, you can use the same formulas with a different approach. Look under
"Null hypothesis" in your book and use the hypothesis m > 0.5. With n games,
just compute what X will be. If you don't find it in your book I will be glad
to write it down. It's a little story in itself...

//Peter
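
The "what X will it be" computation can be sketched as a one-sided test with the normal approximation: measure how many standard errors the mean score lies above 0.5 and look that up in the normal distribution. This is an assumption about the approach, not necessarily the exact derivation meant above, and the function name confidence_better is hypothetical.

```python
import math

def confidence_better(wins, draws, losses):
    """Approximate one-sided confidence that the long-run score exceeds 0.5.

    Uses the normal approximation; hypothetical helper for illustration.
    """
    n = wins + draws + losses
    m = (wins + 0.5 * draws) / n
    ss = (wins * (1.0 - m) ** 2
          + draws * (0.5 - m) ** 2
          + losses * (0.0 - m) ** 2)
    s = math.sqrt(ss / (n - 1))
    if s == 0.0:
        # Every game had the same result, so the spread is zero and the
        # approximation breaks down.
        raise ValueError("no variation in results; cannot estimate confidence")
    z = (m - 0.5) / (s / math.sqrt(n))        # distance from 0.5 in standard errors
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal CDF at z

x = confidence_better(95, 20, 85)
print(f"Confidence that A is better than B: {x:.2f}")
```

For the 105-95 match with W=95, D=20, L=85 this comes out at roughly 0.77, that is, about 77% confidence that A is the better program in matches against B.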
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.