Author: Ricardo Gibert
Date: 21:29:42 02/18/01
Go up one level in this thread
On February 18, 2001 at 16:14:34, Peter Fendrich wrote: > >In another message Bruce Moreland raised the question >about how to interpret a match result. >With a match result between two players, he wanted to >express things like: >Player1 is better than Player2 with a certain significance >level. You can't do this with a *match* between the two players. It is possible that player A can consistently beat Player B in a match, but player B consistently has the higher rating as acquired against the general population of players. An example of how this would effect engine vs engine matches would be as follows: Program A has an opening book that is optimized to beat program B. Not surprisingly, program A edges out program B in any match played between the two, but if program B performs better than program A against other programs, then clearly program B is the "stronger" program even though program A is able to "dominate" B. When you measure the performance of player A against player B, you are measuring the ability of player A to "dominate" player B. You are *not* really measuring the playing strength of A. You do that by measuring its performance against a *population* of players. >The following algorithms will try to give a general method >to deal with and evaluate match results. The Bruce example >will show up in the end of this message. > >This is not found in the standard statistic course and I >haven't seen any description of this so I wanted to give it >a try because it should be rather straight forward. I have >to admit however that it gave me some headache before the >obvious solution appeared. > >My intention is to create a program based on these ideas, >maybe adjusted after your comments, and place that program >here or on any other well known site so we don't have to >struggle all this over and over again whenever some new >results drop in. :) > >If you don't want to read all this "crap" you can instead >get my little program and test it. In the end of this message >there is some more info. >Just send me an email and I'll return the source code and an -exe file. > >Be patient because the program isn't written yet but very soon!!! > >Some basic definitions and formulas >=================================== >The real problem here is when we have small number of games. >When we have a lot of games he normal distribution could be used. >We will use the Trinomial distribution which is the only one >giving the exact probabilities when the outcome has three distinct >outcomes. > >N = number of games >W = number of wins (for Player1) >D = number of draws >L = number of losses (for Player1) >Pw = probability of a win (for Player1) >Pd = probability of a draw >Pl = probability of a loss (for Player1) >hence N = W+D+L and Pw+Pd+Pl = 1.0 > >P = probability of a certain match result with exactly W, L and D >cumP = probability of a certain match result or better (for Player1) > This function is often used but we will avoid it here. >With the help of the Trinomial distribution we can get the P if we >know the values of both Pw and Pd: >P(N,W,D,Pw,Pd) = (N! / W!D!L!) Pw^W * Pd^D * Pl^L where L=N-W-D > >One problem, of course, is that we don't know the real values of >Pw, Pd and Pl. > >The Question to answer >====================== >Given a match result W, D and L between Player1 and Player2. > >H0: Player1 better than Player2 or in other words > Pw+Pd/2 > 0.5 >P(H0) is the probability that H0 is true. (I don't care about rejecting >H0 because it's not necessary) > >Let's start with the assumption that we don't have any prior knowledge >of the distribution of Pw and Pd. That means that they are evenly >distributed within the area formed by Pw and Pd each as an axis between >0 to 1 and where Pw+Pd <= 1. > >As said before the probability that a certain game result will occur >given that Pw and Pd is computed from the Trinomial function. > >Now let's look at the analytical solution. > >First compute the integral of that Trinomial function (with W,D and >L fixed) over both Pw and Pd. Pw is in interval 0 to 1 and Pd is in >the interval 0 to 1-Pw for each Pw. We now have a double integral that >is the complete space of all probabilities of all possible combinations >of Pw and Pd for a specific match result. >Let's call this result Pcomplete. > >Secondly let's take the same integral for all Pw+Pd/2 > 0.5. It means >that Pw is in the interval 0 to 1 and Pd from max(0, 1-2Pw) to (1-Pw) >for each Pw. This integral is the space of all probabilities of all >possible combinations of Pw and Pd when Player1 is the best one >(Pw+Pd/2 > 0.5). >Let's call this result Pbetter. > >The final conclusion >==================== >The probability that Player1 is better than Player2 is Pbetter/Pcomplete. > >A numeric method >================ >Maybe the theory above is hard to grasp but the following is a computer >simulation that should give the same result in the long run. > >1. We have a result, let's say 7-3 with W=6, D=2 and L=2. > Declare 2 counters. One representing A better than B and the > other B is better than or equal to A. >2. Pick a pair of Pw, Pd randomly so that 0 <= Pw+Pd <= 1. >3. Compute P = Trinomial(10, 6, 2, Pw, Pd) >4. Play a match with 10 games where each game has a random outcome > weighted by P. >5. If (the match result is 7-3 with W=6, D=2 and L=2) > If (Pw+Pd/2 > 0.5) increment counter1 > else increment counter2 > else do nothing >6. Loop 2-5 MANY MANY times. >7. Compute counter1/(counter1+counter2) > >This will get us the probability for the match result with no prior >knowledge of Pw and Pd. If we know anything at all about Pw or Pd we >could let the random picking under 2 reflect this. >In order to make this code a bit more effective we could change 4 and 5 >to >4. If (Pw+Pd/2 > 0.5) Add P to counter1 (must now be a double) > else Add P to counter2 (also a double) >5. No operation >We will get the same result but faster. >In order to reflect any prior knowledge we could use weights in the >integrals above to capture that knowledge. >In fact the last algorithm is equal to how the integral above would >be computed with the Monte Carlo method. The alg given here is a perfect >simulation of the integral that was showed to be the solution. > >My little program >================= >If we could agree of the above theory then it is best documented >in a program. >Given a match result the pgm will produce: > - What is the probability to get such a result depending. on the > strength difference between Player1 and 2 > - What is P(H0) meaning the probability that Player1 is better than Player2. > >When it works I would like to extend the pgm with some other functions. >Send me an email and I will return the program to anyone interested. > >A few other comments >==================== >I don't rely on figures like this, only! >The statistics used here is based on the match outcome only. There is >a lot of other information that can't covered by the bare match result >numbers. By looking at the individual games we will get a lot of other >information that is hard to include in statistic methods. Patterns of >all kinds, like Player2 is broken and always loses in 4 moves or whatever, >might be more valuable than the result 10-0 only. > >Furthermore I'm not that found of this A better than B hypothesis. I am >more found of other methods like confidence intervals. Maybe we could >include that as well in the program later on. > >For matches counting many games, like 100 or more, we could use the Normal >distribution instead of the described number crunching but exact method. >It should give about the same results. The Normal distribution offers a >variety of useful functions and is the basis for the ELO system. > >When dealing with match results and not knowing the number of >draws we could set D to either some common sense value of the number >of draws or the worst case which is as few draws as possible that still >follows the match result. >The result 60-40 with W=60, D=0, L=40 is more "spread out" and shows >more variance, meaning less accurate, than with W=20, D=80, L=0. See the next >topic. > >The "Bruce example" >=================== >We have two results 10-0 and 60-40 respectively. Which one is the most >'valid'? Let's translate 'is the most valid' with 'has the highest >probability that A is better than B'. We will see that it dependes of the >number of draws in the 60-40 match. >The following figures are computed by my program and are subject to errors >that I haven't tried to estimate yet. At least 4 decimals should be ok. > >1) 10 games between A and B. Result 10-0 > >Let's try some different values of Pw and Pd > >10 games. 10-0-0, Result: 10-0 >A few combinations of Pw and Pd: >P(Win) P(Draw) P(res) cumP >------ ------- ----------- --------- >0.33 0.33 0.000016935 0.0000169 >0.40 0.20 0.000104858 0.0001049 >0.50 0.00 0.000966841 0.0009668 >0.37 0.26 0.000048086 0.0000481 >0.33 0.35 0.000013147 0.0000131 >0.01 0.99 0.000000000 0.0000000 >0.80 0.10 0.107374182 0.1073742 > >Probability for A better than B is: 0.9995087 > >2) 100 games between A and B. Result: 60-40 > (Nothing is said about number of draws. Let's try two cases) > > a) No draws in the match > 100 games. 60-40-0, Result: 60-40 > A few combinations of Pw and Pd: > P(Win) P(Draw) P(res) cumP > ------ ------- --------------- ---------------- > 0.33 0.33 2.66726982e-020 0.008335829 > 0.40 0.20 2.20893466e-012 0.01447611 > 0.50 0.00 0.00981144545 0.02784797 > 0.37 0.26 9.08565295e-016 0.01155849 > 0.33 0.35 2.12089826e-021 0.007664598 > 0.01 0.99 1.08438667e-202 2.350298e-026 > 0.80 0.10 2.10660425e-018 1.000000000 > > Probability for A better than B is: 0.9759706 > > b) Maximum number of draws (80) > 100 games. 20-0-80, Result: 60-40 > A few combinations of Pw and Pd: > P(Win) P(Draw) P(res) cumP > ------ ------- --------------- ----------------- > 0.33 0.33 1.03987806e-027 0.008335829 > 0.40 0.20 7.12444101e-044 0.01447611 > 0.50 0.00 5.01027042e-226 0.02784797 > 0.37 0.26 1.95458083e-035 0.01155849 > 0.33 0.35 3.10643366e-026 0.007664598 > 0.01 0.99 2.2875309e-026 2.350298e-026 > 0.80 0.10 6.17946754e-062 1.000000000 > > Probability for A better than B is: 0.9999995 > >...and in the end >================= >Thanks for reading all this. I hope there are no typos or logical >errors, but no warranties at all are given! > >The word "complete" in the subject was meant to provoke a little. >There are a lot of other approaches of course. >I once wrote an article in the "Swedish ICCA" (now SSDF) magazine >PLY, about this topic. That article was completely based on the >normal distribution and later became the basis for how SSDF computes >the ELO's and the corresponding confidence levels. >That was a long time ago and modern methods are more or less based on >computer power. >This time I had good help of my son David who opened my eyes for >computer simulations as a method to find the analytical conclusion. >Maybe more practical and with some interesting possibilities. >...and fascinating! > >Regards >//Peter
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.