Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Match results - a complete(!) theory (long)

Author: Ricardo Gibert

Date: 21:29:42 02/18/01

Go up one level in this thread


On February 18, 2001 at 16:14:34, Peter Fendrich wrote:

>
>In another message Bruce Moreland raised the question
>about how to interpret a match result.
>With a match result between two players, he wanted to
>express things like:
>Player1 is better than Player2 with a certain significance
>level.

You can't do this with a *match* between the two players. It is possible that
player A can consistently beat Player B in a match, but player B consistently
has the higher rating as acquired against the general population of players.

An example of how this would effect engine vs engine matches would be as
follows: Program A has an opening book that is optimized to beat program B. Not
surprisingly, program A edges out program B in any match played between the two,
but if program B performs better than program A against other programs, then
clearly program B is the "stronger" program even though program A is able to
"dominate" B.

When you measure the performance of player A against player B, you are measuring
the ability of player A to "dominate" player B. You are *not* really measuring
the playing strength of A. You do that by measuring its performance against a
*population* of players.

>The following algorithms will try to give a general method
>to deal with and evaluate match results. The Bruce example
>will show up in the end of this message.
>
>This is not found in the standard statistic course and I
>haven't seen any description of this so I wanted to give it
>a try because it should be rather straight forward. I have
>to admit however that it gave me some headache before the
>obvious solution appeared.
>
>My intention is to create a program based on these ideas,
>maybe adjusted after your comments, and place that program
>here or on any other well known site so we don't have to
>struggle all this over and over again whenever some new
>results drop in. :)
>
>If you don't want to read all this "crap" you can instead
>get my little program and test it. In the end of this message
>there is some more info.
>Just send me an email and I'll return the source code and an -exe file.
>
>Be patient because the program isn't written yet but very soon!!!
>
>Some basic definitions and formulas
>===================================
>The real problem here is when we have small number of games.
>When we have a lot of games he normal distribution could be used.
>We will use the Trinomial distribution which is the only one
>giving the exact probabilities when the outcome has three distinct
>outcomes.
>
>N  = number of games
>W  = number of wins         (for Player1)
>D  = number of draws
>L  = number of losses       (for Player1)
>Pw = probability of a win   (for Player1)
>Pd = probability of a draw
>Pl = probability of a loss  (for Player1)
>hence N = W+D+L and Pw+Pd+Pl = 1.0
>
>P    = probability of a certain match result with exactly W, L and D
>cumP = probability of a certain match result or better (for Player1)
>       This function is often used but we will avoid it here.
>With the help of the Trinomial distribution we can get the P if we
>know the values of both Pw and Pd:
>P(N,W,D,Pw,Pd)  = (N! / W!D!L!) Pw^W * Pd^D * Pl^L  where L=N-W-D
>
>One problem, of course, is that we don't know the real values of
>Pw, Pd and Pl.
>
>The Question to answer
>======================
>Given a match result W, D and L between Player1 and Player2.
>
>H0: Player1 better than Player2 or in other words
>    Pw+Pd/2 > 0.5
>P(H0) is the probability that H0 is true. (I don't care about rejecting
>H0 because it's not necessary)
>
>Let's start with the assumption that we don't have any prior knowledge
>of the distribution of Pw and Pd. That means that they are evenly
>distributed within the area formed by Pw and Pd each as an axis between
>0 to 1 and where Pw+Pd <= 1.
>
>As said before the probability that a certain game result will occur
>given that Pw and Pd is computed from the Trinomial function.
>
>Now let's look at the analytical solution.
>
>First compute the integral of that Trinomial function (with W,D and
>L fixed) over both Pw and Pd. Pw is in interval 0 to 1 and Pd is in
>the interval 0 to 1-Pw for each Pw. We now have a double integral that
>is the complete space of all probabilities of all possible combinations
>of Pw and Pd for a specific match result.
>Let's call this result Pcomplete.
>
>Secondly let's take the same integral for all Pw+Pd/2 > 0.5. It means
>that Pw is in the interval 0 to 1 and Pd from max(0, 1-2Pw) to (1-Pw)
>for each Pw. This integral is the space of all probabilities of all
>possible combinations of Pw and Pd when Player1 is the best one
>(Pw+Pd/2 > 0.5).
>Let's call this result Pbetter.
>
>The final conclusion
>====================
>The probability that Player1 is better than Player2 is Pbetter/Pcomplete.
>
>A numeric method
>================
>Maybe the theory above is hard to grasp but the following is a computer
>simulation that should give the same result in the long run.
>
>1. We have a result, let's say 7-3 with W=6, D=2 and L=2.
>   Declare 2 counters. One representing A better than B and the
>   other B is better than or equal to A.
>2. Pick a pair of Pw, Pd randomly so that 0 <= Pw+Pd <= 1.
>3. Compute P = Trinomial(10, 6, 2, Pw, Pd)
>4. Play a match with 10 games where each game has a random outcome
>   weighted by P.
>5. If (the match result is 7-3 with W=6, D=2 and L=2)
>       If (Pw+Pd/2 > 0.5) increment counter1
>       else increment counter2
>   else do nothing
>6. Loop 2-5 MANY MANY times.
>7. Compute counter1/(counter1+counter2)
>
>This will get us the probability for the match result with no prior
>knowledge of Pw and Pd. If we know anything at all about Pw or Pd we
>could let the random picking under 2 reflect this.
>In order to make this code a bit more effective we could change 4 and 5
>to
>4. If (Pw+Pd/2 > 0.5) Add P to counter1 (must now be a double)
>   else Add P to counter2 (also a double)
>5. No operation
>We will get the same result but faster.
>In order to reflect any prior knowledge we could use weights in the
>integrals above to capture that knowledge.
>In fact the last algorithm is equal to how the integral above would
>be computed with the Monte Carlo method. The alg given here is a perfect
>simulation of the integral that was showed to be the solution.
>
>My little program
>=================
>If we could agree of the above theory then it is best documented
>in a program.
>Given a match result the pgm will produce:
>  - What is the probability to get such a result depending. on the
>    strength difference between Player1 and 2
>  - What is P(H0) meaning the probability that Player1 is better than Player2.
>
>When it works I would like to extend the pgm with some other functions.
>Send me an email and I will return the program to anyone interested.
>
>A few other comments
>====================
>I don't rely on figures like this, only!
>The statistics used here is based on the match outcome only. There is
>a lot of other information that can't covered by the bare match result
>numbers. By looking at the individual games we will get a lot of other
>information that is hard to include in statistic methods. Patterns of
>all kinds, like Player2 is broken and always loses in 4 moves or whatever,
>might be more valuable than the result 10-0 only.
>
>Furthermore I'm not that found of this A better than B hypothesis. I am
>more found of other methods like confidence intervals. Maybe we could
>include that as well in the program later on.
>
>For matches counting many games, like 100 or more, we could use the Normal
>distribution instead of the described number crunching but exact method.
>It should give about the same results. The Normal distribution offers a
>variety of useful functions and is the basis for the ELO system.
>
>When dealing with match results and not knowing the number of
>draws we could set D to either some common sense value of the number
>of draws or the worst case which is as few draws as possible that still
>follows the match result.
>The result 60-40 with W=60, D=0, L=40 is more "spread out" and shows
>more variance, meaning less accurate, than with W=20, D=80, L=0. See the next
>topic.
>
>The "Bruce example"
>===================
>We have two results 10-0 and 60-40 respectively. Which one is the most
>'valid'? Let's translate 'is the most valid' with 'has the highest
>probability that A is better than B'. We will see that it dependes of the
>number of draws in the 60-40 match.
>The following figures are computed by my program and are subject to errors
>that I haven't tried to estimate yet. At least 4 decimals should be ok.
>
>1) 10 games between A and B. Result 10-0
>
>Let's try some different values of Pw and Pd
>
>10 games. 10-0-0, Result: 10-0
>A few combinations of Pw and Pd:
>P(Win)  P(Draw) P(res)  	cumP
>------  ------- -----------     ---------
>0.33    0.33    0.000016935     0.0000169
>0.40    0.20    0.000104858     0.0001049
>0.50    0.00    0.000966841     0.0009668
>0.37    0.26    0.000048086     0.0000481
>0.33    0.35    0.000013147     0.0000131
>0.01    0.99    0.000000000     0.0000000
>0.80    0.10    0.107374182     0.1073742
>
>Probability for A better than B is: 0.9995087
>
>2) 100 games between A and B. Result: 60-40
>   (Nothing is said about number of draws. Let's try two cases)
>
> a) No draws in the match
>    100 games. 60-40-0, Result: 60-40
>    A few combinations of Pw and Pd:
>    P(Win)  P(Draw) P(res)  	    cumP
>    ------  ------- --------------- ----------------
>    0.33    0.33    2.66726982e-020 0.008335829
>    0.40    0.20    2.20893466e-012 0.01447611
>    0.50    0.00    0.00981144545   0.02784797
>    0.37    0.26    9.08565295e-016 0.01155849
>    0.33    0.35    2.12089826e-021 0.007664598
>    0.01    0.99    1.08438667e-202 2.350298e-026
>    0.80    0.10    2.10660425e-018 1.000000000
>
>    Probability for A better than B is: 0.9759706
>
> b) Maximum number of draws (80)
>    100 games. 20-0-80, Result: 60-40
>    A few combinations of Pw and Pd:
>    P(Win)  P(Draw) P(res)          cumP
>    ------  ------- --------------- -----------------
>    0.33    0.33    1.03987806e-027 0.008335829
>    0.40    0.20    7.12444101e-044 0.01447611
>    0.50    0.00    5.01027042e-226 0.02784797
>    0.37    0.26    1.95458083e-035 0.01155849
>    0.33    0.35    3.10643366e-026 0.007664598
>    0.01    0.99    2.2875309e-026  2.350298e-026
>    0.80    0.10    6.17946754e-062 1.000000000
>
>    Probability for A better than B is: 0.9999995
>
>...and in the end
>=================
>Thanks for reading all this. I hope there are no typos or logical
>errors, but no warranties at all are given!
>
>The word "complete" in the subject was meant to provoke a little.
>There are a lot of other approaches of course.
>I once wrote an article in the "Swedish ICCA" (now SSDF) magazine
>PLY, about this topic. That article was completely based on the
>normal distribution and later became the basis for how SSDF computes
>the ELO's and the corresponding confidence levels.
>That was a long time ago and modern methods are more or less based on
>computer power.
>This time I had good help of my son David who opened my eyes for
>computer simulations as a method to find the analytical conclusion.
>Maybe more practical and with some interesting possibilities.
>...and fascinating!
>
>Regards
>//Peter



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.