Computer Chess Club Archives


Subject: Re: Some stats...

Author: Richard Pijl

Date: 09:01:49 01/23/04


>>
>>What is estimated above using statistical methods is the difference in ELO
>>between Shredder 8 and Shredder 7.04. The difference is estimated to be +29,
>>where the confidence interval (95%) of the difference is +1 - +58. This means
>>that with the probability of 97.5% Shredder 8 is stronger by at least 1 ELO
>>point.
>>What do you not understand here?
>
>Richard,
>you got it the wrong way around. Look if you speak of INNER confidence marge
>there is nothing you can say while if you get a higher number for a difference
>THEN you have a significant difference in strength. Excuse me for being firm in
>what must be said. It's just stats. Nothing
>what I have calculated or invented. Nothing personal between the two of us I
>hope.
>
>Rolf
>

OK, let's approach this from a different angle.

The ELO system is based on rating differences. That means that when there is a
certain rating difference between two programs, you can expect a certain outcome
of a match between them. That expected outcome will have a confidence margin, of
course. E.g. in a 100-game match, program A might be expected to score between
35 and 55 points.
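To make the "expected outcome with a margin" idea concrete, here is a minimal sketch. It assumes the usual logistic ELO curve and, as a simplification, treats every game as a plain win or loss (no draws) with a normal approximation; the function names are made up for illustration. Under those assumptions a deficit of about 35 points gives roughly the 35-55 range of the example.

```python
import math

def expected_score(elo_diff):
    """Expected score fraction for a given rating difference.

    Assumes the common logistic ELO curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def score_range(elo_diff, games, z=1.96):
    """Approximate 95% range of match points over `games` games.

    Simplification: every game is a win or a loss (no draws), so the
    point total is binomial; a normal approximation gives the margin.
    """
    p = expected_score(elo_diff)
    mean = games * p
    sd = math.sqrt(games * p * (1.0 - p))
    return mean - z * sd, mean + z * sd

# Hypothetical case: program A is rated 35 points below its opponent.
low, high = score_range(-35, 100)
print(f"expected between {low:.1f} and {high:.1f} points")
```

Draws make the real spread smaller than this, so the range here is on the wide side.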

The reverse is of course also true. From a match result you can calculate a
certain ELO difference. This difference has a certain reliability: the
confidence interval.
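The reverse direction can be sketched the same way. This is only an illustration and not necessarily what ELOstat or Munjong did: it assumes the logistic ELO curve, estimates the standard error of the score from the win/draw/loss counts, and maps the ends of the score interval back to rating differences. The match result used is invented.

```python
import math

def elo_from_score(p):
    """Rating difference implied by a score fraction p (0 < p < 1)."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_diff_interval(wins, draws, losses, z=1.96):
    """Estimated ELO difference and an approximate 95% interval.

    Assumes independent games and a normal approximation for the
    mean score; the interval ends must stay strictly inside (0, 1).
    """
    n = wins + draws + losses
    p = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0 points).
    var = (wins * (1.0 - p) ** 2
           + draws * (0.5 - p) ** 2
           + losses * (0.0 - p) ** 2) / n
    se = math.sqrt(var / n)
    lo, hi = p - z * se, p + z * se
    if lo <= 0.0 or hi >= 1.0:
        raise ValueError("result too lopsided for this approximation")
    return elo_from_score(p), elo_from_score(lo), elo_from_score(hi)

# Invented result: +40 =20 -40 over 100 games.
est, low, high = elo_diff_interval(40, 20, 40)
print(f"difference {est:+.0f}, interval {low:+.0f} to {high:+.0f}")
```

Note that the interval is drawn around the difference itself, not around two separate ratings.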

This is quite easy to do for a match between two programs. But a tournament
result is in fact made up of many matches between pairs of programs. What
programs like ELOstat try to do here is to calculate the most probable list of
ELO ratings (given a certain 'starting ELO'), and give a confidence interval for
each of the estimations. So the estimation of differences is now used to create
an estimation of playing strength.
The rating difference between two programs in this list is now based on a double
estimation: the estimation of the position of one program, and the estimation of
the position of the second program, each with its own confidence interval.

Where our reasoning differs is that you treat the rating difference calculated
by Munjong like a rating difference taken from a list as described above,
where each program has its own confidence interval for the rating estimation.
That is not correct here. The number that Munjong produced is an estimation of
the difference directly. So you should draw the confidence interval around the
difference. So from the match results you can conclude with 95% certainty that
the difference lies between +1 and +58, i.e. that Shredder 8 is at least one
rating point better.

Now about the use of two confidence intervals:
If you want to compare two programs you cannot just add up the confidence
intervals. That's quite easy to see if you consider what happens to the chance
that the estimation is wrong.
Let's assume Program A is estimated at 2500, with a conf. interval (95%) of
2450-2550. Program B is estimated at 2600, with a conf. interval (95%) of
2550-2650 (exclusive bounds :-) ).
For each estimation, the chance that it is wrong is 5%, 2.5% on each side.
But one estimation being wrong beyond the conf. interval still doesn't mean that
Program A is suddenly better. E.g. Program A's 'real' strength could be 2555.
But the chance is still quite big that Program B is stronger than 2555. In fact,
it is quite likely.
So if you want to compare two programs from a list while maintaining the same
confidence level (95%), you should treat the two estimation errors as
independent. Using the same example, that would mean (50^2 + 50^2)^0.5 = 70.7,
meaning that the rating difference between Program A and B is 100 with a
confidence interval (95%) of +-70.7 (about +29 to +171).
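The arithmetic of that example can be checked with a short sketch. It assumes the two rating errors are independent and roughly normal, so that a 95% half-width corresponds to 1.96 standard deviations; the function names are invented for illustration.

```python
import math

def difference_interval(rating_a, half_a, rating_b, half_b):
    """Interval for (B - A) from two independent rating estimates.

    The half-widths are combined in quadrature, not simply added.
    """
    diff = rating_b - rating_a
    half = math.sqrt(half_a ** 2 + half_b ** 2)
    return diff, half, diff - half, diff + half

def prob_above(estimate, half_width, threshold, z=1.96):
    """Chance that the true rating exceeds `threshold`.

    Assumes a normal error whose 95% half-width is `half_width`.
    """
    sd = half_width / z
    x = (estimate - threshold) / (sd * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(x))

diff, half, low, high = difference_interval(2500, 50, 2600, 50)
print(f"difference {diff:+d} +-{half:.1f} ({low:+.1f} to {high:+.1f})")

# Even if Program A's real strength were 2555, Program B (estimated
# at 2600 +-50) would still very probably be stronger than that.
print(f"P(B > 2555) = {prob_above(2600, 50, 2555):.3f}")
```

Simply adding the two half-widths would give +-100, which is wider than the 95% level calls for.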

Hope that makes things clearer.
Richard.



Last modified: Thu, 15 Apr 21 08:11:13 -0700
