Author: Richard Pijl
Date: 09:01:49 01/23/04
>>What is estimated above using statistical methods is the difference in ELO
>>between Shredder 8 and Shredder 7.04. The difference is estimated to be +29,
>>where the confidence interval (95%) of the difference is +1 - +58. This means
>>that with the probability of 97.5% Shredder 8 is stronger by at least 1 ELO
>>point.
>>What do you not understand here?
>
>Richard,
>you got it the wrong way around. Look if you speak of INNER confidence marge
>there is nothing you can say while if you get a higher number for a difference
>THEN you have a significant difference in strength. Excuse me for being firm in
>what must be said. It's just stats. Nothing
>what I have calculated or invented. Nothing personal between the two of us I
>hope.
>
>Rolf
>

OK, let's approach this from a different angle.

The ELO system is based on rating differences. That means that when there is a certain rating difference between two progs, you can expect a certain outcome of a match between those progs. That expected outcome will have a confidence margin, of course. E.g. in a 100 game match program A is expected to score between 35 and 55 points.

The reverse is of course also true. From a match result you can calculate a certain ELO difference. This difference has a certain reliability: the confidence interval. This is quite easy to do for matches between 2 programs.

But when you have a tournament result, you have in fact many matches between pairs of programs. What programs like ELOstat try to do here is to calculate the most probable list of ELOs (given a certain 'starting ELO'), and give a confidence interval for each of the estimates. So the estimation of differences is now used to create an estimation of playing strength. The rating difference between two programs in this list is now based on a double estimation: the estimate of the position of the first program and the estimate of the position of the second program, each with its own confidence interval.
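To make the two directions concrete (rating difference to expected score, and match result back to a rating difference with an interval), here is a minimal Python sketch. It assumes the standard logistic Elo curve and a simple normal approximation for the match score. The function names are made up for illustration, and this is not necessarily how ELOstat or Munjong's calculation works.

```python
import math

def expected_score(elo_diff):
    """Expected score fraction for a side rated elo_diff points higher
    (standard logistic Elo curve)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_diff_from_score(score_fraction):
    """Inverse of expected_score; score_fraction must be strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score_fraction - 1.0)

def match_elo_interval(points, games, z=1.96):
    """Rough interval for the rating difference implied by one match.

    Assumption: every game is treated as a win/loss trial. Draws reduce the
    real variance, so this interval is wider than strictly necessary.
    z=1.96 corresponds to a two-sided 95% interval under a normal approximation.
    """
    p = points / games
    se = math.sqrt(p * (1.0 - p) / games)
    # Clamp so the logarithm stays defined for very lopsided results.
    low = min(max(p - z * se, 1e-9), 1.0 - 1e-9)
    high = min(max(p + z * se, 1e-9), 1.0 - 1e-9)
    p_mid = min(max(p, 1e-9), 1.0 - 1e-9)
    return elo_diff_from_score(low), elo_diff_from_score(p_mid), elo_diff_from_score(high)

if __name__ == "__main__":
    low, mid, high = match_elo_interval(points=54, games=100)
    print(f"estimated difference {mid:+.0f}, interval {low:+.0f} to {high:+.0f}")
```

The point is that the interval is drawn around the difference itself, straight from the match score, without going through two separate ratings first.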
Where our reasoning differs is that you treat the rating difference calculated by Munjong like a rating difference taken from a list as described above, where each program has its own confidence interval for its rating estimate. That is not correct here. The number that Munjong produced is an estimate of the difference directly. So you should draw the confidence interval around the difference. So from the match results you can conclude with 95% certainty that Shredder 8 is at least one rating point better.

Now about the use of two confidence intervals: if you want to compare two programs you cannot just add up the confidence intervals. That's quite easy to see if you consider what happens with the chance that the estimation is wrong.

Let's assume Program A is estimated at 2500, with a conf. interval (95%) of 2450-2550. Program B is estimated at 2600, with a conf. interval (95%) of 2550-2650 (exclusive bounds :-) ). The chance that either estimation is wrong is 5%, 2.5% at each side. But one estimation being wrong beyond the conf. interval still doesn't mean that Program A is suddenly better. E.g. Program A's 'real' strength could be 2555. But the chance is still quite big that program B is stronger than 2555. In fact, it is quite likely.

So if you want to compare two programs from a list while maintaining the same confidence level (95%), you should treat the two estimates as independent. Using the same example that would mean (50^2 + 50^2)^0.5 = 70.7, meaning that the rating difference between program A and B is 100 with a confidence interval (95%) of +-70.7 (roughly +29 to +171).

Hope that makes things clearer.

Richard.
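
The two-program example above can be checked with a few lines of Python. This is only a sketch under the assumptions stated in the post: the two rating estimates are treated as independent, and both intervals are symmetric and at the same confidence level. The function name is hypothetical.

```python
import math

def difference_interval(rating_a, half_width_a, rating_b, half_width_b):
    """Interval for (rating_b - rating_a) when the two estimates are
    assumed independent: half-widths combine as a root sum of squares,
    not as a plain sum."""
    diff = rating_b - rating_a
    half_width = math.hypot(half_width_a, half_width_b)
    return diff, diff - half_width, diff + half_width

if __name__ == "__main__":
    # Program A: 2500 +-50, Program B: 2600 +-50 (both 95%)
    diff, low, high = difference_interval(2500, 50, 2600, 50)
    print(f"{diff:+.0f} ({low:+.1f} to {high:+.1f})")  # +100 (+29.3 to +170.7)
```

Simply adding the half-widths would give +-100 and an interval that touches zero, which overstates the uncertainty of the difference.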