Author: Drexel,Michael
Date: 11:39:12 09/01/05
On September 01, 2005 at 13:52:09, Peter Berger wrote:

>On August 31, 2005 at 08:53:56, Vasik Rajlich wrote:
>
>>On August 31, 2005 at 06:22:49, Peter Berger wrote:
>>
>>>On August 31, 2005 at 04:52:11, Vasik Rajlich wrote:
>>>
>>>>On August 30, 2005 at 12:27:52, Peter Berger wrote:
>>>>
>>>>>On August 30, 2005 at 12:21:20, Maurizio De Leo wrote:
>>>>>
>>>>>>
>>>>>>>Under valid and controlled conditions it still seems logical to me to stop a
>>>>>>>test after a 5-0 result and conclude that the winning program is probably the
>>>>>>>stronger one.
>>>>>>
>>>>>>>>I don't put much credence in any result of less than 30 games.
>>>>>>>>After 30 games, then you get a lot more plausibility.
>>>>>>
>>>>>>>You didn't give any reason for this, so I don't understand. A 6-0 says more
>>>>>>>about engine strength than the above match result with over 100000 games.
>>>>>>
>>>>>>Dann is right, I think.
>>>>>>The confidence interval calculation assumes that the score of a game is a
>>>>>>statistic variable with a mean value between 1 and -1 (function of the Elo
>>>>>>difference between the programs) and a standard deviation. Then if the
>>>>>>experiments are independent, the sum of the points will approximate the product
>>>>>>(mean*number of games) with a smaller standard deviation the more the games are.
>>>>>>With enough games the "confidence" will get to 95% when the performance
>>>>>>difference between the two programs is more than 3 standard deviations.
>>>>>>However this assumes a normal distribution. The assumption can be made for any
>>>>>>repeated statistical variable as long as the experiments are independent and
>>>>>>"enough". This "enough" is indeed expressed in most statistics books as 30.
>>>>>>
>>>>>>Maurizio
>>>>>
>>>>>Please have a look at "WhoisBest.zip" at Rémi Coulom's Home Page:
>>>>>http://remi.coulom.free.fr/. It includes a little paper Whoisbest.pdf on
>>>>>"Statistical Significance of a Match" , with a very straightforward mathematical
>>>>>proof that for example the number of draws is irrelevant to conclude who is
>>>>>better in a chessmatch .
>>>>>
>>>>>Peter
>>>>
>>>>It's not that simple, due to the nature of chess.
>>>>
>>>>In chess, a match result of 2-0 with 0 draws is less significant than a match
>>>>result of 2-0 with 8 draws.
>>>>
>>>>WhoIsBest makes the assumption that draws are independent events - that is, that
>>>>wins, losses and draws each come with some independent probability. In fact, in
>>>>a +2 -0 =8 result, the chance is that the side with the +2 was "stronger" in the
>>>>draws - ie. closer to winning. Chess has this phenomenon where the stronger side
>>>>tries to break through the draw barrier, and sometimes cannot.
>>>>
>>>>Of course to model this mathematically would be a huge mess.
>>>>
>>>>Vas
>>>
>>>No, that's a misunderstanding.
>>>
>>>The only assumption that is made is that the results get drawn independently
>>>from an unknown probability distribution.
>>>
>>>So it doesn't matter *at all* how drawish chess itself is e.g. . And the result
>>>will be the same whether the game is tic-tac-toe, checkers or chess.
>>>
>>>Unless you want to argue that there should be a distinction between drawn games,
>>>depending on how close one side got to winning. But that's a completely
>>>different topic.
>>>
>>>Peter
>>
>>Ok - consider the following scenario:
>>
>>Two players are playing basketball. The stronger player has some >50% chance to
>>score each basket. The game ends when one player scores 50 points. Once the game
>>is finished, a win by a margin of under 25 points is declared a draw, while a
>>win by >25 points is declared a win.
>>
>>The question is: in this case, is a 2-0 result with 8 draws more significant
>>than 2-0 with 0 draws?
>>
>>Vas
>
>No, it isn't more significant on answering the question who is the better
>player.
>
>Peter

The whole discussion is irrelevant anyway. The strength of a chess player
certainly cannot be determined by playing only one opponent. Suppose you play a
100000-game match between two engines and the winner scores 55%. You still
can't conclude that it is the stronger engine if we apply the common definition
of "strength".

Michael
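
For reference, the claim quoted above (that the number of draws does not affect who is judged better in a head-to-head match) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from WhoisBest.zip: it assumes game results are independent draws from a fixed win/draw/loss distribution with a uniform prior on that distribution, and the function name likelihood_of_superiority is made up for this example. Under those assumptions the posterior probability that the win probability exceeds the loss probability depends only on the win and loss counts. It says nothing about the objection that draws in chess may carry information, nor about strength against other opponents.

```python
from math import comb


def likelihood_of_superiority(wins: int, losses: int, draws: int = 0) -> float:
    """Posterior probability that P(win) > P(loss) for the first player.

    Assumptions (illustrative, not taken from the original post):
    independent games, fixed win/draw/loss probabilities, uniform prior.
    Under these assumptions the result is P(Beta(wins+1, losses+1) > 1/2),
    which for integer counts is a binomial tail sum.
    The `draws` argument is accepted only to show that it is unused.
    """
    if wins < 0 or losses < 0 or draws < 0:
        raise ValueError("counts must be non-negative")
    n = wins + losses + 1
    # P(Binomial(n, 1/2) <= wins), computed exactly with integers.
    favourable = sum(comb(n, k) for k in range(wins + 1))
    return favourable / 2 ** n


if __name__ == "__main__":
    for w, l, d in [(2, 0, 0), (2, 0, 8), (5, 0, 0), (6, 0, 0)]:
        print(f"+{w} -{l} ={d}: {likelihood_of_superiority(w, l, d):.4f}")
```

With this model, +2 -0 =0 and +2 -0 =8 give the same value, which is the point being argued over in the quoted messages. Whether the independence assumption is appropriate for chess is exactly what is disputed there.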
Last modified: Thu, 15 Apr 21 08:11:13 -0700