Author: Peter Fendrich
Date: 07:27:37 01/02/03
On December 28, 2002 at 15:03:29, Rémi Coulom wrote:

>On December 27, 2002 at 09:41:17, Peter Fendrich wrote:
>>
>>What to do
>>----------
>>I have a few suggestions that I would like to discuss:
>>
>>1) Better utilisation of computer time. If I have time for 20 games it's better
>>to select 10 players and let A and B meet them respectively.
>>The meaning of better will be better.
>
>My personal use of the statistical test is to measure whether a change in my
>chess program is an improvement or not, in order to decide whether to keep it or
>not. Self-play is certainly not accurate in evaluating the difference in playing
>strength between two close versions of the same program. In particular, it tends
>to overamplify the effect of small differences. But that is its main interest:
>it acts as a magnifying glass to observe the effect of a small change in the
>program.

Yes, if you're using it between versions. I do the same, but only to tell whether
it's worthwhile to go on testing against other opponents. When testing against
other opponents we have a new situation. As you know, many posters claim all
sorts of things based on nothing more than a match between two players...

>I believe that, given a number of games to play, self-play is more
>likely to give statistically significant results than playing against a pool of
>opponents because of this amplification effect (this belief might be worth
>testing, by the way). Of course, if you obtain statistically significant results
>against 10 different players then it is certainly much more valuable.

I have the same belief, and I also ran some small tests to verify it a year ago.
I assume, however, that it depends on the program and the type of change.

>Also, note that if you use 10 opponents, you will have 10 games by A and 10
>games by B, whereas self-play would have produced 20 games for each player,
>which, I suppose, would make it easier to reach a better statistical
>significance.
>>
>>2) Use some degree of better, for instance 60% (instead of 50%) as the lower
>>limit. "A beats B with at least 60%" with a probability of x%. It's hard to tell
>>anything about probability against the rest of the population but maybe some a
>>priori distribution can be used.
>>
>>In both cases draws have to be counted because they are part of the question.
>>
>>Peter
>
>Yes, of course, that is a possibility. Unfortunately, the changes I usually make
>to my chess program are so small that proving >50% probability of win is the
>best I can hope, most of the time!

Well, anything above 50% will do. I'm convinced that leaving out the draws lowers
the quality of the conclusions, because information is lost.
One possibility is to turn the question around:
- What is the highest win-% that can be claimed at a fixed probability (like 95%)?
If it's low, we can't possibly know whether the change has any effect at all on
the population. I think it would even be possible to find out where the limits
are, in general or for a specific program, and to use that as a lower value.

Peter
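
To make the "turned around" question concrete, here is a minimal sketch of one way it could be computed. It is not the test discussed in this thread: it swaps in a plain normal approximation, scores a draw as half a point so that draws are not thrown away, and returns a one-sided lower bound on the expected score. The function name score_lower_bound and the default z = 1.645 (roughly a one-sided 95% level) are assumptions made for illustration only.

```python
import math

def score_lower_bound(wins, draws, losses, z=1.645):
    """One-sided lower confidence bound on the expected score per game.

    Hypothetical helper: a win counts 1, a draw 0.5, a loss 0.
    Uses a normal approximation, so it is only rough for small samples.
    """
    n = wins + draws + losses
    if n == 0:
        raise ValueError("need at least one game")
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score over the three possible outcomes.
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    return mean - z * math.sqrt(var / n)

# Example: +60 =20 -20 over 100 games.
bound = score_lower_bound(60, 20, 20)
print(f"score is at least {bound:.1%} at roughly 95% confidence")
```

If the bound comes out above the chosen lower limit (60%, say), the claim "A beats B with at least 60%" is supported at that level under these assumptions; if it comes out near or below 50%, the match says little about whether the change has any effect.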
Last modified: Thu, 15 Apr 21 08:11:13 -0700