Author: Joseph Ciarrochi
Date: 02:53:37 02/04/06
Sounds good, Heinz. I think Vasik raises some good points and suggests that I should add a couple more notes to make sure people don't misuse the table. In addition to the other notes, I would add the following:

** The values in the table assume that you are testing a directional hypothesis, e.g., that engine A does better than engine B. If you have no idea which engine might be better, then your hypothesis is non-directional and you must double the alpha rate. This means that if you select the .05 criterion and you have a non-directional hypothesis, you are in fact using a .1 criterion, and if you choose the .01 criterion, you are in fact using a .02 criterion. I recommend using at least the .01 (1%) criterion in these instances, and preferably the .001 (.1%) criterion.

** Even if you get a significant result, the result may not generalize well to future tests. One important question is: to what extent are the openings you used in your test representative of the openings the engine would actually use when playing? I think there is no way you can get a representative sample of opening positions with only, say, ten openings. You probably need at least 50 different openings. If you are going to use a particular opening book with an engine, it would be ideal to sample a fair number of different openings from that book.

On February 04, 2006 at 04:58:46, Heinz van Kempen wrote:

>On February 03, 2006 at 19:26:43, Joseph Ciarrochi wrote:
>
>>Here is the stats table I promised Heinz and others who might be interested.
>>
>>Table: Percentage scores needed to conclude one engine is likely to be better
>>than the other in head-to-head competition
>>
>>                        Cut-off (alpha)
>>Number of games      5%       1%      .1%
>>10                   75       85      95
>>20                   67.5     75      80
>>30                   63.3     70      73.3
>>40                   62.5     66.3    71.3
>>50                   61       65      68
>>75                   58.6     61.3    66
>>100                  57       60      63
>>150                  55.7     58.3    60
>>200                  54.8     57      59.8
>>300                  54.2     55.8    57.5
>>500                  53.1     54.3    55.3
>>1000                 52.2     53.1    54.1
>>
>>Notes:
>>• Based on 10000 randomly chosen samples.
>>Thus, these values are approximate,
>>though with such a large sample, the values should be close to the “true” value.
>>• Alpha represents the percentage of the time that the score occurred by chance
>>(i.e., occurred even though we know the true value to be .50, or 50%). Alpha is
>>basically the probability of incorrectly saying two engines differ in
>>head-to-head competition.
>>• Traditionally, .05 alpha is used as a cut-off, but I think this is a bit too
>>lenient. I would recommend 1% or .1%, to be reasonably confident.
>>• Draw rate assumed to be .32 (based on CEGT 40/40 draw rates). Variations in
>>draw rate will slightly affect cut-off levels, but I don't think the difference
>>will be big.
>>• Engines assumed to play equal numbers of games as white and black.
>>• In cases where a particular score fell both above and below the cut-off, the
>>next score above the cut-off was chosen. This leads to conservative
>>estimates (e.g., for n of 10, a score of 7 occurred above and below the 5%
>>cut-off; therefore, 7.5 became the cut-off).
>>• Type 1 error = saying an engine is better in head-to-head competition when
>>there is actually no difference. The chance of making a type 1 error increases
>>with the number of comparisons you make. If you conduct C comparisons, the
>>probability of making at least one type 1 error = 1 - (1 - alpha)^C
>>(^ = raised to the power of C).
>>• It is critical that you choose your sample size ahead of time and do not
>>draw any conclusions until you have run the full tournament. It is incorrect,
>>statistically, to watch the running of the tournament, wait until an engine
>>reaches a cut-off, and then stop the tournament.
>
>Hi Joseph,
>
>thanks for your work and your interesting table. We will put it on CEGT website
>und ratings and comments.
>
>Keep up the good work
>Heinz
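
For anyone who wants to check the numbers, the cut-off table and the type 1 error formula in the notes above can be roughly reproduced with a short calculation. What follows is a minimal Python sketch under stated assumptions, not the procedure used to build the table: it computes the score distribution exactly instead of drawing 10000 random samples, so its cut-offs may differ slightly from the sampled values. The function names (familywise_error, score_distribution, cutoff_percent) and the directional flag are made up for illustration, colours are not modelled, and the only inputs taken from the notes are the .32 draw rate and the assumption that the two engines are equally strong.

```python
from typing import List, Optional


def familywise_error(alpha: float, comparisons: int) -> float:
    """Chance of at least one type 1 error: 1 - (1 - alpha)^C.

    Assumes the C comparisons are independent of each other.
    """
    return 1.0 - (1.0 - alpha) ** comparisons


def score_distribution(n_games: int, draw_rate: float = 0.32) -> List[float]:
    """Exact score distribution for two equally strong engines.

    Index k holds the probability that the total score equals k / 2
    points after n_games games (win = 1, draw = 1/2, loss = 0).
    """
    p_win = p_loss = (1.0 - draw_rate) / 2.0
    dist = [1.0]  # before any game, the score is 0 with certainty
    for _ in range(n_games):
        nxt = [0.0] * (len(dist) + 2)
        for half_points, prob in enumerate(dist):
            nxt[half_points] += prob * p_loss         # loss adds 0
            nxt[half_points + 1] += prob * draw_rate  # draw adds 1/2
            nxt[half_points + 2] += prob * p_win      # win adds 1
        dist = nxt
    return dist


def cutoff_percent(n_games: int, alpha: float, draw_rate: float = 0.32,
                   directional: bool = True) -> Optional[float]:
    """Smallest percentage score whose upper-tail probability is <= alpha.

    For a non-directional hypothesis, alpha is split over both tails,
    which is the "double the alpha rate" note seen from the other side.
    Returns None if even a perfect score is not rare enough.
    """
    dist = score_distribution(n_games, draw_rate)
    tail_alpha = alpha if directional else alpha / 2.0
    tail = 0.0
    cutoff = None
    # Walk down from the maximum score, accumulating P(score >= s).
    for half_points in range(len(dist) - 1, -1, -1):
        tail += dist[half_points]
        if tail > tail_alpha:
            break
        cutoff = half_points
    if cutoff is None:
        return None
    return 100.0 * cutoff / (2 * n_games)


if __name__ == "__main__":
    for n in (10, 20, 50, 100):
        print(n, [cutoff_percent(n, a) for a in (0.05, 0.01, 0.001)])
    print(familywise_error(0.05, 10))
```

For example, cutoff_percent(10, 0.05) gives 75.0, which matches the first entry of the table, and familywise_error(0.05, 10) is about 0.40, i.e. ten comparisons at the 5% level carry roughly a 40% chance of at least one false positive.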
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.