Computer Chess Club Archives


Subject: revised statistics table;Detecting differences in head to head competiti

Author: Joseph Ciarrochi

Date: 21:03:38 02/04/06


In light of the interest in the tables, I decided to redo all analyses,
increasing the number of samples from 10000 to 50000 and thereby increasing the
precision of the estimates. I have inserted the table and notes at the bottom of
this email (there are only tiny differences between the tables based on 10000
versus 50000 samples).
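
As a rough check on how much precision the extra samples buy, here is a minimal Python sketch. It assumes each simulated tournament either exceeds a cut-off or does not, so the estimated tail proportion has the usual binomial standard error sqrt(alpha * (1 - alpha) / N). The function name tail_estimate_se is made up for illustration and is not part of the original analysis.

```python
import math

def tail_estimate_se(alpha, n_samples):
    """Standard error of a simulated tail proportion whose true value is alpha.

    Assumption: samples are independent, so the count of samples beyond the
    cut-off is binomial with parameters (n_samples, alpha).
    """
    return math.sqrt(alpha * (1.0 - alpha) / n_samples)

if __name__ == "__main__":
    for n_samples in (10_000, 50_000):
        for alpha in (0.05, 0.01, 0.001):
            se = tail_estimate_se(alpha, n_samples)
            print(f"samples={n_samples:6d}  alpha={alpha:<6}  se={se:.5f}")
```

Going from 10000 to 50000 samples shrinks the standard error by a factor of sqrt(5), a bit more than 2, which fits with the tables changing only slightly.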


The table is based on a draw rate of .32, which is what top engines and humans
tend to get in slower games. However, the draw rate between more average humans
playing blitz, and between some engines playing blitz, tends to be about .12 (based
on Internet Chess Club statistics). I recalculated some values using this lower
draw rate and obtained the following percentage scores:


                   Cut-off (alpha)
Number of games    5%      1%      .1%
10                 80      90      100
50                 62      66      71
100                58      61.5    65


Comparing this to the table below, the critical values tend to be higher when
you have a lower draw rate, especially for smaller numbers of games. A lower draw
rate means greater variability in scores and therefore a greater occurrence of
extreme scores.
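
The link between draw rate and variability can be made concrete with a small Python sketch. It assumes two equal engines, independent games, and wins and losses being equally likely, so each game scores 1, 0.5 or 0 with probabilities (1-d)/2, d and (1-d)/2, where d is the draw rate. Under those assumptions the per-game variance works out to (1-d)/4. The function name score_sd is hypothetical.

```python
import math

def score_sd(n_games, draw_rate):
    """Standard deviation of the percentage score (0-100) for two equal engines.

    Assumptions: independent games, win and loss equally likely, so the
    per-game variance of the score is (1 - draw_rate) / 4.
    """
    per_game_variance = (1.0 - draw_rate) / 4.0
    return 100.0 * math.sqrt(per_game_variance / n_games)

if __name__ == "__main__":
    for draw_rate in (0.32, 0.12):
        for n_games in (10, 50, 100):
            sd = score_sd(n_games, draw_rate)
            print(f"draw_rate={draw_rate}  games={n_games:3d}  sd={sd:.2f} points")
```

For 100 games this gives a spread of roughly 4.1 percentage points at a draw rate of .32 and roughly 4.7 at .12, so the same cut-off in standard deviations sits further from 50% when draws are rarer.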






________________________redone tables___________________________________


Percentage Scores needed to conclude one engine is likely to be better than the
other

                   Cut-off (alpha)
Number of games    5%      1%      .1%
10                 75      85      95
20                 67.5    72.5    80
30                 63.3    68.3    75
40                 62.5    66.3    71.3
50                 60      65      69
75                 58.6    61.3    65.3
100                57.5    60      63.5
150                56      58      60.7
200                55      57      59.3
300                54      55.7    57.5
500                53.1    54.4    55.8
1000               52.2    53.1    54.1
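
For anyone who wants to try something similar, here is a minimal Python sketch of one way a table like this could be simulated. It is not the code used for the table above. It assumes two equal engines, independent games, a draw rate of .32 with the remaining probability split evenly between wins and losses, and it takes the cut-off to be the smallest score that was reached or exceeded in no more than alpha of the simulated tournaments. The function name simulate_cutoff, the seed and the default sample count are illustrative choices.

```python
import random

def simulate_cutoff(n_games, alpha, draw_rate=0.32, n_samples=50_000, seed=1):
    """Estimate the percentage score needed at a given one-sided alpha.

    Scores are tracked in half-points (win = 2, draw = 1, loss = 0) so that
    all bookkeeping uses integers.
    """
    rng = random.Random(seed)
    win_prob = (1.0 - draw_rate) / 2.0  # assumption: wins and losses equally likely
    counts = [0] * (2 * n_games + 1)    # counts[s] = tournaments ending on s half-points

    for _ in range(n_samples):
        half_points = 0
        for _ in range(n_games):
            r = rng.random()
            if r < win_prob:
                half_points += 2
            elif r < win_prob + draw_rate:
                half_points += 1
        counts[half_points] += 1

    # Walk down from a perfect score, accumulating the upper tail, and keep
    # the lowest score whose tail frequency is still at or below alpha.
    tail = 0
    cutoff = 2 * n_games + 1  # sentinel: not even a perfect score qualifies
    for s in range(2 * n_games, -1, -1):
        tail += counts[s]
        if tail / n_samples > alpha:
            break
        cutoff = s
    return 100.0 * cutoff / (2 * n_games)

if __name__ == "__main__":
    for n_games in (10, 20, 50):
        row = [simulate_cutoff(n_games, a) for a in (0.05, 0.01, 0.001)]
        print(n_games, row)
```

Because the results are simulated, values near a boundary can shift by one half-point from run to run or seed to seed, especially in the .1% column where only a few dozen samples fall in the tail.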



•  Based on 50000 randomly chosen samples. Thus, these values are approximate,
though with such a large sample, the values should be close to the “true” value.
•  Alpha represents the percentage of the time that a score at or above the
cut-off occurred by chance (i.e., occurred even though we know the true value to
be .50, or 50%). Alpha is basically the probability of incorrectly saying two
engines differ in head to head competition.
•  Traditionally, an alpha of .05 is used as a cut-off, but I think this is a bit
too lenient. I would recommend 1% or .1% to be reasonably confident.
•  Draw rate assumed to be .32 (based on CEGT 40/40 draw rates). Variations in
draw rate will slightly affect cut-off levels, but I don't think the difference
will be big.
•  Engines assumed to play equal numbers of games as white and black
•  In cases where a particular score fell both above and below the cutoff, the
next score above the cutoff was chosen. This leads to conservative estimates
(e.g., for n of 10, a score of 7 occurred above and below the 5% cutoff.
Therefore, 7.5 became the cut-off).
•  Type 1 error = saying an engine is better in head to head competition, when
there is actually no difference. The chance of making a type 1 error increases
with the number of comparisons you make. If you conduct C independent
comparisons, the probability of making at least one type 1 error =
1 - (1-alpha)^C (^ = raised to the power of).
•  It is critical that you choose your sample size ahead of time, and do not
draw any conclusions until you have run the full tournament. It is incorrect,
statistically, to watch the running of the tournament, wait until an engine
reaches a cut-off, and then stop the tournament.
•  The values in the table assume that you are testing a directional hypothesis,
e.g., that engine A does better than B. If you have no idea which engine
might be better, then your hypothesis is non-directional and you must double the
alpha rate. This means that if you select the .05 criterion, and you have a
non-directional hypothesis, you are in fact using a .1 criterion, and if you
choose the .01 criterion, you are using a .02 criterion. I recommend using at
least the .01 criterion in these instances, and preferably the .1% (.001)
criterion.
•  Even if you get a significant result, the result may not generalize well to
future tests. One important question is: to what extent are the openings you
used in your test representative of the openings the engine would actually use
when playing. I think there is no way you can get a representative sample of
opening positions with only, say, ten openings. You probably need at least 50
different openings. If you are going to use a particular opening book with an
engine, it would be ideal to sample a fair number of different openings from
this opening book.
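
To put some numbers on the type 1 error note above, here is a minimal Python sketch of the formula 1 - (1-alpha)^C. It assumes the C comparisons are independent of one another, which will not hold exactly if the same games feed several comparisons. The function name familywise_error is just an illustrative label.

```python
def familywise_error(alpha, comparisons):
    """Probability of at least one type 1 error across several comparisons.

    Assumption: the comparisons are independent, each tested at the same alpha.
    """
    return 1.0 - (1.0 - alpha) ** comparisons

if __name__ == "__main__":
    for alpha in (0.05, 0.01, 0.001):
        for comparisons in (1, 5, 10, 20):
            p = familywise_error(alpha, comparisons)
            print(f"alpha={alpha:<6} comparisons={comparisons:2d}  "
                  f"P(at least one false positive)={p:.3f}")
```

With ten comparisons, the 5% cut-off gives about a 40% chance of at least one false positive, while the 1% cut-off keeps it under 10%. That is part of why the stricter cut-offs are worth the extra games.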




Last modified: Thu, 15 Apr 21 08:11:13 -0700
