Computer Chess Club Archives



Subject: Re: revised statistics table; and the Stats don't lie.

Author: chandler yergin

Date: 01:54:46 02/06/06



On February 05, 2006 at 00:03:38, Joseph Ciarrochi wrote:

>In light of the interest in the tables, I decided to redo all analyses,
>increasing the number of samples from 10000 to 50000 and thereby increasing the
>precision of the estimates. I have inserted the table and notes at the bottom of
>this email (there are only tiny differences between the tables based on 10000
>versus 50000 samples).
>
>
>The table is based on a draw rate of .32, which is what top engines and humans
>tend to get in slower games. However, the draw rate between more average humans
>playing blitz, and with some engines playing blitz, tends to be about .12 (based
>on Internet Chess Club statistics). I recalculated some values using this lower
>draw rate and obtained the following:
>
>
>                  Cut-off (alpha)
>Number of games    5%      1%      .1%
>10                  80      90      100
>50                  62      66      71
>100                 58      61.5    65
>
>
>Comparing this to the table below, the critical values tend to be higher when
>you have a lower draw rate, especially for smaller numbers of games. A lower
>draw rate means greater variability in scores and therefore a greater
>occurrence of extreme scores.
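For anyone who wants to reproduce numbers like these, here is a minimal Python
sketch of this kind of resampling. It is only an illustration: the post does not
show the actual procedure used, so the function below and its defaults are
assumptions. Under the null hypothesis that both engines are equal, each game is
a win or a loss with probability (1 - draw_rate) / 2 each.

import random

def critical_score(n_games, draw_rate, alpha, n_samples=50000):
    """Approximate score (in %) reached by chance only about alpha of the time."""
    p_win = (1.0 - draw_rate) / 2.0
    percentages = []
    for _ in range(n_samples):
        score = 0.0
        for _ in range(n_games):
            r = random.random()
            if r < p_win:
                score += 1.0            # win
            elif r < p_win + draw_rate:
                score += 0.5            # draw
        percentages.append(100.0 * score / n_games)
    percentages.sort()
    # Empirical (1 - alpha) quantile. The published tables round ties upward,
    # so they are slightly more conservative than this raw quantile.
    return percentages[int((1.0 - alpha) * n_samples)]

# Roughly 58 for 100 games at a .12 draw rate and 5% alpha,
# which matches the blitz table above.
print(critical_score(100, draw_rate=0.12, alpha=0.05))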
>
>
>
>
>
>
>________________________redone tables___________________________________
>
>
>Percentage Scores needed to conclude one engine is likely to be better than the
>other
>
>                  Cut-off (alpha)
>Number of games    5%      1%      .1%
>10                  75      85      95
>20                  67.5    72.5    80
>30                  63.3    68.3    75
>40                  62.5    66.3    71.3
>50                  60      65      69
>75                  58.6    61.3    65.3
>100                 57.5    60      63.5
>150                 56      58      60.7
>200                 55      57      59.3
>300                 54      55.7    57.5
>500                 53.1    54.4    55.8
>1000                52.2    53.1    54.1
>
>
>
>•  Based on 50000 randomly chosen samples. Thus, these values are approximate,
>though with such a large sample, the values should be close to the “true” value.
>•  Alpha represents the percentage of time that a score at least that high
>occurs purely by chance (i.e., occurs even though we know the true value to be
>.50, or 50%). Alpha is basically the odds of incorrectly saying two engines
>differ in head-to-head competition.
>•  Traditionally, .05 alpha is used as a cut-off, but I think this is a bit too
>lenient. I would recommend 1% or .1%, to be reasonably confident.
>•  Draw rate assumed to be .32 (based on CEGT 40/40 draw rates). Variations in
>draw rate will slightly affect cut-off levels, but I don't think the difference
>will be big.
>•  Engines assumed to play equal numbers of games as white and black
>•  In cases where a particular score fell both above and below the cutoff, the
>next score above the cutoff was chosen. This leads to conservative estimates
>(e.g., for n of 10, a score of 7 occurred both above and below the 5% cutoff;
>therefore, 7.5 became the cut-off).
>•  Type 1 error = saying an engine is better in head to head competition, when
>there is actually no difference. The chance of making a type 1 error increases
>with the number of comparisons you make. If you conduct C comparisons, the odds
>of making at least one type 1 error = 1 – (1-alpha)^C. (^ = raised to the power
>of C).
>•  It is critical that you choose your sample size ahead of time, and do not
>make any conclusions until you have run the full tournament. It is incorrect,
>statistically, to watch the running of the tournament, wait until an engine
>reaches a cut-off, and then stop the tournament.
>•  The values in the Table assume that you are testing a directional hypothesis,
>e.g., that engine A does better than B. If you have no idea of which engine
>might be better, then your hypothesis is non-directional and you must double the
>alpha rate. This means that if you select the .05 criterion and you have a
>non-directional hypothesis, you are in fact using a .1 criterion, and if you
>choose the .01 criterion, you are using a .02 criterion. I recommend using at
>least the .01 criterion in these instances, and preferably the .1% (.001)
>criterion.
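A small worked example of the Type 1 error formula in the notes above, with the
chance of at least one false positive being 1 - (1-alpha)^C over C comparisons.
The comparison counts below are arbitrary, chosen only to show how quickly the
error rate grows.

alpha = 0.05
for c in (1, 5, 10, 20):
    # probability of at least one Type 1 error across c comparisons
    print(c, round(1 - (1 - alpha) ** c, 2))
# prints 0.05, 0.23, 0.4 and 0.64 respectively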

>•  Even if you get a significant result, the result may not generalize well to
>future tests. One important question is: to what extent are the openings you
>used in your test representative of the openings the engine would actually use
>when playing? I think there is no way you can get a representative sample of
>opening positions with only, say, ten openings. You probably need at least 50
>different openings. If you are going to use a particular opening book with an
>engine, it would be ideal to sample a fair number of different openings from
>this opening book.
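One way to draw such a sample of openings is shown below; the list is only a
placeholder, since no particular opening book is named above.

import random

# Placeholder list: in practice, use the lines your own opening book contains.
book_openings = [
    "B90 Sicilian, Najdorf",
    "C42 Petroff Defence",
    "D37 Queen's Gambit Declined",
    "E60 King's Indian Defence",
]
# Draw up to 50 distinct openings, as recommended above.
test_openings = random.sample(book_openings, k=min(50, len(book_openings)))
print(test_openings)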

 Very Perceptive. Thanks for Posting this; it helps confirm what I have
been saying, in awkward ways no doubt, but there IS a Direct Correlation
between Wins, Losses, and Draws and ECO Classification.
This is true for Humans vs Humans, Humans vs Comps, Comps vs Comps.
All of the Testing, which is a small fraction of total games played since
the recording of this History, confirms it. You are right on point.
A point that everyone wants to avoid. For obvious reasons... ;)
It is irrefutable. I think Steinitz said it best:
"Chess is a scientific game and its literature ought to be placed on the basis
of the strictest truthfulness, which is the foundation of all scientific
research." — W. Steinitz

In practical terms we have reached the limit of Chess Theory.
White has the significant advantage of the 1st move.
We know what Lines, i.e. ECO Classifications, are best for both sides,
and which ECO Classifications are Drawish.




