Computer Chess Club Archives


Subject: Re: table for detecting significant difference between two engines

Author: Joseph Ciarrochi

Date: 02:13:20 02/04/06



>Thanks for the tables, however I don't think that they are appropriate for the
>most common test scenario:
>
>1) You have some existing version of the program, let's call it Beta x.
>2) You spend a week or so making changes, leading to a new version, let's call
>it Beta (x+1). You hope that these changes give you 10 or 20 rating points, and
>you pretty much know that they won't give you more than that.
>3) You want to make sure that Beta (x+1) is not actually weaker than Beta x.
>
>In this case, if Beta (x+1) scores 7.5/10 against Beta x, you cannot be 95% sure
>that it is not weaker.



Hi Vasik, always good to read your emails.

Yes, Vasik, I agree that a 5% cut-off is not sufficiently conservative and will
artificially increase the number of false positives. I reckon you should use the
.1% cut-off. I agree with your conclusions.

The problem isn't really with the table. The table correctly represents the
odds of a particular result happening by chance, if you take *one* sample of
N games.
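
In case anyone wants to re-derive the table for a different draw rate, here is a
rough Python sketch along the lines of what the table is based on (10000 random
samples, draw rate .32, two equal engines). It is only a sketch, not the exact
program behind the table, and the cut-offs it prints will wobble a little from
run to run:

import random

def chance_cutoff(n_games, alpha, draw_rate=0.32, trials=10000):
    # Percentage score two *equal* engines reach by chance only about
    # `alpha` of the time (one-sided); wins and losses split evenly
    # whatever the draws leave over.
    win_rate = (1.0 - draw_rate) / 2.0
    scores = []
    for _ in range(trials):
        points = 0.0
        for _ in range(n_games):
            r = random.random()
            if r < win_rate:
                points += 1.0          # win for engine A
            elif r < win_rate + draw_rate:
                points += 0.5          # draw
        scores.append(100.0 * points / n_games)
    scores.sort()
    # smallest percentage score that only about alpha of the samples reach
    return scores[int((1.0 - alpha) * trials)]

for n in (10, 20, 50, 100, 200):
    print(n, chance_cutoff(n, 0.05), chance_cutoff(n, 0.01), chance_cutoff(n, 0.001))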

Now, the main issue here is not with the single-comparison error rate, but with
the familywise error rate. In the case of Rybka, the familywise error estimate
might be based on the number of Beta (x+1) versions that get tested against
older betas. Rybka has about 14 betas, I think, so let's assume that entails 13
tests (i.e., each Beta (x+1) versus the previous beta). (And I know you already
understand what I say below, because I saw you describe exactly this statistical
issue in a previous email.)

So using the formula I put down below, with alpha = .05 and C = 13, the chance
of making at least one false positive = 1 - (1-alpha)^C, which comes to about a
49% chance. So you probably will make at least one false conclusion, maybe more,
if you use .05.

Now when you are comparing different parameter configurations, you get into some
big familywise error issues. 30 comparisons means you have about a 79% chance
of making one or more errors.
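
To make that arithmetic concrete, here is the same formula in a few lines of
Python (13 and 30 are just the comparison counts from the examples above):

def familywise_error(alpha, comparisons):
    # Chance of at least one false positive across `comparisons`
    # independent tests, each run at significance level `alpha`.
    return 1.0 - (1.0 - alpha) ** comparisons

print(familywise_error(0.05, 13))    # ~0.49 -- 13 beta-vs-beta matches at .05
print(familywise_error(0.05, 30))    # ~0.79 -- 30 parameter comparisons at .05
print(familywise_error(0.001, 13))   # ~0.013 -- the .1% cut-off stays conservative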

Other critical issues:

**If you test using blitz time controls, then the draw rate is lower and the
error rate higher than what I have assumed in this table (draw rate = .32). The
table will not be sufficiently conservative for blitz (still, the .1% cut-off is
pretty conservative); see the sketch after this list.
**There is another problem, not to do with error rate and false positives. This
has to do with external validity, e.g., would your results with an N = 10
generalize to another random sample of 10 games? I think there is no way you can
get a representative sample of opening positions with n = 10. You probably need
at least 50, and preferably a set like the one you have created (consisting of
260, I believe).
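
On the blitz point above, here is a small normal-approximation sketch of why the
draw rate matters: each game contributes variance (1 - draw rate)/4 around the
50% mean, so a lower draw rate spreads the chance results out and pushes the
cut-offs up. The .15 blitz draw rate below is just a made-up illustration, not a
measured figure:

from math import sqrt
from statistics import NormalDist

def approx_cutoff(n_games, alpha, draw_rate):
    # One-sided normal-approximation cut-off (percentage score) for two
    # equal engines; per-game variance is (1 - draw_rate) / 4.
    z = NormalDist().inv_cdf(1.0 - alpha)
    return 50.0 + z * 100.0 * sqrt((1.0 - draw_rate) / (4.0 * n_games))

for d in (0.32, 0.15):   # .32 = table assumption, .15 = hypothetical blitz rate
    print(d, round(approx_cutoff(100, 0.05, d), 1), round(approx_cutoff(100, 0.001, d), 1))

At the .32 draw rate and 100 games this gives roughly 56.8% and 62.7%, close to
the 57 and 63 in the table; at the lower draw rate the cut-offs come out
noticeably higher.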


So what are we to conclude from all this? Maybe no conclusions should be made
until you've run 100 or 200 games, using a wide range of openings. Then your
conclusion should be based on a conservative alpha of 1% or .1%?
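
And to connect this back to the rating points you mentioned: using the standard
Elo relation between expected score and rating difference, even the 57% cut-off
for 200 games at the 1% level already corresponds to roughly a 49-point edge, so
a genuine 10-20 point improvement will usually not reach significance at that
sample size. A quick check:

from math import log10

def score_to_elo(score_fraction):
    # Standard Elo relation between expected score and rating difference.
    return 400.0 * log10(score_fraction / (1.0 - score_fraction))

print(round(score_to_elo(0.57)))    # ~49 Elo for the 200-game, 1% cut-off
print(round(score_to_elo(0.53)))    # ~21 Elo -- about where a 10-20 point change sits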

What do you think?
best
Joseph











On February 04, 2006 at 04:13:17, Vasik Rajlich wrote:

>On February 03, 2006 at 19:26:43, Joseph Ciarrochi wrote:
>
>>Here is the stats table I promised Heinz and others who might be interested.
>>
>>
>>
>>
>>
>>
>>Table: Percentage Scores needed to conclude one engine is likely to be better
>>than the other in head to head competition
>>
>>		  Cut-off (alpha)
>>Number of games	5%	1%	.1%
>>10	        75	85	95
>>20	        67.5	75	80
>>30	        63.3	70	73.3
>>40	        62.5	66.3	71.3
>>50	        61	65	68
>>75	        58.6	61.3	66
>>100	        57	60	63
>>150	        55.7	58.3	60
>>200	        54.8	57	59.8
>>300	        54.2	55.8	57.5
>>500	        53.1	54.3	55.3
>>1000	        52.2	53.1	54.1
>>
>>Notes:
>>•	Based on 10000 randomly chosen samples. Thus, these values are approximate,
>>though with such a large sample, the values should be close to the “true” value.
>>•	Alpha represents the percentage of time that the score occurred by chance.
>>(i.e., occurred, even though we know the true value to be .50, or 50%). Alpha is
>>basically the odds of incorrectly saying two engines differ in head to head
>>competition.
>>•	Traditionally, .05 alpha is used as a cut-off, but I think this is a bit too
>>lenient. I would recommend 1% or .1%, to be reasonably confident.
>>•	Draw rate assumed to be .32 (based on CEGT 40/40 draw rates). Variations in
>>draw rate will slightly affect cut-off levels, but I don't think the difference
>>will be big.
>>•	Engines assumed to play equal numbers of games as white and black
>>•	In cases where a particular score fell both above and below the cutoff, then
>>the next score above the cutoff was chosen. This leads to conservative
>>estimates (e.g., for n of 10, a score of 7 occurred above and below the 5%
>>cutoff; therefore, 7.5 became the cut-off).
>>•	Type 1 error = saying an engine is better in head to head competition, when
>>there is actually no difference. The chance of making a type 1 error increases
>>with the number of comparisons you make.  If you conduct C comparisons, the odds
>>of making at least one type 1 error = 1 – (1-alpha)^C. (^ = raised to the power
>>of C).
>>•	 It is critical that you choose your sample size ahead of time, and do not
>>make any conclusions until you have run the full tournament. It is incorrect,
>>statistically, to watch the running of the tournament,  wait until an engine
>>reaches a cut-off, and then stop the tournament.
>
>Thanks for the tables, however I don't think that they are appropriate for the
>most common test scenario:
>
>1) You have some existing version of the program, let's call it Beta x.
>2) You spend a week or so making changes, leading to a new version, let's call
>it Beta (x+1). You hope that these changes give you 10 or 20 rating points, and
>you pretty much know that they won't give you more than that.
>3) You want to make sure that Beta (x+1) is not actually weaker than Beta x.
>
>In this case, if Beta (x+1) scores 7.5/10 against Beta x, you cannot be 95% sure
>that it is not weaker.
>
>As I think of it, the reason is that you have the additional information about
>the upper bound on the strength difference between the two versions.
>
>Note also that the second most common test scenario - that of trying to tune
>some search parameter - also has this property.
>
>If somebody could work through the math in these two above scenarios, it would
>be very interesting.
>
>It might also be that I just miss something here.
>
>Vas


