Computer Chess Club Archives


Subject: Re: table for detecting significant difference between two engines

Author: Vasik Rajlich

Date: 03:30:37 02/05/06

On February 04, 2006 at 05:13:20, Joseph Ciarrochi wrote:

>>Thanks for the tables, however I don't think that they are appropriate for the
>>most common test scenario:
>>
>>1) You have some existing version of the program, let's call it Beta x.
>>2) You spend a week or so making changes, leading to a new version, let's call
>>it Beta (x+1). You hope that these changes give you 10 or 20 rating points, and
>>you pretty much know that they won't give you more than that.
>>3) You want to make sure that Beta (x+1) is not actually weaker than Beta x.
>>
>>In this case, if Beta (x+1) scores 7.5/10 against Beta x, you cannot be 95% sure
>>that it is not weaker.
>
>
>

Hi Joseph,

there is a lot to think about here. I will need to spend a few days after the
1.2 release thinking about this and coming up with a real testing procedure.
Please find my initial comments below:

>Hi Vasik, always good to read your emails.
>
>Yes, Vasik, I agree that a 5% cut-off is not sufficiently conservative and will
>artificially increase the number of false positives. I reckon you should use
>the .1% cut-off. I agree with your conclusions.
>

I don't think this is realistic. Some changes are small enough that you will
simply never be able to get 99.9% confidence.

Also, please note that 95% confidence is rather acceptable. This means taking 19
steps forward, 1 step back - not perfect, but close enough.

>The problem isn't really with the table. The table correctly represents the
>odds of a particular result happening by chance, if you take ***one*** sample
>of N.
>
>Now, the main issue here is not with the single comparison rate, but with the
>familywise error rate. In the case of Rybka, the familywise error estimate
>might be based on the number of Betas (x+1) that are tested against old Betas.
>Rybka has about 14 betas, I think. Let's assume this entails 13 tests (e.g.,
>Beta (x+1) versus Beta x) (and I know you already understand what I say below,
>because I saw you describe exactly this statistical issue in a previous email).
>
>So using the formula I put down below, with alpha = .05 and C = 13, the chance
>of making at least one false positive = 1 - (1-alpha)^C = about 49%. So you
>probably will make at least one false conclusion, maybe more, if you use .05.
>

This is perfectly fine. Actually, there were just five attempts to improve the
playing level (Beta 9, 10, 11, 12, and 13b). The chance is (19/20)^5 - about
77% - that the 95%-confidence data would be correct regarding whether every
single one of those attempts did or did not succeed. The chance that more than
one mistake was made would be really low.
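
As a sanity check on the arithmetic, here is the formula from your table notes
in a few lines of Python (just a sketch for checking the numbers in this
thread, nothing more):

# Chance of at least one false positive across C independent comparisons,
# each run at significance level alpha: 1 - (1 - alpha)^C
def familywise_error(alpha, comparisons):
    return 1.0 - (1.0 - alpha) ** comparisons

print(familywise_error(0.05, 13))  # ~0.49 -> your 49% for 13 Beta-vs-Beta tests
print(familywise_error(0.05, 30))  # ~0.79 -> the 79% for 30 comparisons below
print(0.95 ** 5)                   # ~0.77 -> chance that all five 95%-confidence
                                   #          verdicts above are correct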

>Now when you are comparing different parameter configurations, you get into some
>big family wise error issues. 30 comparisons means you have about a 79% chance
>of making one or more errors.
>

This is also fine. Errors are inevitable. 5% is quite ok. Sure, there is a 79%
chance of making at least one error, but the chance of failing to make overall
progress is <<5%. This is the important thing.

>Other critical issues:
>
>**If you test using blitz time controls, then the draw rate is lower and the
>error rate higher than what I have utilized in this table (draw rate = .32).
>This table will not be sufficiently conservative for blitz (still, the .1%
>cut-off is pretty conservative).

This is IMHO not a major issue.

>**There is another problem, not to do with error rate and false positives.
>This has to do with external validity, e.g., would your results with an N = 10
>generalize to another random sample of 10 games? Well, I think there is no way
>you can get a representative sample of opening positions with N = 10. You
>probably need at least 50, and preferably a set like you have created
>(consisting of 260, I believe).
>

Also not IMHO a serious problem, unless you pick some really weird positions or
have some really big problem with your game-playing setup.

>
>So what are we to conclude from all this? Maybe no conclusions should be made
>until you've run 100 or 200 games, using a wide range of openings. Then your
>conclusion should be based on a conservative alpha of 1% or .1%?
>

Let me try to explain again what I see as the root of the "problem".

Let's say that I go and purposefully destroy the Rybka playing level, by for
example always returning 0 from the eval. Then, I start playing games, and the
new version is losing 3-0. This is already enough - I am now 100% sure that the
new version is weaker. As an intelligent tester, I combine the "raw statistics"
with my expectation of what is possible.

When I test a small change, I have the knowledge (which the raw statistics do
not take into account) that the difference in playing level is very small. This
is not reflected in the tables, and is the reason why I think the tables are not
accurate (given the assumption of a small change). To use my previous example,
if I make a small change, and the new version is scoring 7.5/10, it's obvious
that this is a statistical fluctuation. There cannot be 95% confidence at that
point that the change was good, although that is what the table says.
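
One way to make this concrete (a rough back-of-the-envelope sketch, not a
proper testing procedure): assume we know in advance that the true difference
lies within +/- 20 Elo, put a flat prior on that range, keep the .32 draw rate
from your table, and ask how confident we can be after a 7.5/10 result that
the new version is not weaker:

# Rough illustration only. Assumptions: true difference in [-20, +20] Elo
# with a flat prior; draw rate fixed at .32 as in the table notes.
from math import comb

DRAW = 0.32
N, SCORE = 10, 7.5          # the 7.5/10 example

def expected_score(elo):
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def likelihood(elo, n=N, score=SCORE, d=DRAW):
    # P(exactly `score` points in `n` games | Elo difference `elo`)
    pw = expected_score(elo) - d / 2.0      # win probability
    pl = 1.0 - pw - d                       # loss probability
    total = 0.0
    for draws in range(n + 1):
        wins = score - 0.5 * draws
        if wins < 0 or wins != int(wins) or wins + draws > n:
            continue
        wins = int(wins)
        losses = n - wins - draws
        total += (comb(n, wins) * comb(n - wins, draws)
                  * pw ** wins * d ** draws * pl ** losses)
    return total

# Flat prior over [-20, +20] Elo -> posterior is proportional to the likelihood.
grid = [e / 10.0 for e in range(-200, 201)]
post = [likelihood(e) for e in grid]
p_not_weaker = sum(p for e, p in zip(grid, post) if e >= 0) / sum(post)
print(p_not_weaker)    # roughly 0.6, nowhere near 0.95

It comes out around 0.6 - nowhere near 95% - because at N = 10 the result
barely moves the small-change prior. Drop the small-change assumption and the
same 7.5/10 looks far more convincing, which is why the table and my intuition
disagree.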

Anyway, thanks for your work on this.

Vas

>What do you think?
>best
>Joseph
>
>
>
>
>
>
>
>
>
>
>
>On February 04, 2006 at 04:13:17, Vasik Rajlich wrote:
>
>>On February 03, 2006 at 19:26:43, Joseph Ciarrochi wrote:
>>
>>>Here is the stats table I promised Heinz and others who might be interested.
>>>
>>>Table: Percentage scores needed to conclude one engine is likely to be better
>>>than the other in head to head competition
>>>
>>>                        Cut-off (alpha)
>>>Number of games     5%      1%      .1%
>>>10                  75      85      95
>>>20                  67.5    75      80
>>>30                  63.3    70      73.3
>>>40                  62.5    66.3    71.3
>>>50                  61      65      68
>>>75                  58.6    61.3    66
>>>100                 57      60      63
>>>150                 55.7    58.3    60
>>>200                 54.8    57      59.8
>>>300                 54.2    55.8    57.5
>>>500                 53.1    54.3    55.3
>>>1000                52.2    53.1    54.1
>>>
>>>Notes:
>>>•	Based on 10000 randomly chosen samples. Thus, these values are approximate,
>>>though with such a large sample, the values should be close to the “true” value.
>>>•	Alpha represents the percentage of time that the score occurred by chance.
>>>(i.e., occurred, even though we know the true value to be .50, or 50%). Alpha is
>>>basically the odds of incorrectly saying two engines differ in head to head
>>>competition.
>>>•	Traditionally, .05 alpha is used as a cut-off, but I think this is a bit too
>>>lenient. I would recommend 1% or .1%, to be reasonably confident.
>>>•	Draw rate assumed to be .32 (based on CEGT 40/40 draw rates). Variations in
>>>draw rate will slightly affect cut-off levels, but I don't think the difference
>>>will be big.
>>>•	Engines assumed to play equal numbers of games as white and black
>>>•	In cases where a particular score fell both above and below the cutoff, the
>>>next score above the cutoff was chosen. This leads to conservative estimates.
>>>(e.g., for n of 10, a score of 7 occurred above and below the 5% cutoff.
>>>Therefore, 7.5 became the cut-off.)
>>>•	Type 1 error = saying an engine is better in head to head competition, when
>>>there is actually no difference. The chance of making a type 1 error increases
>>>with the number of comparisons you make. If you conduct C comparisons, the odds
>>>of making at least one type 1 error = 1 - (1-alpha)^C (^ = raised to the power
>>>of C).
>>>•	It is critical that you choose your sample size ahead of time, and do not
>>>make any conclusions until you have run the full tournament. It is incorrect,
>>>statistically, to watch the running of the tournament, wait until an engine
>>>reaches a cut-off, and then stop the tournament.
>>
>>Thanks for the tables, however I don't think that they are appropriate for the
>>most common test scenario:
>>
>>1) You have some existing version of the program, let's call it Beta x.
>>2) You spend a week or so making changes, leading to a new version, let's call
>>it Beta (x+1). You hope that these changes give you 10 or 20 rating points, and
>>you pretty much know that they won't give you more than that.
>>3) You want to make sure that Beta (x+1) is not actually weaker than Beta x.
>>
>>In this case, if Beta (x+1) scores 7.5/10 against Beta x, you cannot be 95% sure
>>that it is not weaker.
>>
>>As I think of it, the reason is that you have the additional information about
>>the upper bound on the strength difference between the two versions.
>>
>>Note also that the second most common test scenario - that of trying to tune
>>some search parameter - also has this property.
>>
>>If somebody could work through the math in these two above scenarios, it would
>>be very interesting.
>>
>>It might also be that I just miss something here.
>>
>>Vas
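
P.S. For reference, the kind of simulation described in the quoted table notes
(10,000 random samples per cell, draw rate .32, two truly equal engines) can be
sketched in a few lines. This is a rough reconstruction from the notes, not
Joseph's actual script, and it ignores the tie-handling rule he describes:

import random

DRAW = 0.32          # draw rate from the table notes
SAMPLES = 10000      # random samples per table cell

def random_score(n_games):
    # Total score for one engine over n_games when both engines are equal.
    score = 0.0
    for _ in range(n_games):
        r = random.random()
        if r < DRAW:
            score += 0.5
        elif r < DRAW + (1.0 - DRAW) / 2.0:
            score += 1.0
    return score

def cutoff(n_games, alpha):
    # Approximate (1 - alpha) quantile of the score under "no difference".
    scores = sorted(random_score(n_games) for _ in range(SAMPLES))
    idx = min(SAMPLES - 1, round((1.0 - alpha) * SAMPLES))
    return scores[idx]

for n in (10, 20, 50, 100, 200, 1000):
    row = [round(100.0 * cutoff(n, a) / n, 1) for a in (0.05, 0.01, 0.001)]
    print(n, row)    # percentage scores needed at alpha = 5%, 1%, .1%

The printed percentages will not match the quoted table exactly (different
random samples, no tie rule), but they should land in the same ballpark.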


