Computer Chess Club Archives


Subject: Re: table for detecting significant difference between two engines

Author: Uri Blass

Date: 17:01:02 02/05/06



On February 05, 2006 at 06:30:37, Vasik Rajlich wrote:

>On February 04, 2006 at 05:13:20, Joseph Ciarrochi wrote:
>
>>>Thanks for the tables; however, I don't think that they are appropriate for the
>>>most common test scenario:
>>>
>>>1) You have some existing version of the program, let's call it Beta x.
>>>2) You spend a week or so making changes, leading to a new version, let's call
>>>it Beta (x+1). You hope that these changes give you 10 or 20 rating points, and
>>>you pretty much know that they won't give you more than that.
>>>3) You want to make sure that Beta (x+1) is not actually weaker than Beta x.
>>>
>>>In this case, if Beta (x+1) scores 7.5/10 against Beta x, you cannot be 95% sure
>>>that it is not weaker.
>>
>>
>>
>
>Hi Joseph,
>
>there is a lot to think about here. I will need to spend a few days after the
>1.2 release thinking about this and coming up with a real testing procedure.
>Please find my initial comments below:
>
>>Hi Vasik, always good to read your emails.
>>
>>Yes, Vasik, I agree that a 5% cut-off is not sufficiently conservative and will
>>artificially increase the number of false positives. I reckon you should use
>>the 0.1% cut-off. I agree with your conclusions.
>>
>
>I don't think this is realistic. Some changes are small enough that you will
>simply never be able to get 99.9% confidence.
>
>Also, please note that 95% confidence is rather acceptable. This means taking 19
>steps forward, 1 step back - not perfect, but close enough.
>
>>The problem isn't really with the table. The table correctly represents the
>>odds of a particular result happening by chance, if you take ***one*** sample
>>of N.
>>
>>Now, the main issue here is not with the single comparison rate, but with the
>>familywise error rate. In the case of Rybka, the familywise error estimate
>>might be based on the number of Beta (x+1) versions that are tested against
>>old betas. So Rybka has about 14 betas, I think. Let's assume this entails 13
>>tests (e.g., Beta (x+1) versus Beta x) (and I know you already understand what
>>I say below, because I saw you describe exactly this statistical issue in a
>>previous email).
>>
>>So using the formula I put down below, set alpha = .05; then the chance of
>>making at least one false positive = 1 - (1 - alpha)^C, which for C = 13 is
>>about 49%. So you probably will make at least one false conclusion, maybe
>>more if you use .05.
>>
>
>This is perfectly fine. Actually, there were just five attempts to improve the
>playing level (Beta 9, 10, 11, 12, and 13b). The chance is (19/20)^5 that 95%
>confidence data would be correct regarding whether every single one of those
>attempts did or did not succeed. The chance would be really low that more than
>one mistake was made.
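
For illustration, here is a minimal Python sketch of these calculations; the
alpha = 0.05, the 13 and 30 tests, and the five improvement attempts are the
figures used in this discussion, and nothing else is assumed:

    # Familywise error rate: the chance of at least one false positive
    # across C independent tests, each run at significance level alpha.
    def familywise_error(alpha, c):
        return 1 - (1 - alpha) ** c

    print(familywise_error(0.05, 13))  # ~0.49, the 13 beta-vs-beta tests
    print(familywise_error(0.05, 30))  # ~0.79, the 30-comparison case below
    print(0.95 ** 5)                   # ~0.77, chance that all five verdicts
                                       # on Beta 9-13b are correct
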
>
>>Now when you are comparing different parameter configurations, you get into some
>>big familywise error issues. 30 comparisons means you have about a 79% chance
>>of making one or more errors.
>>
>
>This is also fine. Errors are inevitable. 5% is quite ok. Sure, the chance is
>79% that at least one error is made, but the chance of failing to make overall
>progress is <<5%. This is the important thing.
>
>>Other critical issues:
>>
>>**If you test using blitz controls, then the draw rate is lower and the error
>>rate higher than what I have utilized in this table (draw rate = .32). This
>>table will not be sufficiently conservative for blitz (still, the 0.1% cut-off
>>is pretty conservative).
>
>This is IMHO not a major issue.
>
>>**There is another problem, not to do with error rate and false positives.
>>This has to do with external validity, e.g., would your results with an N = 10
>>generalize to another random sample of 10 games? Well, I think there is no way
>>you can get a representative sample of opening positions with N = 10. You
>>probably need at least 50, and preferably a set like you have created
>>(consisting of 260, I believe).
>>
>
>Also not IMHO a serious problem, unless you pick some really weird positions or
>have some really big problem with your game-playing setup.
>
>>
>>So what are we to conclude from all this? Maybe no conclusions should be made
>>until you've run 100 or 200 games, using a wide range of openings. Then your
>>conclusion should be based on a conservative alpha of 1% or 0.1%?
>>
>
>Let me try to explain again what I see as the root of the "problem".
>
>Let's say that I go and purposefully destroy the Rybka playing level, by for
>example always returning 0 from the eval. Then, I start playing games, and the
>new version is losing 3-0. This is already enough - I am now 100% sure that the
>new version is weaker. As an intelligent tester, I combine the "raw statistics"
>with my expectation of what is possible.
>
>When I test a small change, I have the knowledge (which the raw statistics do
>not take into account) that the difference in playing level is very small. This
>is not reflected in the tables, and is the reason why I think the tables are not
>accurate (given the assumption of a small change). To use my previous example,
>if I make a small change, and the new version is scoring 7.5/10, it's obvious
>that this is a statistical fluctuation. There cannot be 95% confidence at that
>point that the change was good, although that is what the table says.
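
For illustration, here is a minimal Python sketch of the underlying calculation:
how often a new version of exactly equal strength scores 7.5/10 or better purely
by chance. The 32% draw rate is the one quoted above for the table; the rest is
a plain multinomial enumeration.

    from math import comb

    # Chance that a new version of *equal* strength scores 7.5/10 or better
    # against the old one, assuming a 32% draw rate and otherwise equal
    # win/loss chances per game.
    p_draw = 0.32
    p_win = p_loss = (1.0 - p_draw) / 2.0
    games, threshold = 10, 7.5

    prob = 0.0
    for wins in range(games + 1):
        for draws in range(games - wins + 1):
            losses = games - wins - draws
            if wins + 0.5 * draws >= threshold:
                ways = comb(games, wins) * comb(games - wins, draws)
                prob += ways * p_win**wins * p_draw**draws * p_loss**losses

    # Prints roughly 0.04: just under the 5% level, which is why the raw table
    # treats 7.5/10 as significant, even though a strong prior belief that the
    # change is small says the result is almost certainly a fluctuation.
    print(prob)
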
>
>Anyway, thanks for your work on this.
>
>Vas

It is obvious that in this case the 75% score is better than the real expected
result.
I agree that if you know that the change is small enough, and that the expected
average gain is 0, then no result can be significant after 10 games, but life is
not always like that.

Imagine that you made a change that makes the program slightly faster in test
positions thanks to better move ordering.

You have a good reason to believe that the average gain is positive.
You cannot be sure about it, because you may have a bug, but you can decide that
you believe in a prior distribution of changes that gives you 90% confidence
that the program is better even before playing any games.

Now you need fewer games to reach 95% confidence than in the case where you
assume that you have no idea whether the change is negative or positive.
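
As a toy illustration of this point, here is a small Python sketch; the 55%/45%
per-game figures and the concrete game counts are made-up assumptions, chosen
only to show how a 90% prior reduces the number of games needed:

    from math import comb

    # Two hypotheses about a small change: a slight improvement (55% expected
    # score in decisive games) or a slight regression (45%), with a 90% prior
    # that it is an improvement (e.g. a clean speedup from better move ordering).
    P_BETTER, P_WORSE = 0.55, 0.45

    def posterior_better(wins, games, prior):
        """Posterior probability that the change is an improvement, given
        `wins` wins out of `games` decisive games (draws ignored here)."""
        like_better = comb(games, wins) * P_BETTER**wins * (1 - P_BETTER)**(games - wins)
        like_worse = comb(games, wins) * P_WORSE**wins * (1 - P_WORSE)**(games - wins)
        return prior * like_better / (prior * like_better + (1 - prior) * like_worse)

    # 12 wins out of 20 decisive games: about 0.95 with the 90% prior, but only
    # about 0.69 if you start from "no idea" (a flat 50% prior).
    print(posterior_better(12, 20, prior=0.90))
    print(posterior_better(12, 20, prior=0.50))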

Note that settling on a fixed 95% confidence level is probably a mistake. It is
more important not to make a mistake when the expected difference is bigger, so
if you make several changes and one of them has a bigger expected effect, you
can decide that you require higher confidence that this particular change is
positive.

Uri


