Computer Chess Club Archives


Subject: Re: table for detecting significant difference between two engines

Author: Joseph Ciarrochi

Date: 04:39:23 02/06/06


Yes, these are good points from both of you, Vasik and Uri. Certainly the tables are
not constrained by any a priori knowledge. If you do have such knowledge, then by
all means use it. As Uri points out, you can increase power if your knowledge leads
you to make a directional hypothesis (which is what the tables are based on, by the
way). When I am doing statistical analysis in science, I use as much theory as I can
to constrain the models.
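
To make the directional point concrete, here is a rough sketch in Python (standard
library only; draws discarded and each decisive game assumed 50/50 under the null,
with a 7-3 result just as an illustration) of how a one-tailed test gives a smaller
p-value than a two-tailed test for the same data:

from math import comb

def binom_tail(k, n, p=0.5):
    # P(X >= k) for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

wins, losses = 7, 3                      # e.g. 7 wins, 3 losses among the decisive games
n = wins + losses

one_sided = binom_tail(wins, n)          # H1: the new version is stronger
two_sided = min(1.0, 2 * one_sided)      # H1: the versions differ, either way

print("one-sided p =", round(one_sided, 3))   # about 0.17 for 7-3
print("two-sided p =", round(two_sided, 3))   # about 0.34 for 7-3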

In your example, you have made a minor change to an engine and you know a priori
that the change in strength is very small, so you would not expect any big shift in
score. If you then observed a big change (the 7.5/10 you mentioned), you would not
trust it.

So it sounds like you are making a critical assumption: given that the two engines
are very similar, they will tend to obtain very similar scores against each other.
The draw rate would presumably be higher and the variance lower. You would not use
the table I generated, because it assumes an average draw rate and variability
(obtained when different engines play each other). You would need a new table based
on your own, lower error rate, which would mean that smaller values are enough to
detect a difference.
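
As a rough illustration of why a higher draw rate means lower variability (assuming
equal strength, a game score of 1, 0.5 or 0, and draw rates that are placeholders
apart from the 0.32 used for my table):

def per_game_variance(draw_rate):
    # Variance of a single game score (1, 0.5, 0) when both engines are equal:
    # P(win) = P(loss) = (1 - d)/2, P(draw) = d, mean 0.5, variance (1 - d)/4.
    return (1 - draw_rate) / 4

n_games = 100
for d in (0.32, 0.50, 0.70):
    sd = (n_games * per_game_variance(d)) ** 0.5
    print("draw rate", d, "-> sd of a 100-game score is about", round(sd, 2), "points")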


In your particular case (testing Beta x versus Beta x+1), I think you would want to
generate a table that gives you the odds of particular scores when the engine is
playing itself. This will have a much higher draw rate and lower variability, and
values like 7.5/10 will be rarer. (By the way, I posted some results where the draw
rate was very low, as in human blitz, and there the cut-offs become more stringent.)
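
Here is a sketch of how such a self-play table could be generated by simulation; the
draw rates, match length and trial count are placeholders, and the model simply
assumes identical engines with the decisive games split evenly:

import random

def simulate_score(n_games, draw_rate, rng):
    # Total score for engine A over a match against an identical engine B.
    score = 0.0
    for _ in range(n_games):
        r = rng.random()
        if r < draw_rate:
            score += 0.5                          # draw
        elif r < draw_rate + (1 - draw_rate) / 2:
            score += 1.0                          # win
    return score

def tail_probability(threshold, n_games, draw_rate, trials=100_000, seed=1):
    # Estimated P(score >= threshold) when there is no real difference.
    rng = random.Random(seed)
    hits = sum(simulate_score(n_games, draw_rate, rng) >= threshold
               for _ in range(trials))
    return hits / trials

for d in (0.32, 0.70):
    p = tail_probability(7.5, 10, d)
    print("draw rate", d, "-> P(7.5/10 or better by chance) is about", round(p, 4))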

OK, now let's say you do obtain a score of 7.5/10, and you have strong theoretical
reasons to expect that this is inflated. That's fine: you can run another 100 games
or so and get closer to the true value. But it does not mean that the 7.5 is
meaningless.
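
A quick sketch of how the uncertainty shrinks with those extra games, using a normal
approximation and the 0.32 draw rate from my table (the interval widths are
approximate, not exact table values):

def score_standard_error(n_games, draw_rate=0.32):
    # Approximate sd of the score fraction when the engines are about equal.
    return ((1 - draw_rate) / 4 / n_games) ** 0.5

for n in (10, 110):                               # 10 games now, plus roughly 100 more
    half_width = 1.96 * score_standard_error(n)
    print(n, "games: 0.75 observed, roughly +/-", round(half_width, 2), "on the score fraction")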

Let's say that Beta x+1's observed score against Beta x can be written as T + E,
where T is the true strength advantage over Beta x and E is random error.

The bigger T is, the more likely you are to obtain 7.5/10 or higher, even if that
value is inflated. In contrast, if T is 0, then 7.5 is less likely to occur. So from
a statistical point of view, obtaining 7.5 means that T is less likely to be 0
(compared to the case where you obtain 5/10). The result is informative, even if the
estimate, 7.5, is a bit high.
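
To put the T + E argument in numbers, here is a sketch comparing how likely 7.5/10
or better is when T = 0 versus when there is a small real edge; the draw rate is the
0.32 from my table and the size of the edge is just an assumed value:

from math import comb

def prob_score_at_least(threshold, n_games, p_win, p_draw):
    # P(wins + 0.5*draws >= threshold), summing over all win/draw counts.
    p_loss = 1 - p_win - p_draw
    total = 0.0
    for wins in range(n_games + 1):
        for draws in range(n_games - wins + 1):
            losses = n_games - wins - draws
            if wins + 0.5 * draws >= threshold:
                total += (comb(n_games, wins) * comb(n_games - wins, draws)
                          * p_win ** wins * p_draw ** draws * p_loss ** losses)
    return total

d = 0.32                                          # draw rate behind my table
no_edge    = prob_score_at_least(7.5, 10, (1 - d) / 2, d)         # T = 0
small_edge = prob_score_at_least(7.5, 10, (1 - d) / 2 + 0.05, d)  # assumed small real edge
print("P(7.5/10 or better | no edge)    =", round(no_edge, 4))
print("P(7.5/10 or better | small edge) =", round(small_edge, 4))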




best
Joseph

On February 05, 2006 at 20:01:02, Uri Blass wrote:

>On February 05, 2006 at 06:30:37, Vasik Rajlich wrote:
>
>>On February 04, 2006 at 05:13:20, Joseph Ciarrochi wrote:
>>
>>>>Thanks for the tables, however I don't think that they are appropriate for the
>>>>most common test scenario:
>>>>
>>>>1) You have some existing version of the program, let's call it Beta x.
>>>>2) You spend a week or so making changes, leading to a new version, let's call
>>>>it Beta (x+1). You hope that these changes give you 10 or 20 rating points, and
>>>>you pretty much know that they won't give you more than that.
>>>>3) You want to make sure that Beta (x+1) is not actually weaker than Beta x.
>>>>
>>>>In this case, if Beta (x+1) scores 7.5/10 against Beta x, you cannot be 95% sure
>>>>that it is not weaker.
>>>
>>>
>>>
>>
>>Hi Joseph,
>>
>>there is a lot to think about here. I will need to spend a few days after the
>>1.2 release thinking about this and coming up with a real testing procedure.
>>Please find my initial comments below:
>>
>>>Hi vasik, always good to read your emails.
>>>
>>>yes, Vasik, I agree that a 5% cut-off is not sufficiently conservative and will
>>>artificially increase the number of false positives. I reckon you should use the
>>>.1 cut-off. I agree with your conclusions.
>>>
>>
>>I don't think this is realistic. Some changes are small enough that you will
>>simply never be able to get 99.9% confidence.
>>
>>Also, please note that 95% confidence is rather acceptable. This means taking 19
>>steps forward, 1 step back - not perfect, but close enough.
>>
>>>The problem isn't really with the table.  The table correctly represents the
>>>odds of a particular result happening by chance, if you take ***one****sample of
>>>N.
>>>
>>>Now, the main issue here is not with the single comparison rate, but with
>>>familywise error rate. in the case of rybka, familywise error estimate might be
>>>based on the number betas +1 that are tested against old betas. so rybka has
>>>about 14 betas i think. Let's assume this entails 13 tests (e.g., beta +1 versus
>>>beta) (and i know you already understand what i say below, because i saw you
>>>describe exactly this statistical issue in a previous email)
>>>
>>>So using the formula i put down below, set alpha = .05, then the chance of
>>>making at least one false positive = 1 – (1-alpha)^C;= about 49% chance. So you
>>>probably will make at least one false conclusion, maybe more if you use .05
>>>
>>
>>This is perfectly fine. Actually, there were just five attempts to improve the
>>playing level (Beta 9, 10, 11, 12, and 13b). The chance is 19/20 ^ 5 that 95%
>>confidence data would be correct regarding whether every single one of those
>>attempts did or did not succeed. The chance would be really low that more than
>>one mistake was made.
>>
>>>Now when you are comparing different parameter configurations, you get into some
>>>big family wise error issues. 30 comparisons means you have about a 79% chance
>>>of making one or more errors.
>>>
>>
>>This is also fine. Errors are inevitable. 5% is quite ok. Sure, the chance is
>>79% chance to make at least one error, but the chance is also <<5% to make
>>overall progress. This is the important thing.
>>
>>>Other critical issues:
>>>
>>>**if you test using blitz controls, then the draw rate is lower and error rate
>>>higher than what I have utilized in this table (draw rate = .32). This table
>>>will not be sufficiently conservative for blitz (still, .1 cut-off is pretty
>>>conservative)
>>
>>This is IMHO not a major issue.
>>
>>>**There is another problem, not to do with error rate and false positives.
>>>This has to do with external validity, e.g., would your results with an N = 10
>>>generalize to another random sample of 10 games. Well, I think there is no way
>>>you can get a representative sample of opening positions with n=10. you probably
>>>need at least 50 and preferably a set like you have created (consisting of 260 i
>>>believe).
>>>
>>
>>Also not IMHO a serious problem, unless you pick some really weird positions or
>>have some really big problem with your game-playing setup.
>>
>>>
>>>So what are we to conclude from all this? Maybe no conclusions should be made
>>>until you've run 100 or 200 games, using a wide range of openings. Then your
>>>conclusion should be based on a conservative alpha of 1 or .1?
>>>
>>
>>Let me try to explain again what I see as the root of the "problem".
>>
>>Let's say that I go and purposefully destroy the Rybka playing level, by for
>>example always returning 0 from the eval. Then, I start playing games, and the
>>new version is losing 3-0. This is already enough - I am now 100% sure that the
>>new version is weaker. As an intelligent tester, I combine the "raw statistics"
>>with my expectation of what is possible.
>>
>>When I test a small change, I have the knowledge (which the raw statistics do
>>not take into account) that the difference in playing level is very small. This
>>is not reflected in the tables, and is the reason why I think the tables are not
>>accurate (given the assumption of a small change). To use my previous  example,
>>if I make a small change, and the new version is scoring 7.5/10, it's obvious
>>that this is a statistical fluctuation. There cannot be 95% confidence at that
>>point that the change was good, although that is what the table says.
>>
>>Anyway, thanks for your work on this.
>>
>>Vas
>
>It is obvious that 75% is better than the real result in this case.
>I agree that if you know that the change is small enough and the expected
>average is 0 no result can be significant after 10 games but life is not always
>like that.
>
>Imagine that you made a change that made the program slightly faster in test
>positions thanks to better order of moves.
>
>You have a good reason to believe that the average is positive.
>You cannot be sure about it because you may have a bug but you can decide that
>you believe in aprior distribution of changes that give you 90% confidence that
>the program is better even with no games.
>
>Now you need less games to be sure with 95% confidence relative to the case that
>you assume that you have no idea if the change is negative or positive.
>
>Note that deciding about fixed 95% confidence is probably a mistake and it is
>more important not to do a mistake when the expected difference is bigger so if
>you do some changes when in one of them the expected change is bigger, then you
>can decide that you need bigger confidence that the change is positive in the
>case that the expected change is bigger.
>
>Uri


