Computer Chess Club Archives



Subject: a gambling metaphor: How would you bet?

Author: Joseph Ciarrochi

Date: 04:51:24 02/06/06


Actually, I thought of a gambling metaphor that might help.

Let's say we are comparing:

A: Rybka x to Rybka x+1
B: Rybka y to Rybka y+1


In one of these matchups, the engine has been improved to a 10% higher win rate
(i.e., it will win 60% of the time). In the other, the engine is exactly the
same (50% win rate). You have ten games to run on each, and you have to bet
$1000 on which one improved.

Case A: you obtain 8.5 out of 10 (85%)
Case B: you obtain 6 out of 10 (60%)

Clearly, case A is an overestimate of the true improvement, since you know that
the true win rate can be no more than 60%. Still, you would bet on case A,
because it is less likely to have come from a 50% win rate. 8.5/10 is a
meaningful finding.
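To put numbers on the bet, here is a quick Monte Carlo sketch (my own illustration, not part of the original post). It assumes a 32% draw rate, the figure used for the tables later in this thread, and asks how likely each observed score is if the engine did not improve at all (a true 50% score rate):

```python
# Betting metaphor, simulated. Assumed numbers: 32% draw rate, 10 games,
# and a "no improvement" null of a 50% expected score.
import random

DRAW_RATE = 0.32
TRIALS = 100_000

def simulate_score(p_score, n_games=10):
    """Score in points over n_games: each game is a draw with prob DRAW_RATE,
    otherwise a win with whatever probability yields expected score p_score."""
    p_win = (p_score - DRAW_RATE / 2) / (1 - DRAW_RATE)
    score = 0.0
    for _ in range(n_games):
        if random.random() < DRAW_RATE:
            score += 0.5
        elif random.random() < p_win:
            score += 1.0
    return score

def tail_prob(threshold, p_score=0.5):
    """Estimated P(score >= threshold) given the true score rate p_score."""
    hits = sum(simulate_score(p_score) >= threshold for _ in range(TRIALS))
    return hits / TRIALS

print("P(>= 8.5/10 | no improvement):", tail_prob(8.5))  # well under 1%
print("P(>= 6.0/10 | no improvement):", tail_prob(6.0))  # quite common
```

Case A's score is far less likely to be pure luck, which is exactly why it is the better bet even though 85% overstates the true 60%.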


best
Joseph




On February 06, 2006 at 07:39:23, Joseph Ciarrochi wrote:

>Yes, these are good points from you, Vasik, and Uri. Certainly the tables are
>not constrained by any a priori knowledge. If you do have such knowledge, then
>it should be used, by all means. As Uri points out, you can increase power if
>you have knowledge that leads you to a directional hypothesis (which is what
>the tables are based on, btw). When I do statistical analysis in science, I
>use as much theory as I can to constrain the models.
>
>In one example, you say that if you know you have made a minor change to an
>engine and you know a priori that the change in strength is very small, then
>you would not assume any big change. If you observed a big change (you said
>7.5/10), you would not trust it.
>
>So it sounds like you are making a critical assumption: given that the two
>engines are very similar, they will tend to obtain very similar scores against
>each other. The draw rate would presumably be higher and the variance lower.
>You would not use the table I generated, because it assumes an average draw
>rate and variability (obtained when different engines play each other). You
>would need a new table based on your own error rate, which, being lower, would
>mean that smaller score differences suffice to detect an improvement.
>
>
>In your particular case (testing beta x versus beta x+1), I think you would
>want to generate a table that gives you the odds of particular scores when the
>engine is playing itself. This will have a much higher draw rate and lower
>variability, and values like 7.5/10 will be rarer. (Btw, I posted some results
>where the draw rate was very low, as in human blitz, and the cut-offs become
>more stringent.)
>
>OK, now let's say you do obtain a score of 7.5/10, and you have strong
>theoretical reasons to expect that it is inflated. That's OK: you can run
>another 100 games or so and get closer to the true value. But this does not
>mean that 7.5 is meaningless.
>
>Let's say that the probability of Beta x+1 winning against Beta x = T + E,
>where T = the true strength advantage over Beta x, and E = error.
>
>The bigger T is, the more likely you are to obtain 7.5/10 or higher, even if
>this value is inflated. In contrast, if T is 0, then 7.5 is less likely to
>occur. So from a statistical point of view, obtaining 7.5 means that T is less
>likely to be 0 (compared to the case when you obtain 5/10). So the result is
>informative, even if the estimate, 7.5, is a bit high.
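The claim about T can be checked with a plain binomial calculation. This sketch is my own addition; it ignores draws for simplicity and uses "8 or more wins in 10" as a stand-in for "7.5/10 or better":

```python
# How the chance of an extreme result grows with the true win probability.
from math import comb

def p_at_least(k, n, p):
    """Exact P(at least k wins in n independent games with win prob p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

for p in (0.50, 0.55, 0.60, 0.65, 0.70):
    print(f"true win rate {p:.2f}: P(8+/10) = {p_at_least(8, 10, p):.4f}")
```

At a true 50% the tail probability is about 0.055, and it rises steadily with p, which is why observing such a score shifts the odds away from T = 0 even though 7.5/10 itself overstates T.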
>
>
>
>
>best
>Joseph
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>On February 05, 2006 at 20:01:02, Uri Blass wrote:
>
>>On February 05, 2006 at 06:30:37, Vasik Rajlich wrote:
>>
>>>On February 04, 2006 at 05:13:20, Joseph Ciarrochi wrote:
>>>
>>>>>Thanks for the tables; however, I don't think they are appropriate for the
>>>>>most common test scenario:
>>>>>
>>>>>1) You have some existing version of the program, let's call it Beta x.
>>>>>2) You spend a week or so making changes, leading to a new version, let's call
>>>>>it Beta (x+1). You hope that these changes give you 10 or 20 rating points, and
>>>>>you pretty much know that they won't give you more than that.
>>>>>3) You want to make sure that Beta (x+1) is not actually weaker than Beta x.
>>>>>
>>>>>In this case, if Beta (x+1) scores 7.5/10 against Beta x, you cannot be 95% sure
>>>>>that it is not weaker.
>>>>
>>>>
>>>>
>>>
>>>Hi Joseph,
>>>
>>>there is a lot to think about here. I will need to spend a few days after the
>>>1.2 release thinking about this and coming up with a real testing procedure.
>>>Please find my initial comments below:
>>>
>>>>Hi Vasik, always good to read your emails.
>>>>
>>>>Yes, Vasik, I agree that a 5% cut-off is not sufficiently conservative and
>>>>will artificially increase the number of false positives. I reckon you
>>>>should use the .1 cut-off. I agree with your conclusions.
>>>>
>>>
>>>I don't think this is realistic. Some changes are small enough that you will
>>>simply never be able to get 99.9% confidence.
>>>
>>>Also, please note that 95% confidence is rather acceptable. This means taking 19
>>>steps forward, 1 step back - not perfect, but close enough.
>>>
>>>>The problem isn't really with the table. The table correctly represents the
>>>>odds of a particular result happening by chance if you take ***one***
>>>>sample of N.
>>>>
>>>>Now, the main issue here is not the single-comparison error rate but the
>>>>familywise error rate. In the case of Rybka, the familywise error estimate
>>>>might be based on the number of beta+1 versions tested against old betas.
>>>>Rybka has about 14 betas, I think. Let's assume this entails 13 tests
>>>>(e.g., beta+1 versus beta). (And I know you already understand what I say
>>>>below, because I saw you describe exactly this statistical issue in a
>>>>previous email.)
>>>>
>>>>So, using the formula I put down below, set alpha = .05; then the chance of
>>>>making at least one false positive = 1 - (1 - alpha)^C, which is about a
>>>>49% chance. So you probably will make at least one false conclusion, maybe
>>>>more, if you use .05.
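The formula is easy to verify. This small check is my illustration; it reproduces the 49% figure above and the 79% figure for 30 comparisons mentioned further on:

```python
# Familywise error rate: chance of at least one false positive in c tests,
# each run at significance level alpha, assuming the tests are independent.
def familywise_error(alpha, c):
    return 1 - (1 - alpha)**c

print(familywise_error(0.05, 13))  # ~0.49 for 13 tests at alpha = .05
print(familywise_error(0.05, 30))  # ~0.79 for 30 comparisons
```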
>>>>
>>>
>>>This is perfectly fine. Actually, there were just five attempts to improve
>>>the playing level (Beta 9, 10, 11, 12, and 13b). The chance is (19/20)^5
>>>that 95%-confidence data would be correct regarding whether every single one
>>>of those attempts did or did not succeed. The chance is really low that more
>>>than one mistake was made.
>>>
>>>>Now, when you are comparing different parameter configurations, you get
>>>>into some big familywise error issues: 30 comparisons means you have about
>>>>a 79% chance of making one or more errors.
>>>>
>>>
>>>This is also fine. Errors are inevitable, and 5% is quite OK. Sure, the
>>>chance is 79% that you make at least one error, but the chance is also <<5%
>>>that you fail to make overall progress. This is the important thing.
>>>
>>>>Other critical issues:
>>>>
>>>>**If you test using blitz time controls, then the draw rate is lower and
>>>>the error rate higher than what I have used in this table (draw rate =
>>>>.32). This table will not be sufficiently conservative for blitz (still,
>>>>the .1 cut-off is pretty conservative).
>>>
>>>This is IMHO not a major issue.
>>>
>>>>**There is another problem, not to do with error rate and false positives.
>>>>This has to do with external validity: e.g., would your results with N = 10
>>>>generalize to another random sample of 10 games? I think there is no way
>>>>you can get a representative sample of opening positions with N = 10. You
>>>>probably need at least 50, and preferably a set like the one you have
>>>>created (consisting of 260, I believe).
>>>>
>>>
>>>Also not IMHO a serious problem, unless you pick some really weird positions or
>>>have some really big problem with your game-playing setup.
>>>
>>>>
>>>>So what are we to conclude from all this? Maybe no conclusions should be
>>>>made until you've run 100 or 200 games, using a wide range of openings.
>>>>Then your conclusion should be based on a conservative alpha of 1 or .1?
>>>>
>>>
>>>Let me try to explain again what I see as the root of the "problem".
>>>
>>>Let's say that I go and purposefully destroy the Rybka playing level, by for
>>>example always returning 0 from the eval. Then, I start playing games, and the
>>>new version is losing 3-0. This is already enough - I am now 100% sure that the
>>>new version is weaker. As an intelligent tester, I combine the "raw statistics"
>>>with my expectation of what is possible.
>>>
>>>When I test a small change, I have the knowledge (which the raw statistics
>>>do not take into account) that the difference in playing level is very
>>>small. This is not reflected in the tables, and is the reason why I think
>>>the tables are not accurate (given the assumption of a small change). To use
>>>my previous example, if I make a small change and the new version is scoring
>>>7.5/10, it's obvious that this is a statistical fluctuation. There cannot be
>>>95% confidence at that point that the change was good, although that is what
>>>the table says.
>>>
>>>Anyway, thanks for your work on this.
>>>
>>>Vas
>>
>>It is obvious that 75% is better than the real result in this case.
>>I agree that if you know the change is small enough and the expected average
>>is 0, then no result can be significant after 10 games, but life is not
>>always like that.
>>
>>Imagine that you made a change that made the program slightly faster in test
>>positions thanks to better move ordering.
>>
>>You have a good reason to believe that the average is positive.
>>You cannot be sure about it, because you may have a bug, but you can decide
>>that you believe in an a priori distribution of changes that gives you 90%
>>confidence that the program is better even with no games played.
>>
>>Now you need fewer games to be 95% confident, relative to the case where you
>>assume you have no idea whether the change is negative or positive.
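Uri's prior-knowledge argument can be made concrete with a toy two-hypothesis Bayes calculation. Everything here is my own illustration: the 55%/45% win rates and the 90% prior are assumed numbers, not anything from the thread, and draws are ignored.

```python
# Prior belief plus game results, combined by Bayes' rule.
from math import comb

P_PLUS, P_MINUS = 0.55, 0.45  # assumed win rates if the change helps / hurts
PRIOR = 0.9                   # 90% prior confidence that the change helps

def posterior_positive(wins, n):
    """P(change helps | wins out of n games), under the toy model above."""
    like_plus = comb(n, wins) * P_PLUS**wins * (1 - P_PLUS)**(n - wins)
    like_minus = comb(n, wins) * P_MINUS**wins * (1 - P_MINUS)**(n - wins)
    return PRIOR * like_plus / (PRIOR * like_plus + (1 - PRIOR) * like_minus)

print(posterior_positive(0, 0))   # 0.9: with no games, the prior alone
print(posterior_positive(6, 10))  # ~0.93: a modest result adds confidence
```

Starting from a flat 50% prior instead, the same 6/10 would leave you well short of 95% confidence, which is exactly the point about needing fewer games when the prior already favors the change.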
>>
>>Note that deciding on a fixed 95% confidence is probably a mistake; it is
>>more important not to make a mistake when the expected difference is bigger.
>>So if you make several changes, and for one of them the expected change is
>>bigger, you can decide that you need higher confidence that the change is
>>positive in that case.
>>
>>Uri


