Author: Joseph Ciarrochi
Date: 04:51:24 02/06/06
Actually, I thought of a good gambling metaphor that might help. Let's say we are comparing:

A: Rybka x to Rybka x+1
B: Rybka y to Rybka y+1

In one of these matchups, the engine has been improved to have a win rate 10 points higher (i.e., it will win 60% of the time). In the other, the engine is exactly the same (50% win rate). You have ten games to run on both, and you have to bet $1000 on which one improved.

Case A: you obtain 8.5 out of 10 (85%)
Case B: you obtain 6 out of 10 (60%)

Clearly, case A is an overestimate of the true improvement, since you know the engine can have no more than a true 60% win rate. Still, you would bet on case A, because it is less likely to have come from a 50% win rate. 8.5/10 is a meaningful finding. (A small simulation illustrating this appears at the end of this post, after the quoted thread.)

best
Joseph

On February 06, 2006 at 07:39:23, Joseph Ciarrochi wrote:

>Yes, these are good points by both of you, Vasik and Uri. Certainly the tables
>are not constrained by any a-priori knowledge. If you do have such knowledge,
>then it should be used, by all means. As Uri points out, you can increase power
>if you have knowledge that leads you to make a directional hypothesis (which is
>what the tables are based on, btw). When I am doing statistical analysis in
>science, I use as much theory as I can to constrain the models.
>
>In one example, you say that you know you have made a minor change to an
>engine, and you know a priori that the change in strength is very small, so you
>would not assume any big change. If you observed a big change (you said
>7.5/10), then you would not trust it.
>
>So it sounds like you are making a critical assumption: given that the two
>engines are very similar, they will tend to obtain very similar scores against
>each other. The draw rate would presumably be higher and the variance lower.
>You would not use the table I generated, because it assumes an average draw
>rate and variability (obtained when different engines play each other). You
>would need a new table based on your error rate, which, being lower, would mean
>that you need smaller values to detect a difference.
>
>In your particular case (testing Beta x versus Beta x+1), I think you would
>want to generate a table that gives you the odds of particular scores when the
>engine is playing itself. This will have a much higher draw rate and lower
>variability, and values like 7.5/10 will be rarer. (Btw, I posted some results
>where the draw rate was very low, as in the case of human blitz, and the
>cut-offs become more stringent.)
>
>Ok, now let's say you do obtain a score of 7.5/10, and you have strong
>theoretical reasons to expect that this is inflated. That's ok: you can run
>another 100 games or so and get closer to the true value. But this does not
>mean that 7.5 is meaningless.
>
>Let's say that the odds of Beta x+1 winning against Beta x = T + E, where T =
>the true strength advantage over Beta x, and E = error.
>
>The bigger T is, the more likely you are to obtain 7.5/10 or higher, even if
>this value is inflated. In contrast, if T is 0, then 7.5 is less likely to
>occur. So from a statistical point of view, obtaining a 7.5 means that T is
>less likely to be 0 (compared to the case when you obtain a 5/10). So the
>result is informative, even if the estimate, 7.5, is a bit high.
>
>best
>Joseph
>
>On February 05, 2006 at 20:01:02, Uri Blass wrote:
>
>>On February 05, 2006 at 06:30:37, Vasik Rajlich wrote:
>>
>>>On February 04, 2006 at 05:13:20, Joseph Ciarrochi wrote:
>>>
>>>>>Thanks for the tables; however, I don't think that they are appropriate
>>>>>for the most common test scenario:
>>>>>
>>>>>1) You have some existing version of the program; let's call it Beta x.
>>>>>2) You spend a week or so making changes, leading to a new version; let's
>>>>>call it Beta (x+1). You hope that these changes give you 10 or 20 rating
>>>>>points, and you pretty much know that they won't give you more than that.
>>>>>3) You want to make sure that Beta (x+1) is not actually weaker than Beta x.
>>>>>
>>>>>In this case, if Beta (x+1) scores 7.5/10 against Beta x, you cannot be 95%
>>>>>sure that it is not weaker.
>>>
>>>Hi Joseph,
>>>
>>>there is a lot to think about here. I will need to spend a few days after
>>>the 1.2 release thinking about this and coming up with a real testing
>>>procedure. Please find my initial comments below:
>>>
>>>>Hi Vasik, always good to read your emails.
>>>>
>>>>Yes, I agree that a 5% cut-off is not sufficiently conservative and will
>>>>artificially increase the number of false positives. I reckon you should
>>>>use the .1 cut-off. I agree with your conclusions.
>>>
>>>I don't think this is realistic. Some changes are small enough that you will
>>>simply never be able to get 99.9% confidence.
>>>
>>>Also, please note that 95% confidence is rather acceptable. This means
>>>taking 19 steps forward, 1 step back - not perfect, but close enough.
>>>
>>>>The problem isn't really with the table. The table correctly represents the
>>>>odds of a particular result happening by chance, if you take ***one***
>>>>sample of N.
>>>>
>>>>Now, the main issue here is not with the single-comparison error rate, but
>>>>with the familywise error rate. In the case of Rybka, the familywise error
>>>>estimate might be based on the number of Betas x+1 that are tested against
>>>>old Betas. Rybka has about 14 betas, I think. Let's assume this entails 13
>>>>tests (e.g., Beta x+1 versus Beta x). (And I know you already understand
>>>>what I say below, because I saw you describe exactly this statistical issue
>>>>in a previous email.)
>>>>
>>>>So, using the formula I put down below, set alpha = .05; then the chance of
>>>>making at least one false positive = 1 - (1-alpha)^C = about 49%. So you
>>>>probably will make at least one false conclusion, maybe more, if you use
>>>>.05.
>>>
>>>This is perfectly fine. Actually, there were just five attempts to improve
>>>the playing level (Beta 9, 10, 11, 12, and 13b). The chance is (19/20)^5
>>>that 95% confidence data would be correct regarding whether every single one
>>>of those attempts did or did not succeed. The chance would be really low
>>>that more than one mistake was made.
>>>
>>>>Now, when you are comparing different parameter configurations, you get
>>>>into some big familywise error issues. 30 comparisons means you have about
>>>>a 79% chance of making one or more errors.
>>>
>>>This is also fine. Errors are inevitable. 5% is quite ok. Sure, the chance
>>>is 79% to make at least one error, but the chance is also <<5% that we fail
>>>to make overall progress. This is the important thing.
>>>
>>>>Other critical issues:
>>>>
>>>>**If you test using blitz controls, then the draw rate is lower and the
>>>>error rate higher than what I have utilized in this table (draw rate =
>>>>.32).
>>>>This table will not be sufficiently conservative for blitz (still, the .1
>>>>cut-off is pretty conservative).
>>>
>>>This is IMHO not a major issue.
>>>
>>>>**There is another problem, not to do with error rate and false positives.
>>>>This has to do with external validity, e.g., would your results with an N
>>>>of 10 generalize to another random sample of 10 games? Well, I think there
>>>>is no way you can get a representative sample of opening positions with
>>>>n = 10. You probably need at least 50, and preferably a set like the one
>>>>you have created (consisting of 260, I believe).
>>>
>>>Also not IMHO a serious problem, unless you pick some really weird positions
>>>or have some really big problem with your game-playing setup.
>>>
>>>>So what are we to conclude from all this? Maybe no conclusions should be
>>>>made until you've run 100 or 200 games, using a wide range of openings.
>>>>Then your conclusion should be based on a conservative alpha of 1 or .1
>>>>(i.e., 99% or 99.9% confidence)?
>>>
>>>Let me try to explain again what I see as the root of the "problem".
>>>
>>>Let's say that I go and purposefully destroy the Rybka playing level, by,
>>>for example, always returning 0 from the eval. Then I start playing games,
>>>and the new version is losing 3-0. This is already enough - I am now 100%
>>>sure that the new version is weaker. As an intelligent tester, I combine the
>>>"raw statistics" with my expectation of what is possible.
>>>
>>>When I test a small change, I have the knowledge (which the raw statistics
>>>do not take into account) that the difference in playing level is very
>>>small. This is not reflected in the tables, and it is the reason why I think
>>>the tables are not accurate (given the assumption of a small change). To use
>>>my previous example, if I make a small change and the new version is scoring
>>>7.5/10, it's obvious that this is a statistical fluctuation. There cannot be
>>>95% confidence at that point that the change was good, although that is what
>>>the table says.
>>>
>>>Anyway, thanks for your work on this.
>>>
>>>Vas
>>
>>It is obvious that 75% is better than the real result in this case.
>>I agree that if you know the change is small enough and the expected average
>>is 0, no result can be significant after 10 games, but life is not always
>>like that.
>>
>>Imagine that you made a change that made the program slightly faster in test
>>positions thanks to a better order of moves.
>>
>>You have a good reason to believe that the average is positive. You cannot be
>>sure about it, because you may have a bug, but you can decide that you believe
>>in an a priori distribution of changes that gives you 90% confidence that the
>>program is better even with no games.
>>
>>Now you need fewer games to be sure with 95% confidence, relative to the case
>>where you assume you have no idea whether the change is negative or positive.
>>
>>Note that deciding on a fixed 95% confidence is probably a mistake; it is more
>>important not to make a mistake when the expected difference is bigger. So if
>>you make several changes, and in one of them the expected change is bigger,
>>you can decide that you need bigger confidence that the change is positive in
>>that case.
>>
>>Uri
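
To make the betting metaphor at the top concrete, here is a minimal Monte
Carlo sketch (Python, standard library only) of the tail probabilities
involved. It assumes a trinomial win/draw/loss model with the 0.32 draw rate
mentioned in the quoted thread; the function names, the 60% alternative, and
the trial count are illustrative, not the actual procedure behind the tables.

import random

def match_score(n_games, expected_score, draw_rate, rng):
    """Simulate one n_games match; return the new version's total score.
    p_win is chosen so the expected score per game equals expected_score,
    given that a fraction draw_rate of games are drawn (worth 0.5)."""
    p_win = (expected_score - 0.5 * draw_rate) / (1.0 - draw_rate)
    score = 0.0
    for _ in range(n_games):
        if rng.random() < draw_rate:
            score += 0.5          # draw
        elif rng.random() < p_win:
            score += 1.0          # win for the new version
    return score

def tail_probability(threshold, expected_score, draw_rate=0.32,
                     n_games=10, trials=100_000):
    """Monte Carlo estimate of P(score >= threshold)."""
    rng = random.Random(42)
    hits = sum(match_score(n_games, expected_score, draw_rate, rng) >= threshold
               for _ in range(trials))
    return hits / trials

for threshold in (8.5, 6.0):
    p_null = tail_probability(threshold, expected_score=0.5)  # equal strength
    p_alt = tail_probability(threshold, expected_score=0.6)   # improved engine
    print(f"score >= {threshold}/10: P under 50% = {p_null:.4f}, "
          f"P under 60% = {p_alt:.4f}")

Under the 50% null, a score of 8.5/10 or better is far rarer than 6/10 or
better, which is exactly why case A is the better bet even though 8.5/10
overstates the true improvement.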
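
The familywise error figures quoted above follow directly from the formula
1 - (1-alpha)^C. A quick check (continuing in Python; the helper name is mine):

def familywise_error(alpha, comparisons):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** comparisons

for c in (5, 13, 30):
    print(f"{c} tests at alpha = .05: "
          f"P(at least one false positive) = {familywise_error(0.05, c):.2f}")

This prints roughly 0.23, 0.49, and 0.79, matching the 49% and 79% figures in
the quoted posts; the complement of the 5-test case, 0.95^5 = about 0.77, is
Vasik's (19/20)^5.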
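
Uri's suggestion of starting from prior confidence can be sketched with a
conjugate Beta-binomial model. This is only one possible reading of his point,
not something he specified: the Beta(7, 3) prior below is an invented example
that happens to encode roughly 90% prior confidence that the change helps, and
draws are ignored to keep the model binomial.

from math import exp, lgamma, log

def prob_new_version_better(wins, losses, prior_a=7.0, prior_b=3.0,
                            steps=100_000):
    """P(true win rate > 0.5) under a Beta(prior_a, prior_b) prior,
    updated with observed wins and losses (draws ignored)."""
    a, b = prior_a + wins, prior_b + losses
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)  # log of 1/B(a, b)
    # Riemann-sum integration of the Beta(a, b) density over (0.5, 1)
    width = 0.5 / steps
    total = 0.0
    for i in range(steps):
        p = 0.5 + (i + 0.5) * width
        total += exp(log_norm + (a - 1) * log(p) + (b - 1) * log(1 - p)) * width
    return total

print(f"no games yet: P(better) = {prob_new_version_better(0, 0):.3f}")
print(f"after +6 -4:  P(better) = {prob_new_version_better(6, 4):.3f}")

With a prior like this, fewer games are needed to push the posterior
probability past 95% than when starting from the flat prior implied by the
tables, which is Uri's point about needing fewer games.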