Author: Vasik Rajlich
Date: 08:34:49 02/06/06
On February 06, 2006 at 07:51:24, Joseph Ciarrochi wrote:

>Actually, i thought of a good gambling metaphor that might help.
>
>Lets say we are comparing
>
>A: rybka x to rybka x+1
>b: rybka y to rybka y+1
>
>in one of these, the engine has been improved to have a 10% higher win rate
>(i.e., will win 60% of the time). In the other case, the engine is exactly the
>same (50% win rate). You have ten games to run on both, and you have to bet
>$1000 on which one improved.
>
>Case A: you obtain 8.5 out of 10 (85%)
>Case B: you obtain 6 out of ten (60%)
>
>Clearly, case A is an overestimate of the true improvement, since you know that
>it can have no more than a true 60% win rate. Still, you would bet on case A,
>because it is less likely to have a 50% win rate. 8.5 /10 is a meaningful
>finding
>

This is true - but .. :)

First of all, a 60% winning rate is a bit high. A change of 10 rating points is
more normal, and this gives somewhere around 52%. At 52%, the difference between
betting on case A and betting on case B is not very high.

In contrast, if you took two engines randomly from the set of all existing
engines, and had to bet on which one is stronger, then betting on case A is a
much better bet than betting on case B.

Really, what it comes down to is this: what is the adjustment in confidence you
have (starting from the tables you presented as the default) as a result of your
knowledge about the change you made?

More practically: if the knowledge you have is that the change is small, what do
the new tables look like?

Vas

>
>best
>Joseph
>
>On February 06, 2006 at 07:39:23, Joseph Ciarrochi wrote:
>
>>Yes, these are good points by both you, vasik, and uri. Certainly the tables are
>>not constrained by any a-priori knowledge. If you do have such knowledge, then
>>this should be used, by all means. As uri points out, you can increase power if
>>you have knowledge that leads you to make a directional hypothesis (which is
>>what the tables are based on, btw).
>>When i am doing statistical analysis in
>>science, I use as much theory as i can to constrain the models.
>>
>>In one example, you say that you know you have made a minor change to an engine
>>and you know a-priori that the change in strength is very small, then you would
>>not assume any big change. If you observed a big change (you said 7.5/10), then
>>you would not trust it.
>>
>>So it sounds like you are making a critical assumption, namely, given the two
>>engines are very similar, they will tend to obtain very similar scores against
>>each other. Draw rate would be presumably higher and variance lower. You would
>>not use the table I generated, because it assumes an average draw rate and
>>variability (obtained when different engines play each other). You would need a
>>new table that is based on your error rate, which, being lower, would mean that
>>you need smaller values to detect a difference.
>>
>>In your particular case (testing beta x versus beta x +1), I think you would
>>want to generate a table that gives you the odds of particular scores, when the
>>engine is playing itself. This will have a much higher draw rate, lower
>>variability, and values like 7.5/10 will be rarer. (btw, i posted some results
>>where draw rate was very low, as in the case of human blitz, and the cut-offs
>>become more stringent).
>>
>>Ok, now let's say you do obtain a score of 7.5/10, and you have strong
>>theoretical reasons to expect that this is inflated. That's ok. you can run
>>another 100 games or so and get closer to the true value. But this does not mean
>>that 7.5 is meaningless.
>>
>>let's say that the odds of Beta X+1 winning against beta x is = T + E, where T =
>>true strength advantage over Beta X, and E = error.
>>
>>The bigger T is, the more likely you are to obtain 7.5/10 or higher, even if
>>this value is inflated. in contrast, if T is 0, then 7.5 is less likely to
>>occur.
>>So from a statistical point of view, obtaining a 7.5 means that T is less
>>likely to be 0 (compared to the case when you obtain a 5/10). So the result is
>>informative, even if the estimate, 7.5, is a bit high.
>>
>>best
>>Joseph
>>
>>On February 05, 2006 at 20:01:02, Uri Blass wrote:
>>
>>>On February 05, 2006 at 06:30:37, Vasik Rajlich wrote:
>>>
>>>>On February 04, 2006 at 05:13:20, Joseph Ciarrochi wrote:
>>>>
>>>>>>Thanks for the tables, however I don't think that they are appropriate for the
>>>>>>most common test scenario:
>>>>>>
>>>>>>1) You have some existing version of the program, let's call it Beta x.
>>>>>>2) You spend a week or so making changes, leading to a new version, let's call
>>>>>>it Beta (x+1). You hope that these changes give you 10 or 20 rating points, and
>>>>>>you pretty much know that they won't give you more than that.
>>>>>>3) You want to make sure that Beta (x+1) is not actually weaker than Beta x.
>>>>>>
>>>>>>In this case, if Beta (x+1) scores 7.5/10 against Beta x, you cannot be 95% sure
>>>>>>that it is not weaker.
>>>>>
>>>>
>>>>Hi Joseph,
>>>>
>>>>there is a lot to think about here. I will need to spend a few days after the
>>>>1.2 release thinking about this and coming up with a real testing procedure.
>>>>Please find my initial comments below:
>>>>
>>>>>Hi vasik, always good to read your emails.
>>>>>
>>>>>yes, Vasik, I agree that a 5% cut-off is not sufficiently conservative and will
>>>>>artificially increase the number of false positives. I reckon you should use the
>>>>>.1 cut-off. I agree with your conclusions.
>>>>>
>>>>
>>>>I don't think this is realistic. Some changes are small enough that you will
>>>>simply never be able to get 99.9% confidence.
>>>>
>>>>Also, please note that 95% confidence is rather acceptable. This means taking 19
>>>>steps forward, 1 step back - not perfect, but close enough.
>>>>
>>>>>The problem isn't really with the table.
>>>>>The table correctly represents the
>>>>>odds of a particular result happening by chance, if you take ***one*** sample of
>>>>>N.
>>>>>
>>>>>Now, the main issue here is not with the single comparison rate, but with
>>>>>familywise error rate. in the case of rybka, familywise error estimate might be
>>>>>based on the number of betas +1 that are tested against old betas. so rybka has
>>>>>about 14 betas i think. Let's assume this entails 13 tests (e.g., beta +1 versus
>>>>>beta) (and i know you already understand what i say below, because i saw you
>>>>>describe exactly this statistical issue in a previous email)
>>>>>
>>>>>So using the formula i put down below, set alpha = .05, then the chance of
>>>>>making at least one false positive = 1 - (1-alpha)^C = about 49% chance. So you
>>>>>probably will make at least one false conclusion, maybe more if you use .05
>>>>>
>>>>
>>>>This is perfectly fine. Actually, there were just five attempts to improve the
>>>>playing level (Beta 9, 10, 11, 12, and 13b). The chance is (19/20)^5 that 95%
>>>>confidence data would be correct regarding whether every single one of those
>>>>attempts did or did not succeed. The chance would be really low that more than
>>>>one mistake was made.
>>>>
>>>>>Now when you are comparing different parameter configurations, you get into some
>>>>>big family wise error issues. 30 comparisons means you have about a 79% chance
>>>>>of making one or more errors.
>>>>>
>>>>
>>>>This is also fine. Errors are inevitable. 5% is quite ok. Sure, there is a
>>>>79% chance of making at least one error, but the chance is also <<5% of not
>>>>making overall progress. This is the important thing.
>>>>
>>>>>Other critical issues:
>>>>>
>>>>>**if you test using blitz controls, then the draw rate is lower and error rate
>>>>>higher than what I have utilized in this table (draw rate = .32).
>>>>>This table
>>>>>will not be sufficiently conservative for blitz (still, .1 cut-off is pretty
>>>>>conservative)
>>>>
>>>>This is IMHO not a major issue.
>>>>
>>>>>**There is another problem, not to do with error rate and false positives.
>>>>>This has to do with external validity. e.g., would your results with an N = 10
>>>>>generalize to another random sample of 10 games. Well, I think there is no way
>>>>>you can get a representative sample of opening positions with n=10. you probably
>>>>>need at least 50 and preferably a set like you have created (consisting of 260 i
>>>>>believe).
>>>>>
>>>>
>>>>Also not IMHO a serious problem, unless you pick some really weird positions or
>>>>have some really big problem with your game-playing setup.
>>>>
>>>>>
>>>>>So what are we to conclude from all this. Maybe no conclusions should be made
>>>>>until you've run 100 or 200 games, using a wide range of openings. then your
>>>>>conclusion should be based on a conservative alpha of 1 or .1.???
>>>>>
>>>>
>>>>Let me try to explain again what I see as the root of the "problem".
>>>>
>>>>Let's say that I go and purposefully destroy the Rybka playing level, by for
>>>>example always returning 0 from the eval. Then, I start playing games, and the
>>>>new version is losing 3-0. This is already enough - I am now 100% sure that the
>>>>new version is weaker. As an intelligent tester, I combine the "raw statistics"
>>>>with my expectation of what is possible.
>>>>
>>>>When I test a small change, I have the knowledge (which the raw statistics do
>>>>not take into account) that the difference in playing level is very small. This
>>>>is not reflected in the tables, and is the reason why I think the tables are not
>>>>accurate (given the assumption of a small change). To use my previous example,
>>>>if I make a small change, and the new version is scoring 7.5/10, it's obvious
>>>>that this is a statistical fluctuation.
>>>>There cannot be 95% confidence at that
>>>>point that the change was good, although that is what the table says.
>>>>
>>>>Anyway, thanks for your work on this.
>>>>
>>>>Vas
>>>
>>>It is obvious that 75% is better than the real result in this case.
>>>I agree that if you know that the change is small enough and the expected
>>>average is 0 no result can be significant after 10 games but life is not always
>>>like that.
>>>
>>>Imagine that you made a change that made the program slightly faster in test
>>>positions thanks to better order of moves.
>>>
>>>You have a good reason to believe that the average is positive.
>>>You cannot be sure about it because you may have a bug but you can decide that
>>>you believe in a prior distribution of changes that gives you 90% confidence that
>>>the program is better even with no games.
>>>
>>>Now you need fewer games to be sure with 95% confidence relative to the case that
>>>you assume that you have no idea if the change is negative or positive.
>>>
>>>Note that deciding about fixed 95% confidence is probably a mistake and it is
>>>more important not to make a mistake when the expected difference is bigger so if
>>>you do some changes when in one of them the expected change is bigger, then you
>>>can decide that you need bigger confidence that the change is positive in the
>>>case that the expected change is bigger.
>>>
>>>Uri
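
For anyone who wants to check the arithmetic in this thread, here is a rough Python sketch. It is only an illustration under stated assumptions, not the procedure behind the tables and not anyone's actual testing setup. The function names are made up for this example. It assumes independent games, a fixed draw rate of .32 (the figure quoted for the tables), the standard logistic Elo formula, and equal prior odds that case A or case B is the improved engine. Note that the logistic formula puts a 10-point edge nearer 51.4% than 52%.

```python
"""Back-of-the-envelope checks for numbers discussed in this thread."""


def familywise_error(alpha, comparisons):
    """Chance of at least one false positive in independent tests."""
    return 1.0 - (1.0 - alpha) ** comparisons


def elo_expected_score(rating_diff):
    """Expected score under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


def score_distribution(expected, draw_rate, games):
    """Exact distribution of the match score, indexed in half points.

    Assumption: games are independent and the draw rate is fixed.
    """
    win = expected - draw_rate / 2.0
    loss = 1.0 - draw_rate - win
    if win < 0.0 or loss < 0.0:
        raise ValueError("expected score is incompatible with this draw rate")
    dist = [1.0]  # before any game: score 0 with probability 1
    for _ in range(games):
        nxt = [0.0] * (len(dist) + 2)
        for half_points, prob in enumerate(dist):
            nxt[half_points] += prob * loss           # loss adds 0
            nxt[half_points + 1] += prob * draw_rate  # draw adds 1/2
            nxt[half_points + 2] += prob * win        # win adds 1
        dist = nxt
    return dist


def odds_for_case_a(improved, score_a, score_b, draw_rate=0.32, games=10):
    """Odds that A (not B) is the improved engine, given both match scores.

    Assumption: exactly one of the two was improved, with equal prior odds.
    """
    same = score_distribution(0.5, draw_rate, games)
    better = score_distribution(improved, draw_rate, games)
    a, b = round(2 * score_a), round(2 * score_b)
    return (better[a] * same[b]) / (same[a] * better[b])


if __name__ == "__main__":
    print("familywise error, 13 tests:", familywise_error(0.05, 13))
    print("familywise error, 30 tests:", familywise_error(0.05, 30))
    print("all 5 verdicts correct:", 0.95 ** 5)
    print("expected score at +10 Elo:", elo_expected_score(10))
    print("odds for A at 60%:", odds_for_case_a(0.60, 8.5, 6.0))
    print("odds for A at 52%:", odds_for_case_a(0.52, 8.5, 6.0))
```

Under these assumptions the odds favor betting on case A in both scenarios, but far less strongly when the improved engine is only a 52% favorite than when it is a 60% favorite. A different draw rate or a different prior would change the numbers.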
Last modified: Thu, 15 Apr 21 08:11:13 -0700