Author: Bruce Moreland
Date: 17:04:42 02/04/01
On February 04, 2001 at 18:36:22, Walter Koroljow wrote:

>Bruce,
>
>I am glad you are willing to do something constructive. I am coming into this
>thread at this point, so forgive me if I do not fully understand the context.
>
>As I understand it, you want to measure things that are difficult to measure --
>e.g., whether your latest small change added a rating point to your program.
>You also want to minimize testing time. To me this seems crucial for a chess
>programmer.
>
>This seems to be a case for sequential testing. I have an old "Handbook of
>Probability and Statistics" open in front of me. Let me give you two quotes
>from the sequential testing section.
>
>"Sequential tests are frequently more economical than non-sequential tests,
>especially if the number of trials is readily changed... Oftentimes the number
>of trials called for in a sequential test of a given reliability is considerably
>less than the number of trials required by the corresponding non-sequential
>test."
>
>"In a sequential test, certain calculations are made after each trial...the
>hypothesis is accepted or rejected as soon as it appears that the available data
>are adequate for making a decision..."
>
>This is probably what you are doing already, but the theory will give you a
>quantitative criterion for stopping testing.
>
>You might consider looking at the subject in a reference book or textbook, and
>seeing if you are comfortable with it.
>
>Incidentally, I would be glad to help with the analysis (if you wanted help),
>time permitting. I have a Ph.D in mathematical physics, the math courses for a
>Masters, and have been producing systems that measure hard to measure things for
>a living for 30 years. For example, my system (all design and analyses done by
>me) is the only system that can, and routinely does, measure the (miniscule)
>noise of the Seawolf submarine. So, I have done a few dozen statistical studies
>professionally.
>
>Anyway, best of luck.
>
>Walter

If I could compare two similar versions it would have some utility for me, but I'm not that interested in this. It's not the holy grail for me.

I'm interested in the general case of what you can derive from a given match result, because this topic comes up often here. I think that if people who want to compare two programs could use a sequential test, they'd be able to save a lot of time and make more accurate statements.

Often, we start out with two programs and we want to be able to make comparative statements about them:

1) A is stronger than B.
2) There is a 98% chance that A is stronger than B.
3) There is a 98% chance that A is at least 20 Elo points stronger than B.

Sometimes little is known about A and B. Sometimes A and B are exactly the same program, running on different hardware. Sometimes A and B are the same program, running on the same hardware, with a change to the software whose effect can be predicted, with some degree of accuracy, to be either major or minor.

When I read my statistics book, it sometimes makes things seem so simple, but it seems like things change a lot depending upon some of the secondary conditions.

This particular thread and its predecessor involved the significance of a 10-0 result, as opposed to a 60-40 result. There are people who read this group who are intuitively fearful of results of short matches, even if they are blowouts.

My initial argument was that 10-0 is much rarer than 60-40 (assuming a coin-flipping test), so if 10-0 happens you should be more confident about assessing the first program as better than the second than if you had gotten a 60-40 result.

Then I figured out that if you know that the two programs really are pretty close in strength, the ratio of good/bad 10-0 results is lower than the ratio of good/bad 60-40 results.

It's a little bit hard to describe what I mean by this. Let's say that A wins a few percent more than B, on average.
But you get this 10-0 result in this particular case. This should be very rare, but let's say that you do get it. You should get 10-0 favoring A more often than you get 10-0 favoring B, but you will get 10-0 favoring B sometimes in practical cases, as long as A isn't a lot better than B.

The same thing is true of 60-40. The stronger one will win by 60-40 sometimes, but other times the weaker one will win by 60-40.

So if you get one of these results you have to decide whether the stronger one won the match, or whether the weaker one did. If you say that one or the other won the match, there is a chance that you made a mistake. It is possible to compute this chance, and it just so happens that the chance that the weaker one won is distinctly higher in the 10-0 case.

I'm trying to reconcile this with the fact that 10-0 is a less common result, which I had thought should lead me to have more confidence if I get 10-0 than if I get 60-40.

I'm trying to get people to admit that sequential tests are a good thing. In order to do that I have to get people to accept that the result of a short match can be as valid as the result of a long match. But I am being overcome by this paradoxical situation involving small Elo deltas, which happen to be very practical Elo deltas.

Maybe I have explained why I'm confused. I certainly hope that I haven't emitted still more incomprehensible text. I'll work this out, but if I've fallen into some well-known fallacy, please let me know.

What I'd like to come up with eventually is a way of determining whether a given match result proves something, up to the current point.

Regarding this whole subject, there have been a few people who have said, "Well, just take an elementary statistics course." I think it's more complicated than that. You can't just look in an elementary text and figure out how to create a sequential design chart (the kind that came out of the work of Abraham Wald).

bruce
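For anyone who wants to try the 10-0 versus 60-40 comparison numerically, here is a minimal Python sketch. It is not from the post: the function names are made up, draws are ignored, every game is treated as an independent trial, and the stronger program's per-game win chance is simply assumed to be 0.52 (`P_EDGE`), standing in for "wins a few percent more".

```python
from math import comb, log

# Assumption for illustration only: the stronger program wins 52% of
# decisive games. Draws are ignored and games are treated as independent.
P_EDGE = 0.52


def prob_score(wins, losses, p):
    """Chance that a side with per-game win probability p scores wins-losses."""
    n = wins + losses
    return comb(n, wins) * p**wins * (1 - p) ** losses


def chance_weaker_won(wins, losses, p):
    """Given a wins-losses result for one side or the other, return the
    chance that the match winner was actually the weaker program.
    p (> 0.5) is the stronger program's per-game win probability."""
    stronger = prob_score(wins, losses, p)      # stronger side got `wins`
    weaker = prob_score(wins, losses, 1 - p)    # weaker side got `wins`
    return weaker / (stronger + weaker)


def log_likelihood_ratio(wins, losses, p):
    """Log likelihood ratio of 'the match winner has edge p' against
    'the match winner has edge 1 - p'. The binomial coefficients cancel,
    so only the margin wins - losses is left."""
    return (wins - losses) * log(p / (1 - p))


if __name__ == "__main__":
    for wins, losses in [(10, 0), (60, 40)]:
        print(
            f"{wins}-{losses}: "
            f"coin-flip chance {prob_score(wins, losses, 0.5):.5f}, "
            f"chance weaker side won {chance_weaker_won(wins, losses, P_EDGE):.3f}, "
            f"log LR {log_likelihood_ratio(wins, losses, P_EDGE):.3f}"
        )
```

Under these assumptions the two questions come apart. With a coin flip, 10-0 is indeed the rarer result. But when the only choice is between "winner has the small edge" and "loser has the small edge", the evidence depends only on the margin of wins over losses, which is 10 for 10-0 and 20 for 60-40. That may be one way to look at the apparent paradox, and a running log likelihood ratio of this kind is the quantity a Wald-style sequential test compares against its stopping thresholds.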
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.