Author: Dann Corbit
Date: 16:06:15 01/30/01
On January 30, 2001 at 18:43:12, Bruce Moreland wrote:

>On January 30, 2001 at 17:42:59, Dann Corbit wrote:
>
>>Additional measurements will not (in general) make the answer less accurate
>>(unless something is wrong with the measurements).
>
>If A is 1000 Elo points stronger than B, you will probably have a more accurate
>answer after 20 games than you will after 100 games if A is 1 Elo point stronger
>than B, and you get close to the expected 50-50 result.

This is simply not correct. Let's choose a simpler model and something we know is about even: heads or tails with a penny. Try ten flips, and have ten friends do the same. Most of the 11 experiments will not have 5/5 divisions of heads/tails. Repeat the experiment with a larger number of flips until you get bored with it. As you get a larger and larger number of measurements, the probability that you get close to the right answer increases. It does not decrease. And if you average the results from all 11 experimenters, you will get an even better result (on average).

>It's not just number of games, another major factor is actual relative strength,
>compared against the strength of the assertion you are trying to prove.

This is correct. You will also have problems if the strength difference is too great. If (for instance) you have one program 200 ELO stronger than yours and another 10000 ELO stronger, you might get 10-0 blankings 3 times in a row from both, even though there is an enormous difference in the strength of those two programs. Playing the two stronger programs against each other will help a lot.

>If you are trying to prove that A is no worse than 1000 Elo points worse than B,
>it will almost certainly be very easy to confidently make this assertion after
>20 games, if the two programs are the same strength. If A really is about 1000
>points worse than B, it will be harder.

Another good comment that I agree with.

>"A is stronger than B" can be a very weak claim, or it can be a very strong one.
>That is why there is no fixed amount of games necessary to prove this, it
>depends upon the actual Elo difference as measured by the match.

The best way to get a really good number is to play a very large number of games against a pool of very diverse talent.

>Of course, if you ran 500 games you could certainly make a claim that the
>difference can't be too far from what you have measured. If you get 252-248 you
>can't declare that A is better than B, but you can certainly declare that A is
>not likely to be much worse than B.

You can declare anything (of course), but what a set of experiments will tell you is an ELO strength +/- some given window. Considering (again) the top two SSDF list entries:

    Program      Hardware             Rating    +    -  Games  Won  Average opposition
  1 Fritz 6.0    128MB K6-2 450 MHz     2629   25  -24    845  67%                2506
  2 Junior 6.0   128MB K6-2 450 MHz     2589   23  -22   1027  65%                2483

We see that within one standard deviation, Junior could be as strong as 2589 + 23 = 2612, and Fritz could be as weak as 2629 - 24 = 2605. So, even within one standard deviation, we are not really sure which program is stronger. If we played only these two engines against each other, it would be more difficult to get an accurate rating. After a thousand games or so, we would still not be at all sure which is stronger, and after only ten games we would have no idea whatsoever.

>At the risk of being repetitive, the difficulty of proving an assertion about
>the strength of two programs seems to be very dependent upon the degree to
>which the assertion rides the razor edge of truth and falsity. If it's just
>barely true, you may never prove it.

This is undeniable. However, more data does nothing to detract from the quality of the answer. That was my only point. If I have two chess engines and I play one game, my error bar is infinite. If I play 100 games between them, the error bar is smaller, but still very large. If I play one trillion games between them, the error bar is very small (but still not zero).
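The "+/- some given window" idea can be made concrete with a short sketch (Python here; the function name is mine, and treating every game as a win or loss while ignoring draws, plus the normal-approximation error bar, are simplifying assumptions, not anything from the SSDF's actual method):

```python
import math

def elo_diff_with_window(score, games, z=1.96):
    """Elo difference implied by a match score, plus a ~95% window
    from the binomial standard error (draws ignored for simplicity)."""
    s = score / games                       # scoring fraction
    diff = 400 * math.log10(s / (1 - s))    # logistic Elo model
    sigma_s = math.sqrt(s * (1 - s) / games)
    # slope of the Elo curve at s, used to propagate the error
    slope = 400 / math.log(10) / (s * (1 - s))
    return diff, z * sigma_s * slope

diff, window = elo_diff_with_window(252, 500)
print(f"{diff:+.1f} Elo, window {window:.0f}")
```

For the 252-248 example above this comes out to roughly +3 Elo with a window near +/-30, which is exactly the situation described: the match cannot show that A is better than B, but it can show that A is unlikely to be much worse.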
However, with each additional measurement, confidence in the calculated strength difference rises. The closer the two programs are in strength, the more difficult it becomes to find out which one is really stronger. When programs are of approximately the same strength, it is virtually impossible to prove which one is stronger.
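As a back-of-the-envelope illustration of that last point, one can invert the same error-bar arithmetic and ask roughly how many games it takes before a ~95% window shrinks below the true gap (my own sketch, under the same no-draws, normal-approximation assumptions; not a formula from the post):

```python
import math

def games_to_resolve(elo_gap, z=1.96):
    """Approximate number of games before the ~95% Elo window is
    smaller than the true gap between the two programs (draws ignored)."""
    s = 1 / (1 + 10 ** (-elo_gap / 400))    # expected score for the gap
    slope = 400 / math.log(10) / (s * (1 - s))
    # need z * slope * sqrt(s*(1-s)/n) < elo_gap; solve for n
    return math.ceil((z * slope) ** 2 * s * (1 - s) / elo_gap ** 2)

for gap in (1, 10, 100):
    print(gap, games_to_resolve(gap))
```

Under these assumptions a 100-point gap resolves in on the order of fifty games, a 10-point gap needs a few thousand, and a 1-point gap needs several hundred thousand, which is the razor edge in numbers.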