Computer Chess Club Archives


Subject: Re: LCTII test vs SSDF results

Author: Enrique Irazoqui

Date: 03:14:05 09/28/99

On September 27, 1999 at 18:55:12, Bruce Moreland wrote:

>Here is my point.  Let's say that you play 10,000 round robin tournaments
>between programs A, B, C, and D, and you get the following estimated ratings:
>
>A  2400
>B  2425
>C  2450
>D  2475
>
>I can build a test that will allow you to predict the ratings of these programs.
> The test is:
>
>    Elo = 2375 + N * 25
>
>Where N is the ordinal value of the name of the program (A=1, B=2, C=3, D=4),
>and integer math is used.
>
>This formula predicts the ratings of each program with absolute accuracy.  If
>you do this test at home, you will be able to predict the ratings yourself, and
>it will always work.
>
>Assume this test is done on a 200 MHz computer, and you guess that doubling
>speed is worth 100 Elo points.  Then the formula can be modified as follows:
>
>    Elo = 2375 + N * 25 + log2(mhz / 200) * 100

The formula I use is -log2(time)*60, assuming that a doubling in speed brings an
increase of 60 Elo points. I apply the formula to each position and then average
all the results. This way I get final figures that look like -463, -467, -474,
etc., which give me the difference in Elo between programs. Setting the best = 0,
I get 0 (Tiger), -4 (N7.32), -11 (CM6K)... I don't even try to come up with
absolute Elo figures.
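
Roughly, the calculation looks like this (a minimal sketch in Python; the
solution times below are made up just for illustration, and it assumes times in
seconds and that every position gets solved):

    import math

    ELO_PER_DOUBLING = 60   # assumption: one doubling in speed ~ 60 Elo points

    def raw_score(times):
        # average of -log2(time) * 60 over all positions in the test
        return sum(-math.log2(t) * ELO_PER_DOUBLING for t in times) / len(times)

    # made-up solution times in seconds for three positions, illustration only
    times = {"Tiger": [35.0, 120.0, 8.0],
             "N7.32": [40.0, 130.0, 9.0],
             "CM6K":  [50.0, 140.0, 11.0]}

    raw = {name: raw_score(t) for name, t in times.items()}
    best = max(raw.values())
    print({name: round(score - best) for name, score in raw.items()})   # best = 0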

If I want to compare the results with the SSDF list, I take the program in my
test that has played the most games on a given platform in their list (Fritz 5
on a P200MMX gets 2566 after 946 games) and add whatever it takes to make my
rating of F5 = 2566, so I end up with the list I posted yesterday. It is NOT
tuned for the SSDF in any way; it only gives rating differences on a scale I can
compare with any rating list around.
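
In code, that anchoring step is nothing more than a constant offset (again a
sketch; only the 0, -4, -11 figures come from my test, the -30 for Fritz 5 is
invented here just to have something to shift):

    SSDF_FRITZ5 = 2566   # Fritz 5 on a P200MMX in the SSDF list, after 946 games

    relative = {"Tiger": 0, "N7.32": -4, "CM6K": -11, "Fritz 5": -30}  # -30 invented
    offset = SSDF_FRITZ5 - relative["Fritz 5"]

    print({name: diff + offset for name, diff in relative.items()})
    # Fritz 5 comes back as 2566; every other program is shifted by the same amount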

I am sure that anyone can come up with similar results, provided that there are
enough positions (76 in my test) that are not cooked, not ambiguous, not too
easy and not impossible to solve. This rules out all published test sets,
because they are heavily cooked. This is why I don't post the positions I use.

>This will produce the same values as before, only it will also accurately
>predict the rating if you increase processor speed.
>
>You can't argue with this test, it is a perfect predictor for these four
>programs on any computer, assuming that 100 points per doubling holds up.
>
>Why won't anyone agree that this is a good test?  Because the test has no
>relevance to the strength of the programs.  I couldn't get anyone to agree that
>what you call your program matters enough that it affects the rating like this.
>
>But look at these suite tests.  They are also created using a fixed set of
>programs, on a particular processor, and calibrated to a scale that has been
>predetermined (typically the SSDF list, or someone's feelings about what the
>SSDF list should really show).  Do you think that this isn't done?  I can't
>imagine someone just doing a test and picking a formula at random and magically
>the right Elo numbers come out.  No, the test and the formula are both
>calibrated against some predefined reality.

Not always. I picked the positions carefully and the final results fell where
they fell. That these results are so close in relative strength to the SSDF list
surprised me quite a bit. In fact, when I started building this test set I
wanted to use it as a way to find the difference in non-tactical strength
between programs: I would get a result for tactics, and the difference between
this and the real playing performance would be the non-tactical part. As it
turns out, this non-tactical strength is comparable across all the best
programs, while pure tactics is what makes the difference in comp-comp play.
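
Written out, the decomposition I had in mind is just a subtraction (placeholder
numbers, not measurements):

    def nontactical_gap(game_gap, tactical_gap):
        # rating gap from real games minus the gap measured by the tactical test
        return game_gap - tactical_gap

    print(nontactical_gap(game_gap=25, tactical_gap=20))   # placeholder values -> 5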

>When you run one of these suites at home, on one of the programs that the test
>was calibrated with, you are just replaying the number that the suite author
>determined that the program should get.  It's not predicting anything, the whole
>test is a recording of predetermined "facts", for many of the most popular
>programs.
>
>There is some question about whether the suite can be accurate for the programs
>that it wasn't calibrated with, but since most of these probably use the most
>popular programs to calibrate, you can't really compare anything.  If one of the
>calibration programs comes back 2475, and some new program comes back with 2450,
>who is to say that that the new program is really weaker, since the test has
>been fiddled with until it produces 2475 for the first program?

And nevertheless... I tested programs not included in the SSDF list, but that
have played in their games under unidentified code names like "9", etc. Maybe
they can tell us whether my ratings for these programs differ significantly from
theirs. I don't think they do.

Of course, there is an important limitation in this kind of test: any programmer
can easily tune for tactics. In that case, the program in question will perform
better in the test and worse in real life. But somehow this doesn't happen,
because programmers do not tune for tests but for real play. In any case, the
results I posted yesterday are so close in relative strength to the SSDF list
that I have a hard time believing it is just a coincidence.

Enrique

>bruce


