Computer Chess Club Archives


Subject: General Objection Against CEGT Stats

Author: Rolf Tueschen

Date: 05:04:54 12/07/05


When we think about a testing design, we dream of as much data as we can get,
because we know that statistical significance has something to do with HIGH
numbers of trials, games or data. Please believe me that I don't want to bash
all sorts of activities in the testing hobby. This is just a plea to take care
and to be attentive to what one is doing.

Say you (general you) have three, just these three, top engines and 500 engines
on the free market of differing strength.

Could you just do the testing the way it's done on CEGT? I have serious doubts.

Look at this: say these three top acts are incredibly much stronger in chess
strength than all the other 500 (which is apparently NOT the case in CEGT!).
Then what are you testing in such little matches of 20 or so games? Are you
really testing chess strength? I don't think so.
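
To put a number on how little a 20-game match says, here is a minimal sketch.
It assumes the usual logistic Elo formula and a simple normal approximation for
the score; the function names and the example result (12 wins, 4 draws, 4
losses) are made up for illustration and are not CEGT data.

```python
import math

def elo_diff(score):
    """Elo difference implied by a score fraction (logistic Elo model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def score_interval(wins, draws, losses, z=1.96):
    """Mean score and an approximate 95% interval (normal approximation)."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    half = z * math.sqrt(var / n)
    # Clamp so that elo_diff never sees exactly 0 or 1.
    low = max(mean - half, 1e-6)
    high = min(mean + half, 1.0 - 1e-6)
    return mean, low, high

# Hypothetical 20-game match: 12 wins, 4 draws, 4 losses (14/20 points).
mean, low, high = score_interval(12, 4, 4)
print("Elo estimate:", round(elo_diff(mean)))
print("approx. 95% range:", round(elo_diff(low)), "to", round(elo_diff(high)))
```

Under these assumptions the range comes out a few hundred Elo points wide, so
a single match of this length hardly separates two engines.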

In my view the following is tested: how well the top engines solve the
different technical problems during tournament play. Just see the 14 SHREDDER
losses in the 300 rating. Compare it with FRITZ.

I don't want to bore you with mathematical calculations, so let me say it in
words.

The more opponents of relatively weaker strength you match against three or,
say, five top programs, the more irrelevant technical details and
chess-dependent singularities (exceptions in the game) add up and influence
your ranking.
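
As a rough illustration of this point, the sketch below compares what one
single lost game does to the implied Elo gap in a lopsided 20-game match and in
an even one. It assumes the usual logistic Elo formula; the function name and
the scores are hypothetical, not taken from CEGT.

```python
import math

def implied_gap(points, games):
    """Elo gap implied by a match score under the logistic Elo model."""
    score = points / games
    return -400.0 * math.log10(1.0 / score - 1.0)

# Lopsided match: a top engine scores 19.5/20, or 18.5/20 if one game is
# lost for a technical reason (time forfeit, crash, and so on).
lopsided_shift = implied_gap(19.5, 20) - implied_gap(18.5, 20)

# Even match: 10.5/20 versus 9.5/20, again a difference of one game.
even_shift = implied_gap(10.5, 20) - implied_gap(9.5, 20)

print("one game in a lopsided match shifts the gap by", round(lopsided_shift))
print("one game in an even match shifts the gap by", round(even_shift))
```

In this model the same single game moves the implied gap several times further
in the lopsided match than in the even one, which is why isolated technical
losses weigh so heavily when top engines play much weaker opponents.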

You must decide what you want to get. You are not interested in testing the top
programs. You want to get a ranking of the many free engines, or of the
amateurs at least. Isn't that so?

I say that you can't compare these many with the top three. You would do better
to test without them, because it is a delusion to assume that by comparing
against the very few at the top you get a reasonable "Elo", or whatever you
call it, for the "little" engines. Believing in such a mechanism is the same
type of error the SSDF people made for years. You remember: they once
"calibrated" their tests with some (!) few (!) games against IMs or Swedish
masters, in the stone-age times of CC. And then later they somehow wriggled
around with this calibration to give a reasonable-looking Elo figure, on the
basis of the games of these masters against MEPHISTO. I don't know more.
Such testing is absolute nonsense.

In other words, you never know exactly what you are really testing. Here in
CEGT it would be far better if you tested among the 500 amateurs. Then you will
get a ranking over time. But to test how a new engine like Rybka would do
against SHREDDER or FRITZ or CHESSMASTER, you must create a different test. For
that question, the results of these 500 engines are only disturbing noise.

Please ask if something is not understandable. I wrote this to prevent the
whole results from being criticised later, after enormous effort. That would be
a pity for all the very motivated fans of our hobby CC. So please ask before
you go off on tangents because you think that I am nuts with my criticism.




Last modified: Thu, 15 Apr 21 08:11:13 -0700
