Computer Chess Club Archives


Subject: CEGT 1 and 2 comparison (long text)

Author: Heinz van Kempen

Date: 02:28:00 04/19/05


Hi all,

CEGT 2 is finished. We now have 1088 games for each of the top engines. The
websites are updated.

http://www.chessfighters.de/cegt/

http://www.husvankempen.de/nunn/

Christian has done some great statistics comparing CEGT 1 and 2 and the
performance on Athlon and Pentium respectively:

http://www.chessfighters.de/cegt/html/statistics.html

and for those who really want to read more about a project that has already
been in progress for some months now, here is an explanation from my side:

<<This is to compare CEGT 1 and 2 and to explain why
we think it is necessary to play at least one thousand
games for each engine before drawing any conclusions.
Even then the error bars in the rating list are
around +-18 ELO points, which is still quite a lot.

So let us see: CEGT 1 and 2 were played on the same
computers using a wide spectrum of all Nunn positions
and some general books. Opponents for the engines were
partly different, but this alone does not explain the
huge differences you still have with only 500 games.
The differences are due to statistical aberrations,
which are of course even more pronounced with only 300,
200 or 100 games for each engine.
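To see where error bars of this size come from, here is a rough sketch.
It is not the method used for the CEGT rating list. It assumes a simple
logistic ELO model, an engine scoring 50%, a draw rate of 35% (a made-up
figure) and a 95% confidence level; the function name is only for
illustration.

```python
import math

def elo_margin(n_games, score=0.5, draw_rate=0.35, z=1.96):
    """Approximate 95% error margin (in ELO) after n_games.

    Assumptions (not from the CEGT list): independent games, a fixed
    expected score and draw rate, and the logistic ELO curve.
    """
    p_win = score - draw_rate / 2.0
    # Variance of a single game result (1, 0.5 or 0 points).
    variance = p_win * 1.0 + draw_rate * 0.25 - score ** 2
    std_error = math.sqrt(variance / n_games)
    # Slope of the ELO curve: ELO points per unit of score fraction.
    elo_per_score = 400.0 / (math.log(10) * score * (1.0 - score))
    return z * std_error * elo_per_score

for n in (100, 200, 300, 500, 1000):
    print(f"{n:5d} games: about +-{elo_margin(n):.0f} ELO")
```

Under these assumptions the margin shrinks only with the square root of
the number of games, so four times as many games are needed to halve it.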


CEGT 1 4080 games, 16 engines, 510 games for each engine

CEGT 2 5202 games, 18 engines, 578 games for each engine

1088 games in total for each top engine
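The per-engine figures can be checked from the totals, assuming every
game is counted once for each of its two participants (the helper name
is just for illustration):

```python
def games_per_engine(total_games, engines):
    # Each game involves two engines, so it counts twice in the per-engine sum.
    return 2 * total_games / engines

cegt1 = games_per_engine(4080, 16)
cegt2 = games_per_engine(5202, 18)
print(cegt1, cegt2, cegt1 + cegt2)
```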


Similarities and striking differences:

-Shredder 9 dominated in both. Its ELO after a total of 1088
games is now exactly 50 points higher than that of the next
best, Fritz 8 Bilbao

-Junior 9 seemed to be the second best engine in CEGT 1,
scoring 17.5 points more than Fritz 8, but then in
CEGT 2 the same Fritz 8 version scored 41 points more
from the games than Junior !!! Almost unbelievable.
Now the rating of Fritz from the combined CEGTs
is 15 points higher than Junior's

-Hiarcs and Chess Tiger equally good in both CEGTs,
but it is striking that Chess Tiger performs overall
much better on Athlon than on Pentium CPUs

-Gandalf 6, really weird. Rank 4 in CEGT 1, scoring 15.5
points more from games than Hiarcs. Only rank 9 in CEGT 2
and 23.5 points less than Hiarcs in the overall table

-Ruffian 2.1.0 in both tournaments behind most other
commercials

-List, only rank 10 in CEGT 1, but number 6 in the total
score of CEGT 2 after 578 games and better than Ruffian
and Gandalf in that one

-ProDeo with a good performance in both. After 1088 games
for each we have 1 point rating difference between List and
ProDeo, so we will not even be able to tell after 5000 games
or more which one is the best amateur

-Chessmaster Steadfast is 32 ELO points better than CMX Yoda,
but with only just over 500 games for each setting it would not
be correct to claim that it is better. Believe it or not,
just look at the error bars

-SOS 5 started furiously in CEGT 1 and for a long time
seemed to be the best amateur. But after 400, 600, 800 games
it dropped and dropped, and now Fruit is ahead of SOS because
it performed better the longer the tournaments lasted.
Anyway, after 1088 games SOS 5 is still ahead of Aristarch,
which still means a considerable improvement over SOS 4.

After seeing all this we do not dare to draw any conclusions at
all for those engines where we have only just over 500 games so far.
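As a back-of-the-envelope check on the comparisons above, one can ask
how many games per engine are needed before a given rating gap exceeds
the combined error bars. This sketch takes the figure quoted here
(about +-18 ELO at roughly 1000 games) as an anchor and assumes that
error bars shrink with the square root of the number of games and that
the two engines' errors are independent, which is only approximately
true inside one rating list. All names are for illustration.

```python
import math

ANCHOR_GAMES = 1000    # assumption: the "+-18" figure refers to about 1000 games
ANCHOR_MARGIN = 18.0   # ELO error bar quoted in the post

def margin(n_games):
    """Error bar for one engine, scaled by 1/sqrt(n_games)."""
    return ANCHOR_MARGIN * math.sqrt(ANCHOR_GAMES / n_games)

def diff_margin(n_games):
    """Error bar on the gap between two engines with n_games each."""
    return math.sqrt(2) * margin(n_games)

def games_to_resolve(elo_diff):
    """Games per engine until the gap exceeds its own error bar."""
    return math.ceil(2 * ANCHOR_MARGIN ** 2 * ANCHOR_GAMES / elo_diff ** 2)

print(f"gap error bar at 510 games: +-{diff_margin(510):.0f} ELO")
for gap in (50, 32, 15, 1):
    print(f"{gap:2d} ELO gap: about {games_to_resolve(gap)} games each")
```

On these assumptions a 32-point gap is still inside the combined error
bar at just over 500 games, and a 1-point gap would need several
hundred thousand games per engine.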

Considering all this, and regardless of the time control, be it Blitz
or the longer time controls we are using, it is for example
absurd to state after only 100 games that Engine X has improved by
50 ELO points, or that it can already be seen that a new version
is not better than the previous one.

Let me give an example. You have Engine X version 3.0 and play 100
games against different opponents, and you have Engine X version 4.0
and play 100 games under the same conditions. You get the
same rating for both and claim that there is no improvement. You
repeat this and play another 100 games for both and find that the new
one is 120 points better, and you are enthusiastic about the
improvement. You play a third time and now the old version is 120
ELO points better and you are disappointed. But when you look at the
error bars you find that all the results are statistically in the
normal range, because error bars with 100 games for an engine version
are +-60. And then 5% of results still fall outside even those
statistically normal ranges. Can you then understand how hard it is to
come to any conclusions at all?
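The thought experiment can be tried out with a small simulation. This is
only a sketch under invented assumptions: both "versions" are in fact
equally strong, each game is won, drawn or lost with fixed probabilities
of 32.5%, 35% and 32.5%, and the rating is a plain performance rating
from the score percentage. None of these numbers come from the CEGT data.

```python
import math
import random

def performance_elo(score):
    """ELO difference to the opposition implied by a score fraction."""
    score = min(max(score, 0.001), 0.999)  # avoid log of zero at 0% or 100%
    return -400.0 * math.log10(1.0 / score - 1.0)

def play_match(rng, n_games, p_win=0.325, p_draw=0.35):
    """Score fraction from n_games with fixed win/draw probabilities."""
    points = 0.0
    for _ in range(n_games):
        r = rng.random()
        if r < p_win:
            points += 1.0
        elif r < p_win + p_draw:
            points += 0.5
    return points / n_games

def simulate_differences(trials=2000, n_games=100, seed=1):
    """Rating gap 'new minus old' for two identical engines, many times over."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        old = performance_elo(play_match(rng, n_games))
        new = performance_elo(play_match(rng, n_games))
        diffs.append(new - old)
    return diffs

diffs = simulate_differences()
share_over_50 = sum(abs(d) > 50 for d in diffs) / len(diffs)
print(f"largest swing: {max(abs(d) for d in diffs):.0f} ELO")
print(f"share of runs with a gap above 50 ELO: {share_over_50:.0%}")
```

Even though nothing has changed between the two versions, a fair share
of the 100-game runs show an apparent "improvement" or "regression" of
more than 50 ELO.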

This is why we are continuing to investigate the "truth" and do
not post quick sensational results after only 50 or 100 games
for a new engine. It may happen (if we do not lose our energy)
that we will soon have even 1500 or 2000 games for the top commercials
and at least 1000 for more and more top amateurs, which should be
a minimum.

On the other hand, something like this seems crazy to normal
people, but some of you will understand that it is a lot of fun to run
those engine tournaments, and we will probably continue for a few more months.

Anyway, we should stop before the men in the white coats come with the
straitjackets.>>

Best Regards
Heinz



