Computer Chess Club Archives



Subject: Re: General Objection Against CEGT Stats

Author: Dagh Nielsen

Date: 06:29:13 12/07/05



On December 07, 2005 at 08:58:08, Rolf Tueschen wrote:

>On December 07, 2005 at 08:43:29, Dagh Nielsen wrote:
>
>>On December 07, 2005 at 08:04:54, Rolf Tueschen wrote:
>>
>>>If we think about a testing design, we dream of as much data as we can
>>>get, because we know that statistical significance has something to do
>>>with HIGH numbers of trials, games, or data points. Please believe me
>>>that I don't want to bash all sorts of activities in the testing hobby.
>>>This is just a plea to take care and be attentive to what one is doing.
>>>
>>>Say you (general you) have three, just these three, top engines and 500
>>>engines on the free market of varying strength.
>>>
>>>Could you just do the testing the way it's done on CEGT? I have serious doubts.
>>>
>>>Look at this: say these three top acts are vastly stronger in chess
>>>than all the other 500 (which is apparently NOT the case in CEGT!).
>>>Then what are you testing in such short matches of 20 or so games? Are
>>>you really testing chess strength? I don't think so.
>>>
>>>In my view, the following is tested: how well the top engines solve the
>>>various technical problems during tournament play. Just look at the 14
>>>SHREDDER losses in the 300 rating. Compare it with FRITZ.
>>>
>>>I don't want to bore you with mathematical calculations, but let me say
>>>it in words.
>>>
>>>The more relatively weak opponents you match against three or, say,
>>>five top programs, the more the irrelevant technical details and
>>>chess-related singularities (exceptional games) sum up and influence
>>>your ranking.
>>>
>>>You must decide what you want to get. You are not interested in testing
>>>the top programs; you want a ranking of the many free or amateur
>>>engines, at least. Isn't that so?
>>>
>>>I say that you can't compare these many engines with the top three. You
>>>would do better to test without them, because it is a delusion to
>>>assume that by comparing against the very few top programs you get a
>>>reasonable "Elo", or whatever you call it, for the "little" engines.
>>>Believing in such a mechanism is the same type of error the SSDF people
>>>made for years. You remember: they once "calibrated" their tests with
>>>some (!) few (!) games against IMs or Swedish masters, in the stone-age
>>>times of CC. And then later they somehow wriggled around with this
>>>calibration to produce a reasonable-looking Elo figure, on the basis of
>>>the games those masters played against MEPHISTO, I don't know any more.
>>>Such testing is absolute nonsense.
>>>
>>>In other words: you never know exactly what you are really testing.
>>>Here in CEGT, it would be far better if you tested among the 500
>>>amateurs. Then you would get a ranking over time. But to test how a new
>>>engine like Rybka would do against SHREDDER or FRITZ or CHESSMASTER,
>>>you must design a different test. For that question, all the results of
>>>these 500 engines are only disturbing noise.
>>>
>>>Please ask if something is not understandable. I wrote this to prevent
>>>the whole set of results from being criticised later, after enormous
>>>effort has been spent. That would be a pity for all the very motivated
>>>fans of our hobby, CC. So please ask before you go off on tangents
>>>because you think I am nuts with my critique.
>>
>>
>>Hi, interesting post :-) Some comments:
>>
>>(1) As I see it, your critique is in essence directed more towards the
>>very idea of an Elo system than towards the CEGT way of testing. The
>>Elo idea is that you can measure (and predict!) performance very
>>accurately by using results from matches between opponents of varying
>>strength. That is the beauty and the underlying assumption: the
>>predictive value. Jeff Sonas made a statistical analysis of the Elo
>>table and concluded that it tended to "punish" a player when playing
>>against far weaker opponents, and he instead suggested a simple linear
>>formula that should give better predictive value. But still, the
>>philosophy behind his formula is very much the same as that behind the
>>Elo table, and the difference between the two versions is almost
>>negligible. I am not sure which version CEGT uses, but personally I
>>would prefer the Sonas version, and that applies to the FIDE rating as
>>well.
>>
>>In light of the above observation, I am perfectly happy with the CEGT
>>ranking list; I just don't read anything into it that is not given. CEGT
>>tests how well engines perform in an environment with a multitude of
>>different competitors, and that's it.
>>
>>(2) Something to back up your critique: I noticed an interesting
>>phenomenon during the Rybka testing so far: Fritz 9 seems to perform
>>relatively well against this engine (notice the Fritz 9 jump in the
>>blitz list compared to Fruit 2.2.1). This observation was already made
>>in chat on the first day of public testing on playchess. It may be
>>wrong, but for argument's sake, let's assume it is correct. THEN there
>>is a "problem" with the CEGT list, in that it involves a different
>>environment than the "elite" environment seen on, for instance,
>>playchess. And hence, the data from playing against weaker engines *may*
>>pollute the predictive value in an elite environment (again, the Elo
>>assumption is that such pollution can be ignored).
>>
>>That leads me to ask you: what do you want from a ranking list? At the
>>extreme, if you are only interested in performance within a group of 5
>>engines, by all means conduct a 10000-game test tournament between those
>>5 engines to get the most precise predictive value. Or even restrict
>>your test to 2 top engines!?
>>
>>But what if you want an "all-round" measurement, i.e., a predictive
>>indicator for results against a "random" environment? Then CEGT fits
>>the bill perfectly, in my opinion. It should just be read while keeping
>>in mind its restricted applicability to "special cases" like elite
>>environments where only a handful of engines battle it out for the
>>crown.
>>
>>BTW, this issue is also relevant when discussing the practice of
>>conducting closed elite tournaments between humans vs. open tournaments
>>that include weaker opponents. Some top human players have a style
>>suited to scoring highly against weaker opponents, while others have a
>>more solid style. Should the Elo rewards for being able to achieve
>>astronomical scores against weaker opponents be discarded? By analogy,
>>that is what your critique seems to imply in the CEGT case.
>>
>>Regards,
>>Dagh Nielsen
>
>
>Thanks very much. You gave even more points to discuss! It's impossible
>to respond on the fly, as I am doing right now, so please have tolerance
>for such restrictions. I will study the whole message and then comment
>later if there is any real dissent to note. I doubt it, because your
>approach is so elevated and contains so many valuable statements that it
>would be worthwhile to focus on a single point.
>
>But still, let me confront you - on your level it will be real suspense
>to see how you react - with a single aspect that was also the main point
>in my old critique of the SSDF and all the results that followed.
>
>Would you see a difference or not in the following maths?
>
>I play 20-game matches between different engines and in the end add the
>results together. Or, second version: I play 1000-game matches between
>each pair of engines.
>
>My point back in 1997, and now, is this: if you do it the first way, you
>sum up errors (artefacts), while with the second variant you would
>probably find that far more of the differences are insignificant
>overall. It was always my opinion (knowledge) that we were discussing
>differences that did NOT exist in reality, but appeared only because of
>this particular and, for practical purposes, very reasonable testing in
>Sweden, where amateurs played these many games by hand, without an
>autoplayer.
>
>What is your opinion? :)

Hi, thank you for your kind words :-)

I think I agree with you in one respect: that one should be careful not
to make too long "chains" of conclusions based on test sets such as
these. One example: engines may improve by, say, 400 Elo points relative
to earlier generations of engines when tested in an engine-only
environment. Can we conclude that these engines will also perform 400
Elo points better against humans? As far as I know, nobody knows. It is
not unlikely that the two generations of engines would be more evenly
helpless against anti-computer strategies employed by humans, and the
second generation may hence perform only, say, 200 Elo points better
against humans. I agree that the SSDF has this weakness if they want to
claim that their Elo list is good for predicting performance against
humans.
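
As for your 20 versus 1000 games question, the basic statistics support
you. Here is a rough sketch of my own (not CEGT's or the SSDF's method;
it assumes every game is an independent trial with a fixed expected
score, and it ignores that draws shrink the error somewhat):

    import math

    def elo_error(n_games, score=0.5):
        # One-sigma sampling error of a match result, in Elo points.
        se = math.sqrt(score * (1.0 - score) / n_games)  # error of the score fraction
        # Near a 50% score, one unit of score fraction corresponds to
        # about 695 Elo (1600/ln 10, the inverse slope of the Elo curve).
        return se * 695

    print(round(elo_error(20)))    # about 78 Elo for a 20-game match
    print(round(elo_error(1000)))  # about 11 Elo for a 1000-game match

So a single 20-game match cannot separate engines that are within
roughly 80 Elo of each other, while a 1000-game match narrows this to
about 11 Elo. Pooling many short matches does recover precision, but
only if the per-match quirks really average out rather than sum up.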

If I understand you correctly, then we could call some "engine vs.
engine" advantages (like better tactical search) "artefacts" that are
less relevant in an "engine vs. human" environment, where it is assumed
that humans will steer the game into positional waters.

Similarly, one may be able to identify "ability to crush weaklings"
artefacts that would not be very relevant in engine-engine matches
between two top engines of about equal strength. In fact, I think such
an artefact has already been discovered on this board: Rybka's weak
endgame play. The point would be that against weaker opponents, Rybka
wins the game before the endgame, and hence it performs relatively
better against weaker opponents than against the handful of top engines.
This is just speculation, of course.

I think you can have three things at the same time:

(1) A ranking list such as CEGT that tests engines in a "random" environment.

(2) Dedicated tests in limited or specific environments, like only two
engines, a handful of top engines, or maybe even some humans and some
engines (not very likely :-). (BTW, CEGT does offer information about
concrete engine vs. engine results.)

(3) Concrete discussion of a particular engine's strengths and weaknesses
("artefacts"?).

The latter is difficult to quantify. Only results are quantifiable ;-)

Regards,
Dagh Nielsen
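
P.S. For anyone curious about the Elo vs. Sonas point above, the two
predictors can be sketched like this (my own illustration; the linear
slope constant is quoted from memory and may differ from Sonas's
published value, so treat the numbers as approximate):

    def elo_expected(diff):
        # Classical Elo expectation: logistic curve on a 400-point scale
        return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

    def sonas_expected(diff, slope=850.0):
        # Linear alternative in the spirit of Sonas, clamped to [0, 1];
        # the slope (about 1/850 per rating point) is an assumption here
        return max(0.0, min(1.0, 0.5 + diff / slope))

    for d in (0, 100, 200, 400):
        print(d, round(elo_expected(d), 3), round(sonas_expected(d), 3))

In the middle range the two curves differ by only a couple of
percentage points, which is why I called the difference almost
negligible; they only really part ways at large rating gaps.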


