Computer Chess Club Archives



Subject: Re: FRITZ 5.32 VS. HIARCS 7.32 (120'/40 + 60'/20 + 30'), Comments welcome

Author: Ricardo Gibert

Date: 08:03:15 06/23/99


I have a problem with these engine vs engine tests that get posted here.  I'm
not trying to single you out on this; yours just happens to be the latest such
post.  The problem I have is that people draw misleading conclusions from such
testing.  The misleading conclusion is that a "favorable" record of "this"
program over "that" program "implies" that "this" program is stronger than
"that" one.

To take a couple of examples among humans: should we conclude that M. Tal was a
stronger player than Fischer, because he had a +4 =5 -2 record against him?  Was
E. Geller a stronger player than Fischer, because he had a +5 =2 -4 record
against him?  Am I a stronger player than Jack Peters, because I have a +2 =0 -0
record against him?  Peters outrates me by 300 points! I'm not even close!
Practically everyone agrees Fischer was the stronger player! These scores
reflect more about luck and the relative idiosyncrasies of the respective
players than anything else.

In engine vs engine testing I think the idiosyncrasies play an even greater part
in the results.  The point is, how an individual (program) performs against
another individual (program) is a very poor benchmark of relative strength.  A
better indicator is how well an individual performs against the general
population of players.  A large number of games against varied opposition is
much more reliable; after about a hundred such games we can draw a much more
reasonable conclusion about relative strength.  People tend to draw the wrong
conclusions from such "testing".

What I want out of a chess playing program is a strong opponent that effectively
helps me improve and can be relied upon for analysis.  To this end, a much
better way to determine a program's strength would be to have it play about a
hundred games on ICC against titled players at a slow time control.  Compute
the average rating held by the computer and the standard deviation.  Then you
can get an idea of the program's strength.
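The measurement described above can be sketched in a few lines.  This is a
minimal illustration, not anything the author specifies: the rating history and
opponent list are made-up numbers, and the performance-rating function uses the
common linear approximation (average opponent rating plus 400 times the
win-loss margin divided by games played), which is only one of several
conventions.

```python
from statistics import mean, stdev

def performance_rating(opponent_ratings, score):
    """Linear approximation of performance rating:
    average opponent rating + 400 * (wins - losses) / games.
    `score` is total points (win = 1, draw = 0.5)."""
    games = len(opponent_ratings)
    return mean(opponent_ratings) + 400 * (2 * score - games) / games

# Hypothetical ratings the engine held after each of its games on the server
rating_history = [2410, 2425, 2418, 2440, 2432, 2451, 2446, 2438]
print("average rating:", round(mean(rating_history), 1))    # -> 2432.5
print("std deviation:", round(stdev(rating_history), 1))    # -> 14.1

# Hypothetical opponent ratings and a 6.5/8 score against them
opponents = [2350, 2400, 2500, 2450, 2380, 2420, 2470, 2390]
print("performance:", performance_rating(opponents, 6.5))   # -> 2670.0
```

With a hundred games rather than eight, the standard deviation gives a rough
sense of how stable the estimate is, which is exactly the point being made
about small samples.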

To me the recent Paderborn tournament does not tell you much of anything about
how strong the programs were.  Not enough games.  Unequal hardware.  Opponents
were not human.  When you purchase a chess playing program, your primary
interest is NOT how well it will do against other programs, but rather how well
it will play against YOU in practice, how useful it is in preparing you to play
your HUMAN opponent in your next tournament, and how much you can learn from it.
The program that does best against strong human opponents is your best bet.

Having said all that, I don't mean to imply that engine vs engine testing is
without value.  The results are not without interest and independent
significance.  I can imagine performing such testing can be quite fun.  Perhaps
akin to playing fantasy baseball.  Moreover, despite what I said about
Paderborn, I followed the event very closely.  The games were very interesting
and I drew my own conclusions about how computers play, but I have no idea which
is the stronger program.  I have a better idea of which program had the
strongest hardware!

As for what is a reasonable motive for the participants to play in Paderborn:
1) It pays to advertise (especially when luck is on your side)
2) "Because it's THERE" (like Mount Everest)
3) "My boss told me to go"




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.