Computer Chess Club Archives



Subject: Re: Proposal: New testing methods for SSDF (1)

Author: Bruce Moreland

Date: 10:34:55 04/13/98




On April 13, 1998 at 10:42:39, Jeroen Noomen wrote:

I agree with some of this, but I have a few strong disagreements.

>There are, however, a few points that should be cleared up: If a
>neutral organisation like SSDF is testing chessprograms, the conditions
>for these tests should be as equal as possible. That means: If you
>present results based on Pentium 200 MMX, these should be the SAME
>Pentiums, with the same configuration.

Yes, they should try to run similar configurations if they can.

>Another point: the ChessBase autoplayer.

I doubt that their autoplayer cheats.

Please correct me if I am wrong, but as far as I can tell, all of you
guys test like crazy against each other.  The reason you can do it is an
accident -- by supporting automatic chessboards you unwittingly produced
a standard for data interchange.  The Donninger autoplayer uses this
standard in a different way than was intended, and suddenly there is an
autoplayer, and you guys all have a half dozen machines or more going in
your respective labs all the time.

But if Fritz has an autoplayer that works with all of the commercial
programs, but the autoplayer that everyone else has access to does not
work with Fritz, then Fritz gains an advantage.

It could be an offensive advantage, meaning that they could book against
the rest of you, but it doesn't follow that this advantage has to be
used.

They also gain a defensive advantage, in that you guys can't book
against them as easily.

Same thing with the special book; this is another defensive advantage.

My guess is that the Fritz guys were thinking defensively -- they wanted
to make sure they didn't run on crappy machines, they wanted to make
sure that nobody booked up against them, and they used a large book to
try to make sure that nobody could learn effectively against them.

But a defensive advantage is still an advantage.  So something is wrong,
and people are complaining.  They are complaining by suggesting too much
(cheating), but a complaint of some sort is justified, as long as it's a
complaint *to* the SSDF guys and not *about* them.

>I have the feeling that it was wrong from SSDF to accept this way of
>testing, WITHOUT consulting other programmers if they agree or not. A
>great risk is that programmers are now removing the AUTO232 software,
>or 'worse': are starting to make their own autoplayer. This leads to 6
>different autoplayers (or more) and the results of SSDF would be
>getting more and more unclear. Furthermore: Instead of taking time and
>effort to improve the playing strength and features of a program, all
>programmers have to spend a huge amount of time on autoplayers,
>booktuning, booklearners and so on. Inevitable conclusion: The real
>playing strength improvements are becoming less and less; instead a
>lot of 'statistical improvements' are made. E.g. who is the best at
>copying won games to gain more Elo points. (Note that older programs
>do not have any defence against such treatments. If you take program A
>without a learner and the very same program with a learner, then the
>last one will win, although it is the same program and absolutely not
>stronger at all. But statistically everybody will say that the second
>program is stronger.)

Right.  You guys are in a different kind of arms race now.  Someone has
taken their program out of the autoplayer pool.  And many of you are
finding ways of taking advantage of the inadequacies of your opponents.
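
To put a rough number on the quoted learner example -- my own
back-of-the-envelope arithmetic with made-up percentages, not anything
from the SSDF -- the usual Elo expectancy formula converts a score into
a rating gap.  A learner that simply replays its won games and turns a
50% result into, say, 60% against the same pool looks about 70 points
"stronger" on the list, even though the engine itself is unchanged:

  import math

  def score_to_elo(score):
      # Invert the Elo expectancy E = 1 / (1 + 10 ** (-d / 400)):
      # turn an expected score (0..1) into an approximate rating difference.
      return -400 * math.log10(1.0 / score - 1.0)

  # Hypothetical numbers: the same engine, with and without a book learner.
  print(round(score_to_elo(0.50)))   # 0 Elo   -- no learner, 50% vs the pool
  print(round(score_to_elo(0.60)))   # ~70 Elo -- learner replays won games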

The SSDF list is a computer vs computer list, and shouldn't be
considered to reflect strength against people.  This has always been
true, and is more true as everyone figures out new ways to hose computer
opponents.

Maybe the SSDF shouldn't accept proprietary autoplayers.

>Proposal 1:  The universal SSDF openingbook
>------------------------------------------------------------------
>
>* Out of - let's say 500,000 - grandmaster games the SSDF makes a
>  universal openingbook.
>* This openingbook will not be published or made available to the
>  chess programmers.
>* All tested programs will have to use the universal book; all
>  testgames and matches will be played only with this book.

This discriminates against those who spend energy making good books.
The SSDF list is used as a comparison tool by customers.  It makes no
sense to give program A a rating of 2550 and program B a rating of 2520
if program B has a wonderful book, which is not considered if you use a
universal book, and program A has a really crappy book, which is also
not considered.  The customer would buy program A and it would play its
bad lines and they would be disappointed.

This would have the effect of reducing opening book innovation.  Those
with bad books can keep shipping them forever, and those with good books
have their rating artificially lowered.

This is all bad for the customer.

>* Each match consists of a predefined number of games, let's say 40.
>  All matches should consist of 40 played games.

Great.  I like the idea of long matches, because learners can learn
predictably, although it should be alright to exit the programs between
games, because that is what customers do.
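
As a side note on what 40 games buys you statistically -- a rough
sketch using assumed numbers of my own (independent games, roughly
equal opponents, a typical computer-vs-computer draw rate), not
anything from the SSDF -- even a 40-game match leaves a one-sigma
rating error on the order of 45 points, which is worth remembering
when reading small differences on the list:

  import math

  def match_error_elo(games, draw_rate=0.35):
      # Rough one-standard-deviation rating error for a match between
      # roughly equal opponents, assuming independent games.
      # Draws score 0.5 and add no variance; decisive games add 0.25
      # each around the 50% mean.
      variance = (1.0 - draw_rate) * 0.25
      se_score = math.sqrt(variance / games)
      # Near a 50% score, one percentage point is worth about 7 Elo.
      return se_score * 100 * 7

  print(round(match_error_elo(40)))   # roughly +/- 45 Elo after 40 games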

>* Doubles are not counted.

No way.  You are saying that a participant is not allowed to play
certain moves that are legal under the laws of chess, or else the game
is aborted.  This is not chess.

If I beat you, it is absolutely your responsibility to adapt your play.
If your program will not adapt its play, tough luck, you deserve to lose
rating points.

This is how humans play.  Kasparov played the Sicilian Dragon vs Anand
and did great.  He did the same thing again later in the match.  Nobody
cried foul about this; it was up to Anand to change his line if it
didn't work the first time.  Kasparov had an unsuccessful game against
the open Ruy Lopez (Spanish), and he changed his line and won in
spectacular fashion.  Nobody complained about this, either; it was up
to Anand to be prepared to deal with innovations when facing the same
line a second time.

If computers can't do the same thing, well, we have a ways to go, and
this is just as valid a field for research as "engine strength".

Learners are good for customers.  Factoring them out of the SSDF rating
list would tend to discourage innovation in this area, would reward
those who have failed to implement this code, and would punish those who
have spent time on it.

This would be bad for the customer.

>* The configuration should be exactly the same for both opponents. A
>  standard can be agreed upon, e.g. Pentium 200 MMX with 64 MByte of
>  memory.

Assuming they can get it, sure.  I don't see why someone should be able
to demand that they get a certain configuration, if others don't also
get the same configuration.

That Fritz performs better with bigger hash tables, and less well with
smaller ones, is a design decision made by the Fritz guys, and there
are negative tradeoffs involved in any design decision; for instance,
you might not get as much memory as you want.

>Under these conditions older programs can participate as well, as the
>results of improved openingbooks, bookkilling, and booklearners are
>ruled out. Although a disadvantage could be that the new programs -
>with learners - could learn which lines in the new book are suitable
>and which are not.

The new programs have an advantage over the old ones.  They are
stronger, since they have learning.  This should be reflected in the
rating list even though the advantage is more pronounced against dumb
programs than it is against humans.

The SSDF list doesn't compare expected performance against humans, it
compares players that are in a big pool together.

Saying that you can't do learning because the old ones can't do learning
is just as awful as saying that everyone has to run on a 386/20 with
4 MB of RAM because the old ones had to do this.

>One can argue that a universal openingbook violates the fact that the
>openingbook is a part of the chessprogram. I agree to that, but I want
>to point out that this testmethod is meant to compare the playing
>strength of chess programs, not opening books.

Who says?  You can't get to the top of the list, sit there for a while,
then complain that any innovation someone uses to kick you out of there
is unfair, and try to freeze the clock at the point where you were on
top.

I'm sorry this sounds so harsh, but we need to innovate in this field,
not declare any new innovation to be unfair and illegal.

Let us move on, not freeze the clock at 1996.

>Proposal 2:  50 openingpositions, each program plays one game with
>             White and one with Black in every openingposition.

This factors out opening book choices.  If my program doesn't play
closed positions well, and its book is well organized so that I avoid
them, that is a design decision made by me, and it will manifest in the
way the program performs if the whole program is allowed to be used.

The programs should be tested as a whole, since the customer buys the
whole program.

Ideally you want to play every opening well, but if there is some line
you know you need to avoid, you should be able to try to avoid it, and
it is up to your opponent to try to get you to play that line, not the
SSDF.

>This way we can provide a ratinglist that is free from influences
>caused by booktricks, giving a good example of the all-round playing
>strength of a chessprogram. Also, statistics can be compiled on which
>positions a program plays well and which it does not.

This last sentence is interesting, but if I know that my program will do
something dumb in one line, I should be able to try to avoid it.

This might be a good idea for a distinct list.

>I want to stress the fact that sometimes people forget what an
>enormous job our chess friends in Sweden are doing for the sake of
>computer chess. This is all done by volunteers, free of charge,
>devoting parts of their lives just to give people insight into the
>rankings of chess programs. A great reward should go to them, because
>nobody asked for this Elo list and still they are testing chess
>programs, for the benefit of programmers, chess enthusiasts and people
>who need information for buying a chess program.

Yes.

>In the meantime, however, it is also necessary to ask ourselves
>whether the recent developments are in the interest of computerchess
>or not. Maybe other testmethods are necessary. I would like to ask all
>people involved to think about this idea. All suggestions are welcome.

I think the labs full of autoplayers learning book lines are a very bad
thing, since what comes out of these is increased rating list
performance that will have little or no impact on the performance of the
program against the customer.

Opening book changes that are designed to defeat other programs also
have little bearing upon the customer, but it is hard to isolate these,
and as I have said a couple of times, you're not going to get an
accurate rating vs humans by testing against computers anyway.

I think that book learning, position learning, and other kinds of
learning are good for the customer, because they result in increased
performance against the customer.

bruce


