Computer Chess Club Archives



Subject: Re: SSDF in rgcc

Author: Robert Hyatt

Date: 17:05:51 11/21/97



On November 21, 1997 at 19:29:40, Ed Schröder wrote:

>From: hyatt@crafty (Robert Hyatt)
>
>Posting taken from...
>
>>Newsgroups: rec.games.chess.computer
>
>
>>MLC (Mc@email.com) wrote:
>>: The information can be found at www.chessbase.com under their news section.
>>: They have all of the games in PGN format. The games were run on P233's.
>
>>: I have not looked at the PGN gamescores myself. I don't understand. IF
>>: Fritz is this strong according to these results, why such a poor showing at
>>: Paris? IF it is due to a poor opening book, why don't they make a good
>>: book?
>
>>: ChessBase says these positions show the true strength of a program since it
>>: starts the game in equal positions.  A test like this is objective, they
>>: say, since it eliminates bad books.
>
>>: Can someone knowledgeable go to their site and check it out, and explain to
>>: a novice like me? Is this true?
>
>>here's the low-down "skinny":
>
>>1.  Any program could do poorly at Paris.  There were so many strong programs
>>that a bad result was within "one sigma" of what would be expected on any sort
>>of standard deviation based on observed results.  So doing poorly can happen.
>>So can doing well.  IE in Jakarta, Crafty was quite fortunate and finished
>>near the top, probably higher than it should really have been expected to do,
>>but again, this was within one sigma of what would be expected over several
>>thousand such competitions.  The odds of a poor program winning are low.
>>However, the odds of the best program winning are also pretty low, when the
>>difference between "best" and the middle of the pack is not super-large.
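
A rough way to see the "one sigma" point is to treat each game as an
independent trial with some fixed expected score.  The numbers below are
purely illustrative (they are not taken from the Paris or Jakarta results),
and the model is a simplification, but it shows how large the normal swing
in a short event is:

import math

def score_sigma(games, expected_per_game=0.6, draw_rate=0.3):
    """Standard deviation of a total event score, treating each game as an
    independent trial.  expected_per_game is the mean score per game
    (win=1, draw=0.5, loss=0); draw_rate is the fraction of draws.
    Both inputs are illustrative assumptions, not measured values."""
    p_draw = draw_rate
    # Win/loss probabilities consistent with the assumed mean score.
    p_win = expected_per_game - 0.5 * p_draw
    p_loss = 1.0 - p_win - p_draw
    mean = p_win * 1.0 + p_draw * 0.5
    var_per_game = (p_win * (1.0 - mean) ** 2
                    + p_draw * (0.5 - mean) ** 2
                    + p_loss * (0.0 - mean) ** 2)
    return games * mean, math.sqrt(games * var_per_game)

# Over an 11-round event the one-sigma band is already more than a full
# point -- easily the gap between "near the top" and "middle of the pack".
mean, sigma = score_sigma(11)
print(f"expected score {mean:.1f} +/- {sigma:.1f} points over 11 games")
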
>
>
>>2.  Everyone tests program vs program differently.  Ed, for example,
>>does not allow the same game to be played twice, which penalizes a
>>book learner that would repeat until the opponent learned how not to
>>lose that line.  The SSDF ignores this issue, which lets book learners
>>do their thing, but which also allows someone to "cook" the book of
>>another program and take advantage.
>
>Partly true.
>
>With the Rebel8 testing, 'doubles' were not allowed.
>
>This year (for the first time) I allowed 'doubles' to measure the
>performance of the Rebel9 book learner.
>
>

I wasn't intending to be critical, of course.  I just wanted to show that
trying to eliminate one problem (repeatedly losing to a cooked book) can
cause another problem (failure to let a learner "learn").  Both approaches
have advantages and drawbacks.
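
As an aside, the "no doubles" rule is easy to state as a filter over
finished games.  This is only a sketch under the assumption that each game
is available as its move list; the function names are made up for
illustration, not taken from any tester's actual tools:

def game_key(moves):
    """Collapse a game to a hashable key: its full move sequence.
    A stricter tester might key on only the first N moves (the book line)."""
    return tuple(moves)

def filter_doubles(games):
    """Drop exact repeats, as in the Rebel8-era testing described above.
    Counting the repeats instead (Rebel9-era) is what lets a book learner
    show its learning in the results."""
    seen = set()
    kept = []
    for moves in games:
        key = game_key(moves)
        if key in seen:
            continue          # a "double": same game already counted
        seen.add(key)
        kept.append(moves)
    return kept

# Example: the second copy of the same game is not counted.
games = [["e4", "e5", "Nf3"], ["d4", "d5"], ["e4", "e5", "Nf3"]]
print(len(filter_doubles(games)))   # -> 2
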



>>3.  SSDF ratings seem to be affected by a couple of programmers knowing
>>something about the testing procedure.  IE auto232 is not perfectly reliable
>>and hangs are not unexpected.  So most testers start a game with an auto232
>>timeout of N (I don't know what they use but let's take 50 minutes as an
>>example) so that if a program doesn't move after this long, the game is
>>aborted and not counted.
>
>SSDF has given me the N values they use, and the values they are using
>are good, so there is no such problem.
>
>Furthermore, AUTO232 never crashes on my PCs.


I know of *one* program (not yours) that will search for over one
hour, at a 40/2hr time control, if the evaluation of the *first* move out
of book is below -1.5...  I won't mention names, but it is absolutely
certain.  If they don't use an N of 60 minutes or greater (actually even
60 is too low), then this programmer has found a way to eliminate losses
due to choosing a poor book line.  You could say this is a way to attack
the cooked-book problem, of course...

Almost everyone I have talked to mentions an occasional hang for no
reason...  although I don't remember specifics about opponents.

>
>
>>Suppose you were to modify your program's timing algorithm so that
>>if, on the first move out of book (only on this move) the eval was
>>below (say) -1.5, you simply went into a "deep think" for an hour
>>or so?  You could obviously justify this as trying to avoid losing
>>right out of book.  However, thinking for one hour overruns the 50
>>minute timeout limit, the game is aborted, and you don't lose.
>
>This was indeed a rumor of last year.
>
>
>>I won't mention programs that do this, but have had this discussion
>>a couple of times, so apparently it does happen.
>
>After this rumor surfaced I passed the information to the SSDF and of
>course asked whether it was true. They said no. But perhaps it's
>best if the SSDF comment on this subject themselves.
>
>I think your information is wrong. However, if this is happening or
>has happened, it is clearly done to cheat the SSDF guys. They are
>too experienced IMO not to discover such dirty tricks.

It depends on how they test.  This program *definitely* behaves as I
have described.  If they use a time-out value of an hour or less, it
happens.  If they use something like 2 hours, then they are safe.
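
To make the trick concrete, here is a minimal sketch of the kind of
time-allocation tweak being described.  Everything in it is hypothetical
(the names and the 65-minute figure are mine, chosen only to overrun a
timeout of an hour or less); it is not the actual code of any program:

def allocate_time(normal_budget_s, first_move_out_of_book, eval_pawns,
                  panic_budget_s=65 * 60):
    """Hypothetical time allocation for the behavior described above.

    normal_budget_s: the engine's usual target time for this move (seconds).
    first_move_out_of_book: True only on the first move after leaving book.
    eval_pawns: the root evaluation in pawns from the engine's point of view.
    panic_budget_s: a "deep think" long enough to overrun an auto232 timeout
    of 60 minutes or less, so the game is aborted rather than lost.
    """
    if first_move_out_of_book and eval_pawns < -1.5:
        return panic_budget_s
    return normal_budget_s

# At 40 moves in 2 hours a normal budget is roughly 3 minutes per move;
# with a bad book exit the "budget" jumps past the tester's timeout.
print(allocate_time(180, True, -2.0))   # -> 3900 seconds (65 minutes)
print(allocate_time(180, True, -0.3))   # -> 180 seconds
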


>
>
>>But it means you'd have to have some idea of how the SSDF was
>>testing and what timeout interval they used.  Is it dishonest?
>>Hard to say, since it would avoid a cooked book line.  But it
>>would certainly mean that *we* could not reproduce the SSDF
>>results exactly, because when playing A vs B manually,
>>we wouldn't stop the game, as we could see that things were
>>not hung.
>
>
>>4.  The Nunn test has been one of my favorite approaches for years.  I have
>>run many Crafty vs X matches like this.  The point is to pick a position after
>>N moves and then play X vs Y and then Y vs X from that same position.  If
>>either X or Y wins both, that is a clear result.  If white wins both, or if
>>black wins both, then the results cancel and we'd conclude that the position
>>was simply won for one side or another.  It takes the book out of the game,
>>it takes "learning" out of the game.  So programs that learn might do worse
>>here than they would in normal competition.  Programs that don't learn might
>>do better here since there is never a chance for learning to affect the game.
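
A small sketch of how the two games from one such position pair up,
following the reasoning in point 4 above.  The scores are from X's point
of view (1 = win, 0.5 = draw, 0 = loss), and the function is illustrative
only:

def classify_pair(x_as_white, x_as_black):
    """Classify a Nunn-test pair played from one starting position.

    x_as_white / x_as_black: X's score (1, 0.5, 0) in the game where X had
    white and the game where X had black, respectively.
    """
    if x_as_white == 1 and x_as_black == 1:
        return "clear result for X"
    if x_as_white == 0 and x_as_black == 0:
        return "clear result for Y"
    if x_as_white == 1 and x_as_black == 0:
        return "white won both: position favors white, results cancel"
    if x_as_white == 0 and x_as_black == 1:
        return "black won both: position favors black, results cancel"
    return "mixed result (at least one draw)"

print(classify_pair(1, 1))    # X beat Y with both colors
print(classify_pair(1, 0))    # white won both games from this position
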
>
>
>>5.  It is, of course, possible to cherry-pick positions that would make any
>>program look better than any other.  I don't believe this was done in the Nunn
>>test, but it is possible.  IE if you pick enough random positions, you will
>>find some where your favorite wins against all programs.  You keep these.  You
>>will find some where your favorite loses every game, because there is something
>>about the opening it doesn't understand.  You toss these out.  The rest of
>>the positions are balanced more in your favor since you canned the ones it
>>couldn't win against everyone, and it would be *possible* to hand-pick a set
>>of positions that your favorite likes better than any other program.
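
The hand-picking described in point 5 amounts to a simple filter.  A
deliberately simplified sketch with made-up data, just to show why the
surviving set of positions is no longer neutral:

def cherry_pick(positions, results_for_favorite):
    """Keep only positions the favorite did not lose across the board.

    results_for_favorite maps a position id to the favorite's score against
    the whole field (0.0 .. 1.0).  Tossing the 0% positions and keeping the
    rest is the hand-picking described above: the remaining set leans toward
    openings the favorite already understands.
    """
    return [p for p in positions if results_for_favorite[p] > 0.0]

# Made-up illustration: position "c" is the one the favorite never wins.
scores = {"a": 1.0, "b": 0.5, "c": 0.0, "d": 0.75}
print(cherry_pick(list(scores), scores))   # -> ['a', 'b', 'd']
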
>
>
>>The bottom line:  caveat emptor.  Check out the SSDF results.
>>Then check out the ChessBase site.  Then check out the Rebel site.
>>And the results are all conflicting.  Because they might be measuring
>>slightly different things...
>
>I don't think so.
>
>First of all, the Rebel8 results of last year fit perfectly with the
>SSDF results that came later.

I was looking at specific Rebel vs X scores which were not very close to
the SSDF totals in some cases.  One notable case was Rebel vs MChess;
I don't remember how badly they differed, but it seemed to be a direct
result of the book issue.


>
>I have to wait for the SSDF results for Rebel9 (they seem to be coming at
>the end of next week) and we will see if the results again fit with the
>results on the Rebel Home Page.
>
>And finally, you can't compare the SSDF and the Rebel Home Page results
>with the ChessBase site since the results on the ChessBase site are not
>based on the 40/2:00 tournament time control.
>
>- Ed Schroder -


