Author: Robert Hyatt
Date: 17:05:51 11/21/97
On November 21, 1997 at 19:29:40, Ed Schröder wrote:

>From: hyatt@crafty (Robert Hyatt)
>
>Posting taken from...
>
>>Newsgroups: rec.games.chess.computer
>
>>MLC (Mc@email.com) wrote:
>>: The information can be found at www.chessbase.com under their news section.
>>: They have all of the games in PGN format. The games were run on P233's.
>
>>: I have not looked at the PGN gamescores myself. I don't understand. If
>>: Fritz is this strong according to these results, why such a poor showing at
>>: Paris? If it is due to a poor opening book, why don't they make a good
>>: book?
>
>>: Chessbase says these positions show the true strength of a program, since it
>>: starts the game from equal positions. A test like this is objective, they
>>: say, since it eliminates bad books.
>
>>: Can someone knowledgeable go to their site and check it out, and explain to
>>: a novice like me? Is this true?
>
>>Here's the low-down "skinny":
>
>>1. Any program could do poorly at Paris. There were so many strong programs
>>that a bad result was within "one sigma" of what would be expected on any sort
>>of standard deviation based on observed results. So doing poorly can happen.
>>So can doing well. I.e. in Jakarta, Crafty was quite fortunate and finished
>>near the top, probably higher than it should really have been expected to do,
>>but again, this was within one sigma of what would be expected over several
>>thousand such competitions. The odds of a poor program winning are low.
>>However, the odds of the best program winning are also pretty low, when the
>>difference between "best" and the middle of the pack is not super-large.
>
>>2. Everyone tests program vs program differently. Ed, for example,
>>does not allow the same game to be played twice, which penalizes a
>>book learner that would repeat until the opponent learned how to not
>>lose that line. The SSDF ignores this issue, which lets book learners
>>do their thing, but which also allows someone to "cook" the book of
>>another program and take advantage.
>
>Partly true.
>
>With the Rebel8 testing, 'doubles' were not allowed.
>
>This year (for the first time) I allowed 'doubles' to measure the
>performance of the Rebel9 book learner.

I wasn't intending to be critical, of course. Just to show that trying to
eliminate one problem (repeatedly losing to a cooked book) can cause another
problem (failure to let a learner "learn"). Both approaches have problems and
advantages.

>>3. SSDF ratings seem to be affected by a couple of programmers knowing
>>something about the testing procedure. I.e. auto232 is not perfectly reliable
>>and hangs are not unexpected. So most testers start a game with an auto232
>>timeout of N (I don't know what they use, but let's take 50 minutes as an
>>example) so that if a program doesn't move after this long, the game is
>>aborted and not counted.
>
>SSDF has given me the N values they use and the ones they are using
>are good, so there is not such a problem.
>
>Furthermore, AUTO232 never crashes on my PCs.

I know of *one* program (not yours) that will search for over one hour, in a
40/2hr time control, if the *first* move out of book is below -1.5... I won't
mention names, but it is absolutely certain. If they don't use an N of 60
minutes or greater (actually 60 is too low) then this programmer has found a
way to eliminate losses due to choosing a poor book line. You could say this
is a way to attack a cooked-book problem, of course... Most everyone I have
talked to mentions an occasional hang for no reason, although I don't remember
specifics about opponents.
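To make the trick concrete, here is a rough sketch of what such a timing hack
might look like. Everything here (the function name, constants, and numbers)
is invented purely for illustration; it is not code taken from any real
program.

  # Hypothetical sketch (Python) of the timing trick described above.
  # All names and numbers are illustrative inventions.

  NORMAL_TARGET_SECONDS = 180        # a sane per-move target at 40/2hr
  EXPLOIT_THINK_SECONDS = 65 * 60    # longer than a 50- or 60-minute timeout

  def allocate_time(moves_out_of_book, root_eval):
      """Return how many seconds to search for the current move."""
      # On the very first move after leaving book, if the position already
      # looks lost, search long enough to overrun the autoplayer's timeout,
      # so the game is aborted instead of being counted as a loss.
      if moves_out_of_book == 1 and root_eval < -1.5:
          return EXPLOIT_THINK_SECONDS
      return NORMAL_TARGET_SECONDS

  # First move out of book with the eval at -2.0 pawns: 3900 seconds,
  # which blows past a 50- or 60-minute auto232 timeout.
  print(allocate_time(1, -2.0))
  # Any later move, or a healthier eval, gets the normal allocation.
  print(allocate_time(5, -2.0))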
>>Suppose you were to modify your program's timing algorithm so that
>>if, on the first move out of book (only on this move) the eval was
>>below (say) -1.5, you simply went into a "deep think" for an hour
>>or so? You could obviously justify this as trying to avoid losing
>>right out of book. However, thinking for one hour overruns the 50
>>minute timeout limit, the game is aborted, and you don't lose.
>
>This was indeed a rumor of last year.
>
>>I won't mention programs that do this, but have had this discussion
>>a couple of times, so apparently it does happen.
>
>After this rumor occurred I passed the information to SSDF with the
>question, of course, of whether this was true. They said no. But perhaps
>it's best if SSDF comment on this subject themselves.
>
>I think your information is wrong. However, if this is happening or
>has happened, this is clearly done to cheat the SSDF guys. They are
>too experienced, IMO, not to discover such dirty tricks.

It depends on how they test. This program *definitely* behaves as I have
described. If they use a time-out value of an hour or less, it happens. If
they use something like 2 hours, then they are safe.

>>But it means you'd have to have some idea of how the SSDF was
>>testing and what timeout interval they used. Is it dishonest?
>>Hard to say, since it would avoid a cooked book line. But it
>>would certainly mean that *we* could not reproduce the SSDF
>>results exactly, because when playing A vs B manually,
>>we wouldn't stop the game, as we could see that things were
>>not hung.
>
>>4. The Nunn test has been one of my favorite approaches for years. I have
>>run many Crafty vs X matches like this. The point is to pick a position after
>>N moves and then play X vs Y and then Y vs X from that same position. If
>>either X or Y wins both, that is a clear result. If white wins both, or if
>>black wins both, then the results cancel and we'd conclude that the position
>>was simply won for one side or the other. It takes the book out of the game,
>>and it takes "learning" out of the game. So programs that learn might do worse
>>here than they would in normal competition. Programs that don't learn might
>>do better here since there is never a chance for learning to affect the game.
>
>>5. It is, of course, possible to cherry-pick positions that would make any
>>program look better than any other. I don't believe this was done in the Nunn
>>test, but it is possible. I.e. if you pick enough random positions, you will
>>find some where your favorite wins against all programs. You keep these. You
>>will find some where your favorite loses every game, because there is something
>>about the opening it doesn't understand. You toss these out. The rest of
>>the positions are balanced more in your favor, since you canned the ones it
>>couldn't win against everyone, and it would be *possible* to hand-pick a set
>>of positions that your favorite likes better than any other program.
>
>>The bottom line: caveat emptor. Check out the SSDF results.
>>Then check out the ChessBase site. Then check out the Rebel site.
>>And the results are all conflicting. Because they might be measuring
>>slightly different things...
>
>I don't think so.
>
>First of all, the Rebel8 results of last year fitted perfectly with the
>SSDF results that came later.

I was looking at specific Rebel vs X scores which were not real close to the
SSDF totals in some cases. One notable case was Rebel vs MChess, although I
don't remember how badly they differed, but it seemed to be a direct result
of the book issue.
>I have to wait for the SSDF results for Rebel9 (they seem to come at the end
>of next week) and we will see whether the results again fit with the results
>on the Rebel Home Page.
>
>And finally, you can't compare the SSDF and the Rebel Home Page results
>with the Chessbase site, since the results on the Chessbase site are not
>based on a 40/2:00 tournament time control.
>
>- Ed Schroder -
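For what it's worth, the colour-reversed pair scoring described in point 4
above can be written down in a few lines. The sketch below is only an
illustration under my own assumptions; the function name and the result-string
representation are invented and are not how any particular tester actually
records games.

  # Scoring one Nunn-style position pair: engine A has White in game 1,
  # then the colours are swapped for game 2.

  def score_pair(result_a_white, result_b_white):
      """Score a colour-reversed pair of games from one starting position.

      result_a_white: PGN result of the game where A had White
                      ("1-0", "0-1", or "1/2-1/2")
      result_b_white: PGN result of the game where B had White
      Returns (points_for_A, points_for_B, note).
      """
      def points(result, had_white):
          if result == "1/2-1/2":
              return 0.5
          white_won = (result == "1-0")
          return 1.0 if white_won == had_white else 0.0

      a = points(result_a_white, True) + points(result_b_white, False)
      b = points(result_a_white, False) + points(result_b_white, True)

      if a == 2.0 or b == 2.0:
          note = "clear result: one program won with both colours"
      elif a == 1.0 and b == 1.0 and result_a_white != "1/2-1/2":
          note = "results cancel: the position is simply won for one colour"
      else:
          note = "mixed or drawish"
      return a, b, note

  # Example: A wins as White, the rematch with colours reversed is drawn,
  # so A scores 1.5 - 0.5 for this position pair.
  print(score_pair("1-0", "1/2-1/2"))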