Author: Ed Schröder
Date: 16:29:40 11/21/97
From: hyatt@crafty (Robert Hyatt)
Posting taken from...

>Newsgroups: rec.games.chess.computer

>MLC (Mc@email.com) wrote:
>: The information can be found at www.chessbase.com under their news section. They have all of the games in PGN format. The games were run on P233s. I have not looked at the PGN gamescores myself. I don't understand. IF Fritz is this strong according to these results, why such a poor showing at Paris? IF it is due to a poor opening book, why don't they make a good book?
>: Chessbase says these positions show the true strength of a program since it starts the game from equal positions. A test like this is objective, they say, since it eliminates bad books.
>: Can someone knowledgeable go to their site and check it out, and explain to a novice like me? Is this true?

>Here's the low-down "skinny":

>1. Any program could do poorly at Paris. There were so many strong programs that a bad result was within "one sigma" of what would be expected on any sort of standard deviation based on observed results. So doing poorly can happen. So can doing well. IE in Jakarta, Crafty was quite fortunate and finished near the top, probably higher than it really should have been expected to, but again, this was within one sigma of what would be expected over several thousand such competitions. The odds of a poor program winning are low. However, the odds of the best program winning are also pretty low when the difference between "best" and the middle of the pack is not super-large.

>2. Everyone tests program vs program differently. Ed, for example, does not allow the same game to be played twice, which penalizes a book learner that would repeat a line until the opponent learned how not to lose that line. The SSDF ignores this issue, which lets book learners do their thing, but which also allows someone to "cook" the book of another program and take advantage.

Partly true. With the Rebel8 testing 'doubles' were not allowed. This year (for the first time) I allowed 'doubles' to measure the performance of the Rebel9 book learner.

>3. SSDF ratings seem to be affected by a couple of programmers knowing something about the testing procedure. IE auto232 is not perfectly reliable and hangs are not unexpected. So most testers start a game with an auto232 timeout of N (I don't know what they use, but let's take 50 minutes as an example) so that if a program doesn't move after this long, the game is aborted and not counted.

The SSDF has given me the N values they use, and the ones they are using are good, so there is no such problem. Furthermore, AUTO232 never crashes on my PCs.

>Suppose you were to modify your program's timing algorithm so that if, on the first move out of book (only on this move) the eval was below (say) -1.5, you simply went into a "deep think" for an hour or so? You could obviously justify this as trying to avoid losing right out of book. However, thinking for one hour overruns the 50-minute timeout limit, the game is aborted, and you don't lose.

This was indeed a rumor of last year.

>I won't mention programs that do this, but I have had this discussion a couple of times, so apparently it does happen.

After this rumor occurred I passed the information on to the SSDF, with the question of course whether it was true. They said no. But perhaps it's best if the SSDF comment on this subject themselves. I think your information is wrong. However, if this is happening or has happened, it is clearly done to cheat the SSDF guys.
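[Editor's note: to make the trick described above concrete, here is a minimal sketch of how such a doctored time-allocation routine might look. The -1.5 pawn threshold, the one-hour think, and all names below are assumptions made purely for illustration; this is not code from any actual program.]

/* Sketch only: a "deep think" on the first move out of book, sized so
   that an auto232 game with a fixed abort timeout is thrown out rather
   than lost.  All constants and names are illustrative assumptions. */

#include <stdio.h>

#define NORMAL_TIME_MS   150000   /* ~2.5 minutes per move at 40/2:00            */
#define DEEP_THINK_MS   3600000   /* one hour: overruns a 50-minute abort limit  */
#define PANIC_EVAL_CP      -150   /* -1.5 pawns, expressed in centipawns         */

/* Return the time budget in milliseconds for the move about to be searched. */
static int allocate_time(int first_move_out_of_book, int root_eval_cp)
{
    if (first_move_out_of_book && root_eval_cp < PANIC_EVAL_CP) {
        /* Book line looks lost: think long enough that the game is
           aborted (and not counted) instead of being lost. */
        return DEEP_THINK_MS;
    }
    return NORMAL_TIME_MS;
}

int main(void)
{
    printf("out of book, eval -2.0: %d ms\n", allocate_time(1, -200));
    printf("normal move, eval -2.0: %d ms\n", allocate_time(0, -200));
    return 0;
}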
They are too experienced IMO not to discover such dirty tricks.

>But it means you'd have to have some idea of how the SSDF was testing and what timeout interval they used. Is it dishonest? Hard to say, since it would avoid a cooked book line. But it would certainly mean that *we* could not reproduce the SSDF results exactly, because when playing A vs B manually we wouldn't stop the game, as we could see that things were not hung.

>4. The Nunn test has been one of my favorite approaches for years. I have run many Crafty vs X matches like this. The point is to pick a position after N moves and then play X vs Y and then Y vs X from that same position. If either X or Y wins both, that is a clear result. If white wins both, or if black wins both, then the results cancel and we'd conclude that the position was simply won for one side or the other. It takes the book out of the game, and it takes "learning" out of the game. So programs that learn might do worse here than they would in normal competition. Programs that don't learn might do better here since there is never a chance for learning to affect the game.

>5. It is, of course, possible to cherry-pick positions that would make any program look better than any other. I don't believe this was done in the Nunn test, but it is possible. IE if you pick enough random positions, you will find some where your favorite wins against all programs. You keep these. You will find some where your favorite loses every game, because there is something about the opening it doesn't understand. You toss these out. The rest of the positions are balanced more in your favor, since you canned the ones it couldn't win against everyone, and it would be *possible* to hand-pick a set of positions that your favorite likes better than any other program.

>The bottom line: caveat emptor. Check out the SSDF results. Then check out the ChessBase site. Then check out the Rebel site. And the results are all conflicting. Because they might be measuring slightly different things...

I don't think so. First of all, the Rebel8 results of last year fitted perfectly with the SSDF results that came later. I have to wait for the SSDF results for Rebel9 (they seem to come at the end of next week) and we will see if the results again fit with the results on the Rebel Home Page. And finally, you can't compare the SSDF and the Rebel Home Page results with the Chessbase site, since the results on the Chessbase site are not based on the 40/2:00 tournament time control.

- Ed Schroder -
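[Editor's note: the paired-game scoring rule described in point 4 above can be summarized in a small sketch. The result encoding and function names are assumptions made for illustration only.]

/* Sketch of the Nunn-test scoring rule: each test position is played twice
   with colors reversed, and only a sweep of both games counts as a clear
   result for one engine; two wins for the same color cancel out. */

#include <stdio.h>

typedef enum { WHITE_WIN, BLACK_WIN, DRAW } GameResult;

/* game1: engine X had white; game2: engine X had black (same start position). */
static const char *judge_pair(GameResult game1, GameResult game2)
{
    if (game1 == WHITE_WIN && game2 == BLACK_WIN)
        return "clear result: X wins the pair";
    if (game1 == BLACK_WIN && game2 == WHITE_WIN)
        return "clear result: Y wins the pair";
    if (game1 == WHITE_WIN && game2 == WHITE_WIN)
        return "white won both: position favors white, results cancel";
    if (game1 == BLACK_WIN && game2 == BLACK_WIN)
        return "black won both: position favors black, results cancel";
    return "no clear result (a draw is involved)";
}

int main(void)
{
    /* Example: X wins with white, then wins again with black. */
    printf("%s\n", judge_pair(WHITE_WIN, BLACK_WIN));
    return 0;
}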