Author: Mike Byrne
Date: 09:53:10 10/24/04
On October 24, 2004 at 07:28:56, Ed Schröder wrote:

>I have put this article on my website for discussion and sharing information.
>
>http://members.home.nl/matador/testing.htm
>
>Ed
>
>====================
>
>Adventures with Fritz
>
>This is an article about testing and some of the problems I encountered during
>engine-engine matches using FRITZ as base software. I consider this article a
>must-read for users who like to play these engine-engine matches with Pro Deo.
>It may also be important to my fellow chess programmers, because I don't know
>whether these problems also occur when they test their own engines.
>
>This article will be put on the CCC discussion board in the hope of creating
>awareness, receiving useful comments, and asking other testers and chess
>programmers to either confirm or deny the problem listed below.
>
>--------------------------------------------------------------------------------
>
>Methodology
>
>For 4-5 years I have been using the eng-eng match technique as the final step
>to test the changes I make. During the first 3-4 years the eng-eng testing was
>done under the REBEL DOS interface, but this testing was limited because the
>engine could only play against itself. Once I had made my engine available to
>run under other interfaces, I thought it would be an improvement to move to a
>different eng-eng testing environment that allowed me to test against more
>opponents.
>
>From the alternatives I chose the FRITZ software, mainly because of its
>user-friendly eng-eng match software. I created a set of 100 balanced opening
>positions and 4 fixed sparring engines (Fritz8, Shredder7, Junior8 and Hiarcs8)
>and let them play on 4 PCs at various levels, each producing 200 games, thus
>4x200 = 800 games in total.
>
>Testing is done without any learning activated, no opening books, same hash
>table size, same engine parameters, meaning: exclude all randomness that
>could possibly influence the progress of a game. Re-running the test should
>simply produce an equal result or something very close.
>
>This procedure was repeated several times to ensure its reliability, and
>without exception all of the replayed 800-game matches produced an acceptable
>error margin between -1% and +1%. It seemed the system was working and I had
>created a reliable testing environment: make a program change, then run the
>800-game eng-eng match to see if it produces a higher match score. So far so
>good.
>
>--------------------------------------------------------------------------------
>
>Problems
>
>Over time I noticed something odd: the match results against Shredder7 and
>Junior8 went down considerably while the match score against Hiarcs8 went up,
>also considerably, all of this as a pattern. This pattern remained so constant
>that it made me suspicious, so I ran the initial match again and there it was:
>it produced a -3% match result, meaning a loss of 20 elo points for no good
>reason. My test environment was not reliable anymore. Houston, there is a
>problem.
>
>I double-checked all the settings I was using that could explain this sudden
>fluctuation in score and found none. All the conditions were the same, until I
>noticed there had been one seemingly unimportant change after all: at a certain
>moment I had set the main engine (the one that is loaded at program start) on
>all 4 PCs to FRITZ8.
>
>I couldn't believe this change could make any difference at all, or else it
>would mean that 1 or 2 of the engines are not loaded correctly, meaning
>entering the world of bugs. I decided to find out nevertheless; after all, I
>had no other clue than this.
>
>--------------------------------------------------------------------------------
>
>The experiment
>
>I took an older version (Rebel 12.00.01) and ran 3 identical 4x200 = 800 game
>test matches (time control 40/5) with the following exception:
>
>Match-1, FRITZ8 loaded at program start.
>Match-2, own engine loaded at program start (Shredder loaded with
>         Shredder, Junior with Junior, etc.)
>Match-3, Pro Deo loaded at program start.
>
>It should produce match scores within an error margin of -1% to +1%, or else
>something serious is wrong with the testing technique itself, which is either
>related to bugs or to the fact that 800 games is still not enough to ensure a
>-1% to +1% error margin. The results are telling and leave no room for
>speculation: there is something wrong with the testing environment.
>
>    Match-1, FRITZ8 loaded at program start        38.1%
>    Match-2, own engine loaded at program start    40.8%
>    Match-3, Pro Deo loaded at program start       42.8%
>
>An unbelievable and unacceptable difference of 4.7%, which corresponds to an
>elo difference of more than 30 elo points, depending on which engine is loaded
>at program start.
>
>--------------------------------------------------------------------------------
>
>Where to go from here?
>
>It's tempting to advise users to have Pro Deo loaded at program start all the
>time (eng-eng and auto232) to ensure the best results, but somehow this is an
>unsatisfactory thing to say. It's more constructive to start searching for the
>reasons behind it and to look for watertight solutions, hence I put this
>article on the CCC forum for discussion. It would be interesting for me to
>hear the experiences of fellow programmers and testers; maybe things are
>entirely Pro Deo related after all.
>
>My conclusion so far is that I could not find any satisfactory explanation why
>the Hiarcs8, Junior8 and Fritz8 match scores fluctuate so much. There is a
>possible reason for Shredder7: its settings are not correctly remembered, and
>from time to time Shredder7 uses "position learning" after all, even though it
>is turned off. Other engines have this (settings) problem as well. For
>instance, Chess Tiger 15 starts with the Gambit style as its default setting;
>when you change it to Normal, then exit and restart the program, the Gambit
>style is active again.
>
>--------------------------------------------------------------------------------
>
>More ChessBase oddities
>
>Here are other ChessBase oddities that are NOT related to this topic (which
>engine is loaded at program start) but serve as general hints for accurate
>testing. The oddities listed below are easy to overcome.
>This article is based on my experiences with the Fritz7 interface; the Fritz8
>interface might be a different story. My preference for the Fritz7 interface is
>mainly because Fritz8 doesn't save Pro Deo's current personality correctly,
>while Fritz7 does.
>
>Sometimes Fritz7 starts with the wrong Pro Deo personality. While the
>WB2UCI.ENG adaptor clearly states to use engine_X, the Fritz7 interface
>ignores this and starts another engine. The problem occurs about 10% of the
>time. I have no idea if this problem still exists in the Fritz8 interface. The
>cure is to exit and restart Fritz7. So always check the param.txt file to see
>if a match is initialized correctly; see the Pro Deo FAQ for details.

You investigated this in much greater detail than I have, but I have always surmised that engine vs. engine matches under the ChessBase/Fritz GUI are somehow flawed when using the wb2uci adapter.
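For anyone who wants to check the arithmetic in the quoted article, here is a rough sketch in Python. It converts a match score to an Elo difference and estimates the statistical noise to expect from an 800-game match. It assumes the usual logistic Elo formula and treats games as independent, which matches with fixed openings and no learning may not satisfy. The function names and the 30% draw ratio are my own illustrative choices, not anything taken from the article or from any of the engines.

```python
import math

def elo_diff(score):
    """Elo difference implied by a match score (0 < score < 1).

    Assumes the standard logistic Elo model; negative means the tested
    engine is weaker than its opponents.
    """
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)

def score_std_error(score, games, draw_ratio=0.0):
    """One-sigma standard error of a match score over `games` games.

    Assumes independent games scored 1 / 0.5 / 0. With overall score s and
    draw ratio d, the per-game variance is s * (1 - s) - d / 4.
    """
    variance = score * (1.0 - score) - draw_ratio / 4.0
    if variance < 0.0:
        raise ValueError("draw_ratio is not compatible with this score")
    return math.sqrt(variance / games)

# Match scores quoted in the article (Match-1 and Match-3).
low, high = 0.381, 0.428
print("Elo gap between Match-1 and Match-3:", elo_diff(high) - elo_diff(low))

# Noise of one 800-game match at a 40% score.
print("std error, no draws:  ", score_std_error(0.40, 800))
# Hypothetical draw ratio of 30%, purely for illustration.
print("std error, 30% draws: ", score_std_error(0.40, 800, draw_ratio=0.30))
```

Under these assumptions the first figure can be compared with the "more than 30 elo points" in the article, and the standard error gives a feel for whether a -1% to +1% margin is realistic for 800 games. Since the test setup is meant to be deterministic, the independence assumption is the weak point, so take the noise estimate as a ballpark only.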