Author: Ed Schröder
Date: 04:28:56 10/24/04
I have put this article on my website for discussion and sharing information.
http://members.home.nl/matador/testing.htm
Ed
====================
Adventures with Fritz
This is an article about testing and some of the problems I encountered during
engine-engine matches using FRITZ as base software. I believe this article is a
must-read for users who like to play these engine-engine matches with Pro Deo.
It may also be important to my fellow chess programmers, because I don't know
whether these problems also occur when they test their own engines.
This article will be posted on the CCC discussion board in the hope of creating
awareness, receiving useful comments, and asking other testers and chess
programmers to either confirm or deny the problem listed below.
--------------------------------------------------------------------------------
Methodology
For the past 4-5 years I have used the eng-eng match technique as the final step
in testing the changes I make. During the first 3-4 years the eng-eng testing
was done under the REBEL DOS interface, but this testing was limited because the
engine could only play against itself. Once I had made my engine available to
run under other interfaces, I thought it would be an improvement to move to a
different eng-eng testing environment that allowed me to test against more
opponents.
From the alternatives I chose the FRITZ software, mainly because of its
user-friendly eng-eng match software. I created a set of 100 balanced opening
positions and 4 fixed sparring engines (Fritz8, Shredder7, Junior8 and Hiarcs8)
and let them play on 4 PCs at various levels, each producing 200 games, thus
4x200 = 800 games in total.
Testing is done without any learning activated, without opening books, with the
same hash table size and the same engine parameters, in other words: excluding
all randomness that might influence the course of a game. Re-running the test
should simply produce an equal result or something very close.
This procedure was repeated several times to ensure its reliability, and without
exception all of the replayed 800-game matches stayed within an acceptable error
margin of -1% to +1%. It seemed the system was working and I had created a
reliable testing environment: make a program change, then run the 800-game
eng-eng match to see whether it produces a higher match score. So far so good.
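As a rough reference point for the -1% to +1% margin, the sketch below estimates the standard error of an 800-game match score. It assumes the games are independent trials, which may not hold for a setup that deliberately removes randomness. The function name, the 40% example score and the 30% draw ratio are illustrative assumptions, not figures from the test runs.

```python
import math

def score_standard_error(score, games, draw_ratio=0.0):
    """Standard error of a match score, treating games as independent trials.

    score      -- overall score as a fraction (0.40 for 40%)
    games      -- number of games played
    draw_ratio -- fraction of drawn games (hypothetical input)
    """
    wins = score - draw_ratio / 2.0
    losses = 1.0 - wins - draw_ratio
    if games <= 0 or wins < 0 or losses < 0:
        raise ValueError("inconsistent score, draw ratio or game count")
    # Per-game variance of a result worth 1 (win), 0.5 (draw) or 0 (loss).
    variance = max(wins + 0.25 * draw_ratio - score ** 2, 0.0)
    return math.sqrt(variance / games)

# Assumed example: a 40% score over 800 games, with and without draws.
for draws in (0.0, 0.3):
    se = score_standard_error(0.40, 800, draws)
    print(f"draw ratio {draws:.0%}: standard error {se:.2%}")
```

A higher draw ratio lowers the per-game variance, so the no-draw case is the pessimistic bound under this model.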
--------------------------------------------------------------------------------
Problems
Over time I noticed something odd: the match results against Shredder7 and
Junior8 went down considerably while the match score against Hiarcs8 went up,
also considerably, all of this as a pattern. This pattern remained so constant
that it made me suspicious, so I ran the initial match again and there it was:
it produced a -3% match result, meaning a loss of about 20 elo points for no
good reason. My test environment was not reliable anymore. Houston, there is a
problem.
I double-checked all the settings that could explain this sudden fluctuation in
score and found nothing; all the conditions were the same. Then I noticed there
had been one seemingly unimportant change after all: at a certain moment I had
set the main engine (the one that is loaded at program start) on all 4 PCs to
FRITZ8.
I couldn't believe this change could make any difference at all, since it would
mean that 1 or 2 of the engines are not loaded correctly, which means entering
the world of bugs. I decided to find out nevertheless; after all, I had no other
clue than this.
--------------------------------------------------------------------------------
The experiment
I took an older version (Rebel 12.00.01) and ran 3 identical 4x200=800 game
test matches (time control 40/5) with one difference between them:
Match-1, FRITZ8 loaded at program start.
Match-2, own engine loaded at program start (Shredder loaded with
Shredder, Junior with Junior, etc.)
Match-3, Pro Deo loaded at program start.
The three matches should produce scores within an error margin of -1% to +1% of
each other; otherwise something serious is wrong with the testing technique
itself, caused either by bugs or by the fact that 800 games are still not enough
to ensure a -1% to +1% error margin. The results are telling and leave no room
for speculation: there is something wrong with the testing environment.
Match-1, FRITZ8 loaded at program start      38.1%
Match-2, own engine loaded at program start  40.8%
Match-3, Pro Deo loaded at program start     42.8%
An unbelievable and unacceptable difference of 4.7%, which corresponds to a
difference of more than 30 elo points, depending only on which engine is loaded
at program start.
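The conversion from match score to elo can be sketched as follows. The article does not say which formula was used, so this assumes the standard logistic Elo expectancy curve; the function and variable names are illustrative. The three percentages are the match results listed above.

```python
import math

def elo_from_score(score):
    """Elo difference implied by a match score (fraction between 0 and 1).

    Assumes the logistic Elo model: score = 1 / (1 + 10 ** (-diff / 400)).
    """
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)

# Match scores from the three runs, keyed by the engine loaded at program start.
match_scores = {
    "FRITZ8": 0.381,
    "own engine": 0.408,
    "Pro Deo": 0.428,
}

elo = {name: elo_from_score(s) for name, s in match_scores.items()}
for name, diff in elo.items():
    print(f"{name:<12} {match_scores[name]:.1%}  {diff:+.0f} elo")

# Gap between the best and the worst run.
spread = max(elo.values()) - min(elo.values())
print(f"spread: {spread:.0f} elo")
```

Under this model the gap between the 38.1% and 42.8% runs comes out a little above 30 elo points, in line with the figure quoted above.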
--------------------------------------------------------------------------------
Where to go from here?
It's tempting to advise users to have Pro Deo loaded at program start all the
time (eng-eng and auto232) to ensure the best results, but somehow this is an
unsatisfactory thing to say. It's more constructive to start searching for the
underlying reasons and to look for watertight solutions, hence I put this
article on the CCC forum for discussion. I would be interested to hear the
experiences of fellow programmers and testers; maybe things are entirely Pro Deo
related after all.
My conclusion so far is that I could not find any satisfactory explanation why
the Hiarcs8, Junior8 and Fritz8 match scores fluctuate so much. There is a
possible reason for Shredder7: its settings are not remembered correctly, and
from time to time Shredder7 uses "position learning" after all, even though it
is turned off. Other engines have this settings problem as well; for instance,
Chess Tiger 15 starts with the Gambit style as its default setting, and when you
change it to Normal, exit and restart the program, the Gambit style is active
again.
--------------------------------------------------------------------------------
More ChessBase oddities
The following ChessBase oddities are NOT related to this topic (which engine is
loaded at program start) but are general hints for accurate testing. The
oddities listed below are easy to overcome.
This article is based on my experiences with the Fritz7 interface; the Fritz8
interface might be a different story. I prefer the Fritz7 interface mainly
because Fritz8 doesn't save Pro Deo's current personality correctly, while
Fritz7 does.
Sometimes Fritz7 starts with the wrong Pro Deo personality. While the WB2UCI.ENG
adaptor clearly states to use engine_X, the Fritz7 interface ignores this and
starts another engine. The problem occurs about 10% of the time. I have no idea
whether this problem still exists in the Fritz8 interface. The cure is to exit
and restart Fritz7. So always check the param.txt file to see whether a match is
initialized correctly; see the Pro Deo FAQ for details.