Computer Chess Club Archives


Subject: Test Environment Integrity

Author: Keith Kitson

Date: 04:49:40 04/18/99


To all Chess Computer Enthusiasts,

For some time now I have been watching the various testing environments and
contemplating the amount of information used to describe the conditions under
which chess programs are tested against other programs.  I now feel it is time
to consider the variances that can affect the results of any chess program yet
remain unspecified.

We are now at the stage where smaller increases in chess program strength are
being achieved, as programmers work along a learning curve that is growing
shallower.

These small increases in strength can be negated by variations in hardware and
operating system much more easily than in the past.  This could mean that a
program performs worse than its predecessor because the hardware/operating
system combination used cancels out the benefits of code improvements.  But as
the categories of hardware are amalgamated (or grouped for convenience) and no
operating system details are quoted, the inferior hardware results are hidden
in the general category available to record results for any specific chess
program.
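
To put rough numbers on this (my own back-of-envelope arithmetic, not measured
data): a figure often quoted for programs of this era is that a doubling of
search speed is worth somewhere in the region of 50-70 Elo points.  On that
assumption, even a small speed loss from an inferior hardware/operating system
combination is already comparable to the size of a typical version-to-version
improvement:

    import math

    # Back-of-envelope sketch.  ELO_PER_DOUBLING is an assumed
    # rule-of-thumb figure (commonly quoted as 50-70), not a measured
    # constant.
    ELO_PER_DOUBLING = 60

    def elo_loss(speed_fraction):
        """Approximate Elo lost when running at a fraction of full speed."""
        return -ELO_PER_DOUBLING * math.log2(speed_fraction)

    for frac in (0.95, 0.90, 0.80):
        print(f"running at {frac:.0%} speed ~ {elo_loss(frac):.0f} Elo lost")

    # running at 95% speed ~ 4 Elo lost
    # running at 90% speed ~ 9 Elo lost
    # running at 80% speed ~ 19 Elo lost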

With this in mind I feel we need to be far more precise in specifying the
environments under which chess program results are quoted; otherwise we are in
danger of quoting some program results obtained under inferior environments
alongside those from superior environments, and losing the differences forever
by incorporating them into the same category of results.

I can understand the need for chess media editors to keep things simple, which
helps with layout, reduces the number of pages, etc., but I reckon we have
reached a point at which we need to be more specific in these areas.  If we are
not, then there is a greater danger that hard-earned small program code
improvements do not show their true potential in the marketplace, and code
authors become disillusioned and give up the upgrading process altogether.
This would be a sad loss for chess programming in general.

To illustrate my point further: a number of years ago the Novag Super Forte was
showing poorer results in some modes in the Swedish lists (if my memory serves
me well it was to do with the setting of the number of ply that the operator
could set for the selective search).  Once this was determined it was decided
to discount (separate?) the weaker mode from the results list, as it was
clearly inferior and not representative of the stronger modes.  This problem
came about because results from varying modes, and variances in program
settings, were rolled into one category of results (pulling the better results
obtained in the stronger mode down towards the worse results), when in fact we
had at least two clear categories that results should have been recorded
under, one clearly inferior to the other.

I think we have a parallel now: due to the smaller increases in program
strength, we have sufficient variations in hardware, operating systems, and
chess program internal settings being rolled into one category for reporting
of results that the poorer results drag the better results down, to a point
where code improvements are all but negated.

What is perhaps more worrying is that the great majority may not be aware of
the effect that a change in operating environment can have on the playing
ability of any specific chess program, for any given fixed setup for that
program.

So results are being submitted in good faith for programs with the following
details largely unquoted:

Choice of Opening Book
Which opening book is being used?  (A massive 7.7Mb book is not necessarily
stronger - the Fritz Powerbook, for instance.)  The book supplied as standard
by the program author is normally the book that the author expects to be used
for best results.  I see very little qualification as to which opening book is
being used, and we cannot just assume that all results have used the strongest
opening book.  The details must be quoted.

Style of Opening Book
Many programs in the marketplace have opening book styles which can affect the
strength of the program.  If the style is not quoted with the results, we
cannot simply assume that only the strongest opening book style was used for
all test results.

Program Learning Function
Has the program been playing for some time with learning enabled?  Certain
lines may have been closed off by the program because it has lost using them
and considers that, under its evaluation algorithm, those lines are now
inferior (rightly or wrongly).  If the program is re-installed it is possible
that this learning file is overwritten, which leaves the program open to
losing through the same weaknesses it had previously marked to avoid.
Conversely, if the learning facility is switched off, the program is more
likely to lose two replica games by the same error, as it is not learning from
previous games.

Program Playing style
Style of play (e.g. normal, aggressive, solid, active, etc.).  Greater
influence can be exerted here than with some of the other criteria mentioned.

Ply Depth Setting
Depth of selective search, where this can be adjusted, has a strong influence
on the results of a program.  If this is not quoted, then how can we be sure
that any particular tester has taken this criterion into account and adjusted
for optimised play?  In fact, if the ply depth setting is not mentioned, how
do we know if the tester is aware of the adjustment or its influence on the
playing strength of the program?

Other Internal Program Adjustments
Any other adjustments from the strongest settings that may have an effect on
the playing strength of the program (e.g. permanent brain in or out, brute
force in or out, random move in or out).  If these are not quoted, then how
can we be sure that any particular tester has taken these criteria into
account and adjusted for optimised play?  In fact, as above, if they are not
mentioned, how do we know if the tester is aware of the adjustments or their
influence on the playing strength of the program?

Default Strongest Mode
One big plus is the strongest playing mode option now being included in
certain programs (e.g. Hiarcs 7.01 and Rebel 10c).  However, this option may
not specify the opening book that must be used, which could have a marked
effect on the playing strength of the program.

Operator Skill
In years gone by, variances in thinking time per move during the opponent's
thinking time, caused by delays in transferring moves between programs in
computer v computer games, had less of an effect on the final results
(ignoring auto-tester interfaces for the present).  With the advent of faster
and faster hardware, even small delays in transferring a move allow the
program waiting for its opponent's reply to think deeper and find a winning
combination or winning line in the delayed time.  If the waiting program
predicts the correct response from its opponent, then it has an advantage
imposed by operator delay.  That delay may have been nothing more than a
natural break taken by the operator at a crucial turning point in a game.
Details of this nature are rarely included in the results of the game, so the
facts are lost forever, and one program performs better than normal whilst the
other shows poorer results.


Hash RAM
The amount of RAM made available for the program to use for hash table
purposes can have a marked effect on the results of a program, especially in
an endgame situation.  The combination of program level, size of hash tables,
and speed of processor also matters.  For example, at blitz or fast time
controls large hash tables may be a disadvantage for some programs.  At
relatively slow time controls, say with Fritz32, 128Mb of hash is sometimes
insufficient to cater for the needs of the program.  The manual suggests that
Fritz is not performing optimally if it has run out of hash RAM.  If this
situation occurs in a game then the program is disadvantaged and could suffer
a consequent loss.  Is this information being supplied when it occurs as
results are submitted?  I suspect not!  How many testers are running 128Mb
machines?  How many testers are running 64Mb machines with 400 or 450MHz
processors?  They are likely to meet the hash wall problem sooner and more
often.  I meet it on my 350MHz machine, and I have 128Mb of RAM on the
machine.  Does this mean I should upgrade to a minimum of 256Mb of RAM?  Does
the type of RAM have an influence on the running of the program?  SDRAM is
supposedly faster than EDO RAM, but by how much, and would it have an effect
on the results?  I would expect the effect to be minimal at slower processor
speeds, but as processor speeds increase I would expect the effect imparted by
the RAM speed to grow.
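
As a rough illustration of why the hash wall arrives sooner on faster machines
(all figures below are assumptions for illustration; real entry sizes and node
rates vary by program and position):

    # Rough sketch of when a search can fill its hash tables.
    HASH_MB = 128                # hash RAM made available to the program
    ENTRY_BYTES = 16             # assumed size of one hash-table entry
    NODES_PER_SEC = 300_000      # assumed search speed on a ~350MHz machine

    entries = HASH_MB * 1024 * 1024 // ENTRY_BYTES
    seconds_to_fill = entries / NODES_PER_SEC
    print(f"{entries:,} entries, filled after ~{seconds_to_fill:.0f}s of search")

    # ~8.4 million entries, filled after roughly half a minute of search -
    # well within a single long think at slow time controls.  Double the
    # processor speed and the wall arrives in half the time.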


Operating System Setup
Closer control of the operating environment will enable more repeatable
results, but I see virtually no details given of control over the other
processes running in a specific environment.  For example, if we are running
Win98, have we switched the screen saver off?  Have we cancelled as many other
processes as we possibly can?  Have we paused the task scheduler?  Any of
these can have an influence on the processing time available to the program.
At a critical time a winning advantage can peter out into a dull draw, or a
draw can become a loss, where prime processor time was delayed slightly,
especially where a program has time control problems.

Operating System Environment
The operating system under which the program is being run can have an effect
on the speed of execution of the program.  Are we using native DOS, a DOS
shell under Win95/98, or a DOS shell under NT4?  Are we running a Windows
program under Win95/98 or NT4?  Is the program being run as 16-bit or 32-bit?
The combination of processor, operating system, and the way the chess program
was written will affect the speed with which the program executes in certain
environments.  For example, if we run a DOS-based program in a DOS shell under
Win95 on a Pentium Pro 200MHz machine, I would expect worse results than
running the same program in a DOS shell on a Pentium MMX 200MHz.  This is
because the Pentium Pro chip was designed by Intel to be optimised for running
32-bit code and 32-bit operating systems; it is well known that 16-bit
operating systems run at inferior speeds on a Pentium Pro, due to the
conversion processes that come into operation behind the scenes.  I would
expect to see better results if the same program were run in a DOS shell under
NT4, as that operating system can take advantage of the specifics of the
Pentium Pro instruction set.  However, there is still a slowdown in the
execution of 16-bit code, and then a further slowdown for the DOS shell.  But
how many chess program testers are aware of this, and try to avoid inferior
environments for chess programs?  As the details are not always quoted,
certainly not operating systems, the full facts remain largely unknown.  As a
result strong programs can be misrepresented by publishing results where
programs have been run under inferior conditions, and then unwittingly
omitting to quote the operating environment specifics.
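
One thing any tester could do to quantify this for their own setup is to time
an identical, deterministic workload under each environment and compare the
ratios.  A minimal sketch of the idea (the loop below is an arbitrary
CPU-bound stand-in, not a chess search):

    import time

    def workload():
        # Fixed, deterministic stand-in for a search; any identical
        # CPU-bound job will do for comparing environments.
        total = 0
        for i in range(5_000_000):
            total += i * i
        return total

    # Run unchanged under native DOS, a DOS shell, Win95/98, NT4, screen
    # saver on/off, etc., and record the timing for each environment.
    start = time.perf_counter()
    workload()
    print(f"fixed workload took {time.perf_counter() - start:.2f}s here")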

Interim Program versions
There is now a new factor that has come into play and can have a marked effect
on the results of a chess program.  Program authors are now running web sites
and allowing interim downloads of bug fixes and small increases in program
strength.  This has resulted in a, b, and c versions of programs being
available in the marketplace (or, in the case of Hiarcs, an interim version
number of the main release version 7).  Often results are quoted and the
version details of the program are again not given with them.  The relevance
of the minor versions is being lost where their results are amalgamated with
results of the base version of the program.

Repeated Games Correlation
Another consideration that I have very rarely seen discussed is the
correlation of results for one program from several different owners.  It is
possible that the same program could play the same game under several
different owners, and all owners submit their results to the same results
list.  This list may well have several games recorded as wins or losses for
one program when those results are one and the same game, which in my opinion
should only ever count as one win or loss for the life of that program.
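
Where the gamescores are supplied, detecting this mechanically is
straightforward.  A sketch (the record format is my own invention, not any
existing list's):

    from collections import Counter

    def game_key(white, black, moves):
        # Two submissions describe the same game if the same two programs
        # produced the identical move sequence.
        return (white, black, tuple(moves))

    def deduplicate(games):
        """Collapse identical games submitted by different owners.

        'games' is an iterable of (white, black, moves) tuples.
        """
        counts = Counter(game_key(*g) for g in games)
        unique_games = list(counts)
        duplicates = {k: n for k, n in counts.items() if n > 1}
        return unique_games, duplicates

Each distinct game would then count exactly once in the list, however many
owners happened to submit it.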

Supply of Gamescores
I am also aware that, given the resources available, chess media editors do
not always have time to examine every game submitted to back up submitted
results.  I have known results to be submitted for publication without the
backup of the gamescores being supplied.  Naturally editors are only going to
allow this situation where their testers are known and reliable, but it leads
to an inability to substantiate the results published.  It also makes it
impossible to determine trends where a program is taking several losses: if
the gamescores are not supplied, how can the program author improve the
program?  Once again this can have commercial implications for the programmers
concerned.

Program Levels
The levels at which programs are set for testing purposes can, I consider,
have a marked effect on the results of that program.  Some programs are strong
at faster time controls but run into the law of diminishing returns as longer
time controls are set.  Some media editors are happy to amalgamate all results
for one program as long as a minimum time control is used (say game in 1hr).
Naturally, to avoid BT-phonebook-sized newsletters, it is sensible to
amalgamate results where appropriate.  In previous years, with slower
processor speeds, amalgamation of results at differing levels was not so much
of a problem.  I think we have a different situation now, where different
levels have a critical effect on a program's results.  I am aware we can draw
the parallel that human GMs play at varying speeds and have one grading.
However, if we are to get a true assessment of a program's strengths, we must
categorise the results in more detail.  This will enable prospective
purchasers to make better judgements before purchasing and will give the
program authors better pointers as to where they should concentrate their
efforts for future improvements.
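
Pulling the criteria above together, the kind of record I have in mind would
accompany every submitted result with something like the following (the field
names are my own suggestion, not any existing list's format):

    from dataclasses import dataclass

    @dataclass
    class TestConditions:
        """One suggested record of the conditions behind a submitted result."""
        program: str                # e.g. "Hiarcs"
        version: str                # including interim a/b/c versions
        opening_book: str           # which book file was loaded
        book_style: str             # e.g. "normal", "aggressive"
        learning_enabled: bool
        playing_style: str
        selective_ply_depth: int    # where adjustable
        other_settings: str         # permanent brain, brute force, random move
        processor: str              # e.g. "Pentium II 350MHz"
        ram_mb: int
        hash_mb: int
        operating_system: str       # e.g. "DOS shell under NT4"
        time_control: str           # e.g. "game in 1hr"

Any result quoted without these details could then be flagged as incomplete
rather than silently amalgamated with fully specified results.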

Summary and Conclusions
The above criteria for changes in the running environment may or may not,
individually, be sufficient to show large variances in results for any one
particular program.  However, put all the variances together, multiply them by
the number of units sold for which results are submitted, and we have a
reasonable recipe for sufficient variance to consider current results
unreliable, unless all criteria are quoted for each game that is played.

It could be considered that what we are doing here is system testing a
computer program.  In the real world a very scientific approach is taken to
the testing phase, and the gaping holes detailed above would render the
testing process invalid due to the number of unknowns involved.

At present there are a number of chess computer result lists around the world.
To my knowledge not one of them caters for the variances that can be imposed
on a program.  With this in mind I have to pose the question: what is the
value of reporting test results when the test criteria are largely
unpublished?

I do enjoy pitting chess computers against each other to determine which is
the stronger, but the wider issue of publishing results, which may very well
influence the commercial success of the programs, is a topic that requires
more work and much greater control, in my opinion.

We want to reward good chess programmers for their improvements by reporting
true results, rather than unwittingly hiding those improvements behind
inferior program setups and the inferior results they produce.

I suspect there are a number of chess computer owners out there who have a
tremendous amount of enthusiasm for the game, and perhaps specifically for
computer testing, but who unfortunately, due to their background, are
unscientific in their approach to the testing process.  I would not want to
discourage these enthusiasts from further testing.  However, if we can point
out the criteria that can affect the success of a program, perhaps we can
raise the integrity of chess computer testing to a level where results are
reliable and properly qualified in most if not all cases.

Has anyone else out there got any ideas on this subject?

I would be interested to obtain other points of view on the subject.

Keith Kitson




