Author: Keith Kitson
Date: 04:49:40 04/18/99
To all Chess Computer Enthusiasts,

For some time now I have been watching the various testing environments and considering how much information is used to describe the conditions under which chess programs are tested against other programs. I now feel it is time to consider the variances that can affect the results of any chess program but remain unspecified.

We are now at the stage where the gains in program strength from version to version are smaller, as chess programmers reach a flattening learning curve. These small increases in strength can be negated by variations in hardware and operating system much more easily than in the past. A program could very well perform worse than its predecessor because the hardware and operating system combination used cancels out the benefits of the code improvements. And because hardware categories are amalgamated (grouped for convenience) and no operating system details are quoted, the inferior-hardware results are hidden inside the general category used to record results for any specific chess program.

With this in mind, I feel we need to be much more specific about how we describe the environments under which chess program results are quoted. Otherwise we are in danger of quoting results obtained under inferior environments alongside those from superior environments, and losing the differences forever by rolling them into the same category of results. I can understand the need for chess media editors to keep things simple, which helps with layout, reduces the number of pages, and so on, but I reckon we have reached a point where we need to be more specific in these areas. If we are not, there is a real danger that hard-earned small code improvements will not show their true potential in the marketplace, and that authors will become disillusioned and give up the upgrading process altogether. That would be a sad loss for chess programming in general.

To illustrate my point: a number of years ago the Novag Super Forte was showing poorer results in some modes in the Swedish lists (if my memory serves me well, it was to do with the ply depth the operator could set for the selective search). Once this was determined, it was decided to separate the weaker mode from the results list, as it was clearly inferior and not representative of the stronger modes. The problem came about because results from varying modes and program settings were rolled into one category (pulling the better results obtained in the stronger mode down towards the weaker end), when in fact there were at least two clear categories the results should have been recorded under, one clearly inferior to the other.

I think we have a parallel now. Because the increases in program strength are smaller, the variations in hardware, operating systems and internal program settings being rolled into one reporting category are enough to drag the better results down, to the point where code improvements are all but negated. What is perhaps more worrying is that the great majority may not be aware of the effect that a change in operating environment can have on the playing ability of a specific chess program, even with a fixed setup for that program.
So results are genuinely being submitted for programs with the following details largely unquoted:

1. Choice of Opening Book

Which opening book is being used? (A massive 7.7Mb book is not necessarily stronger; the Fritz PowerBook, for instance.) The book supplied as standard by the program author is normally the book the author expects to be used for best results. I see very little qualification as to which opening book is being used, and we cannot just assume that all results were obtained with the strongest book. The details must be quoted.

2. Style of Opening Book

Many programs on the market have opening book styles that can affect the strength of the program. If the style is not quoted with the results, we cannot assume that only the strongest book style was used for all test results.

3. Program Learning Function

Whether the program has been playing for some time with learning enabled, and whether certain lines have been closed off because the program lost with them and now considers them, under its evaluation algorithm, to be inferior (rightly or wrongly). If the program is reinstalled, the learning file may be overwritten, leaving the program open to losing through the same weaknesses it had previously marked to avoid. Conversely, if the learning facility is switched off, the program is more likely to lose two identical games to the same error, as it is not learning from previous games.

4. Program Playing Style

The style of play (e.g. normal, aggressive, solid, active). Greater influence can be exerted here than with some of the other criteria mentioned.

5. Ply Depth Setting

The depth of selective search, where this can be adjusted, has a strong influence on the results of a program. If this is not quoted, how can we be sure that a tester has taken this criterion into account and adjusted it for optimised play? In fact, if the ply depth setting is not mentioned, how do we know the tester is even aware of the adjustment or its influence on the playing strength of the program?

6. Other Internal Program Adjustments

Any other departures from the strongest settings that may affect the playing strength of the program (e.g. permanent brain on or off, brute force on or off, random move on or off). If these are not quoted, how can we be sure a tester has taken these criteria into account and adjusted for optimised play? As above, if they are not mentioned, how do we know the tester is aware of the adjustments or their influence on playing strength?

7. Default Strongest Mode

One big plus is the strongest-playing-mode option now included in certain programs (e.g. Hiarcs 7.01 and Rebel 10c). However, this option may not specify the opening book that must be used, which can have a marked effect on the playing strength of the program.
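To make the point concrete, here is a minimal sketch, in Python, of the sort of record a results list would need to keep for every game before amalgamating anything. The field names are purely illustrative inventions of mine, not the format of any existing list; they cover the points above together with the hardware and operating system details discussed further below.

    from dataclasses import dataclass

    @dataclass
    class GameResult:
        """One computer-v-computer game with its full test environment.
        All field names are illustrative, not from any real rating list."""
        program: str            # include the exact version, e.g. "Hiarcs 7.01"
        opponent: str
        result: float           # 1.0 win, 0.5 draw, 0.0 loss, from program's side
        opening_book: str       # which book file was loaded
        book_style: str         # e.g. "normal", "aggressive"
        learning_enabled: bool  # and whether the learn file was fresh or mature
        playing_style: str      # e.g. "normal", "solid", "active"
        selective_depth: int    # ply depth setting, where adjustable
        permanent_brain: bool
        hash_mb: int            # hash RAM allocated to the program
        cpu: str                # e.g. "Pentium II 350MHz"
        ram_mb: int
        operating_system: str   # e.g. "Win98", "NT4", "native DOS"
        time_control: str       # e.g. "game in 60 minutes"
        operator_notes: str = ""  # delays, interruptions, other anomalies

A list built from records like this could still be amalgamated however an editor pleases, but the distinctions would never be lost in the process.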
Operator Skill

In years gone by, variances in thinking time per move during the opponent's thinking time, caused by delays in transferring moves between programs in computer-versus-computer games, had less effect on the final results (leaving auto-tester interfaces aside for now). With the advent of faster and faster hardware, even a small delay in transferring a move can allow the program waiting for its opponent's reply to think deeper and find a winning combination or winning line in the extra time. If the waiting program predicts the correct response from its opponent, it has gained an advantage imposed purely by operator delay. The delay may have been nothing more than a natural break taken by the operator at a crucial turning point in the game. Details of this nature are rarely included with the results, so the facts are lost forever: one program performs better than normal, while the other shows poorer results.

Hash RAM

The amount of RAM made available to the program for hash tables can have a marked effect on its results, especially in endgame situations. The combination of playing level, hash table size and processor speed matters. At blitz or fast time controls, large hash tables may be a disadvantage for some programs; at relatively slow time controls, with Fritz32 for example, 128Mb of hash is sometimes insufficient for the needs of the program. The manual suggests that Fritz is not performing optimally once it has run out of hash RAM. If this happens in a game, the program is disadvantaged and could suffer a consequent loss. Is this information being supplied when it occurs as results are submitted? I suspect not! How many testers are running 128Mb machines? How many are running 64Mb machines with 400 or 450MHz processors? They are likely to meet the hash wall sooner and more often. I meet it on my 350MHz machine, and I have 128Mb of RAM. Does this mean I should upgrade to a minimum of 256Mb? Does the type of RAM have an influence on the running of the program? SDRAM is supposedly faster than EDO RAM, but by how much, and would it affect the results? At slower processor speeds I would expect the effect to be minimal, but as processor speeds increase I would expect the effect of RAM speed to grow.
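A rough calculation shows why faster processors meet the hash wall sooner. The figures below, the nodes-per-second rate and the bytes per hash entry, are illustrative assumptions of mine rather than measured values for any particular program:

    def hash_mb_needed(nodes_per_sec, think_secs, bytes_per_entry=16):
        """Rough upper bound on the hash RAM one search could usefully
        fill, assuming every node visited could store one table entry.
        All parameters are illustrative assumptions, not measurements."""
        return nodes_per_sec * think_secs * bytes_per_entry / (1024 * 1024)

    # A hypothetical 200,000 nodes/sec program thinking 3 minutes on a move:
    print(hash_mb_needed(200_000, 180))  # ~549Mb - well past a 128Mb table
    # The same program at blitz, roughly 5 seconds per move:
    print(hash_mb_needed(200_000, 5))    # ~15Mb - a 128Mb table sits unused

Double the node rate, as each new processor generation roughly does, and the memory that a long think can consume doubles with it, which is consistent with the observation above that a 450MHz machine with 64Mb will hit the wall sooner and more often than a 350MHz machine with 128Mb.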
Operating System Setup

Closer control of the operating environment will enable more repeatable results, yet I see virtually no details given about the other processes running in a specific environment. For example, if we are running Win98, have we switched the screen saver off? Have we cancelled as many other processes as we possibly can? Have we paused the task scheduler? Any of these can affect the processing time available to the program. At a critical moment, where prime processor time is delayed slightly, a winning advantage can peter out into a dull draw, or a draw can become a loss, especially for a program with time control problems.

Operating System Environment

The operating system under which the program is run can affect its speed of execution. Are we using native DOS, a DOS shell under Win95/98, or a DOS shell under NT4? Are we running a Windows program under Win95/98 or NT4? Is the program running as 16-bit or 32-bit? The combination of processor, operating system and the way the chess program was written affects the speed at which it executes in a given environment. For example, if we run a DOS-based program in a DOS shell under Win95 on a 200MHz Pentium Pro, I would expect worse results than running the same program in a DOS shell on a 200MHz Pentium MMX. This is because the Pentium Pro was designed by Intel to be optimised for 32-bit code and 32-bit operating systems; it is well known that 16-bit code runs at inferior speed on a Pentium Pro, due to the conversion processes that come into operation behind the scenes. I would expect better results running the same program in a DOS shell under NT4, as that operating system can take advantage of the specifics of the Pentium Pro instruction set, though there is still a slowdown for executing 16-bit code, and a further slowdown for the DOS shell. But how many chess program testers are aware of this and try to avoid inferior environments for chess programs? As the details are not always quoted, certainly not the operating system, the full facts remain largely unknown. As a result, strong programs can be misrepresented by published results obtained under inferior conditions, with the environment specifics unwittingly omitted.

Interim Program Versions

A new factor has come into play that can have a marked effect on a chess program's results. Program authors now run web sites and allow interim downloads of bug fixes and small strength improvements. This has resulted in a/b/c versions of programs being available on the market, or, in the case of Hiarcs, an interim version number on the main release version 7. Results are often quoted without these version details, and the relevance of the minor versions is lost where their results are amalgamated with those of the base version of the program.

Repeated Games Correlation

Another consideration I have very rarely seen discussed is the correlation of results for one program across several different owners. The same program could play the same game under several different owners, all of whom submit their results to the same list. That list may then record several wins or losses for one program when those results are one and the same game, which in my opinion should only ever count as one win or loss for the life of that program.
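One way a list maintainer could screen for this, sketched here in Python under the assumption that games arrive as move lists (the dictionary format is my own illustration, not any list's actual submission format), is to key each game on its players and exact move sequence:

    def deduplicate_games(games):
        """Collapse identical games submitted by different owners.
        Each game is assumed to be a dict with 'white', 'black' and
        'moves' (the full move list) - an illustrative format only.
        Submissions with the same players and move sequence count once."""
        seen = set()
        unique = []
        for game in games:
            key = (game["white"], game["black"], tuple(game["moves"]))
            if key not in seen:
                seen.add(key)
                unique.append(game)
        return unique

An identical move sequence between the same two programs is almost certainly the same deterministic game replayed, so counting it once seems the safer default.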
Supply of Gamescores

I am also aware that, given the resources available, chess media editors do not always have time to examine every game submitted to back up the results. I have known results to be submitted for publication without the supporting gamescores being supplied. Naturally, editors will only allow this where their testers are known and reliable, but it leaves the published results impossible to substantiate. Nor can trends be spotted where a program is taking several losses: if the gamescores are not supplied, how can the program author improve the program? Once again this can have commercial implications for the programmers concerned.

Program Levels

The levels at which programs are set for testing can, I believe, have a marked effect on their results. Some programs are strong at faster time controls but run into the law of diminishing returns as longer time controls are set. Some media editors are happy to amalgamate all results for one program as long as a minimum time control is used (say game in one hour). Naturally, to avoid newsletters the size of a BT phone book, it is sensible to amalgamate results where appropriate. In previous years, with slower processor speeds, amalgamating results across differing levels was not so much of a problem. I think we have a different situation now, where different levels have a critical effect on a program's results. I am aware we can draw the parallel that human GMs play at varying speeds and have a single grading. However, if we are to get a true assessment of a program's strength, we must categorise the results in more detail. This will enable prospective purchasers to make better judgements before buying, and will give program authors better pointers as to where to concentrate their efforts for future improvements.

Summary and Conclusions

The above changes in the running environment may not, individually, be sufficient to show large variances in the results of any one program. Put all the variances together, however, and multiply them by the number of units sold for which results are submitted, and we have a reasonable recipe for enough variance to consider current results unreliable, unless all criteria are quoted for each game played. What we are doing here could be considered system testing of a computer program. In the commercial world a very scientific approach is taken to the testing phase, and the gaping holes detailed above would render the testing process invalid because of the number of unknowns. At present there are a number of chess computer result lists around the world. To my knowledge, not one of them caters for the variances that can be imposed on a program. With this in mind I have to pose the question: what is the value of reporting test results when the test criteria are largely unpublished?

I do enjoy pitting chess computers against each other to determine which is the stronger, but the wider issue of publishing results, which may very well influence the commercial success of the programs, is a topic that requires more work and much greater control, in my opinion. We want to reward good chess programmers for their improvements by reporting true results, rather than unwittingly hiding their progress behind inferior setups and the inferior results they produce. I suspect there are a number of chess computer owners out there with a tremendous amount of enthusiasm for the game, and perhaps for computer testing specifically, who unfortunately, due to their background, are unscientific in their approach to the testing process. I would not want to discourage these enthusiasts from further testing. However, if we can point out the criteria that can affect the success of a program, perhaps we can raise the integrity of chess computer testing to a level where most results are reliable and qualified in most, if not all, cases.

Has anyone else out there got any ideas on this subject? I would be interested to hear other points of view.

Keith Kitson