Author: Don Dailey
Date: 16:37:36 12/28/97
>Yes. Self-test is a strange thing. I have been doing it for a month, and discovered you have to take the results with care.
>
>The first experiments I did were self-test program A against A (that is, EXACTLY the same program playing both sides). I use a set of predefined openings, each program playing white, then black, in each opening. 10 openings give 20 games, for example.
>
>What a surprise: self-playing a program against itself in these conditions almost never gives the expected 50% result!
>
>What should have been a simple procedure then becomes a nightmare.
>
>First of all, why doesn't the match end in a 50% result??? Answer: because of timing errors. I use the PC internal timer to measure the thinking time, and the clock precision is 1/18.2 second (roughly). A search beginning right after a clock tick is different from the same search started just BEFORE a clock tick, the difference being the measured time. So a chess engine could sometimes decide to continue a search, and sometimes decide to stop it, randomly...
>
>Second point: OK, I understand why the result is not exactly 50%. But shouldn't it be close to 50%? And how close? Answer: make several matches between exactly the same programs, and write down the minimum and maximum score you get.
>
>OK, good idea. Let's try... Well, the results range from 39% to 58% when I do 20-game matches. When I try 24-game matches, I get 46%-56%.
>
>In order to stop the nightmare, I now do 28-game matches (14 openings), and decide the result is significant if it is below or equal to 40%, or above or equal to 60%.
>
>The problem is that a change has to be quite good to win a match by 60% or more! If it is a slight improvement, chances are you never notice it by self-play! But I don't know any better way, because I still want my self-test sessions to be reasonably short, to be able to test many ideas. A 28-game match at blitz rate usually takes more than 4 hours. So in 1 day, I can only make 3 tests...
>I use a P100 computer to do self-testing 24h/day, and a K5-100 computer to program. I use self-testing to verify tactical ideas only. I haven't yet tried to test positional evaluation changes, and I'm afraid to see what could happen in that case.

Here is how I do SELF testing, and some of my thoughts on it. I have 200 very shallow openings; they go to exactly 5 moves for each player, or 10 ply. Larry Kaufman picked them so that they are all relatively normal and span a lot of theory, but do not get you too deeply into the game (so we test opening heuristics too). 200 openings give us a total of 400 games, so we try to test in batches of 400 games.

There are 2 types of levels I use: fixed depth and fixed nodes. Fixed depth is self-explanatory, and I rarely use this level, but it can be useful for debugging and for proving the value of certain algorithms. For instance, if I try out a new extension idea, it had better at least improve the play at fixed depths.

The other level is more interesting. It is fixed nodes, and I designed it specifically for self-testing. The program will search until the exact number of nodes specified has been searched and then will stop. This is very nice for self-testing because it gives me platform independence. I have a lot of systems I can, and do, test on, and they vary greatly in the power of the hardware and even the operating system. But I can directly compare any result from any platform without any worry when I do fixed-node testing. When I test 2 identical versions I always get exactly a 50-50 result. I designed my program to be completely deterministic (in these kinds of games), and this has also proven to be a good debugging aid.

I never do thinking on the opponent's time with this kind of testing. I believe that it has no serious impact on the results (since both sides would benefit from it equally) and that, because of this, it is a waste of machine resources.
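The fixed-node level is easy to sketch. In the toy example below, the "game", the move list, and the evaluation function are all invented for illustration; only the stopping rule — abort the search after exactly `node_limit` nodes, keeping the result of the last fully completed iteration — mirrors the idea described above. Given the same budget, the search returns the same move on any machine, however fast or loaded it is.

```python
# Toy fixed-node search (illustrative only, not Don's code).
# A "position" is just an integer; "moves" add 1, 2, or 3 to it.

class OutOfNodes(Exception):
    pass

def fixed_node_search(root, node_limit):
    nodes = 0
    best_move = None

    def negamax(pos, depth):
        nonlocal nodes
        nodes += 1
        if nodes >= node_limit:
            raise OutOfNodes             # stop at the exact node count
        if depth == 0:
            return pos % 7 - 3           # toy deterministic evaluation
        return max(-negamax(pos + m, depth - 1) for m in (1, 2, 3))

    try:
        depth = 1
        while True:                      # iterative deepening
            scored = [(-negamax(root + m, depth), m) for m in (1, 2, 3)]
            best_move = max(scored)[1]   # keep last completed iteration
            depth += 1
    except OutOfNodes:
        pass
    return best_move, nodes

# Identical budgets give identical results, independent of wall-clock time:
assert fixed_node_search(0, 10_000) == fixed_node_search(0, 10_000)
```

Note that the node counter, not the clock, terminates the search — which is exactly what removes the 1/18.2-second timer jitter the quoted poster ran into.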
Also, it would completely defeat all the effort I went to in order to isolate the hardware performance from the results.

The auto-tester I use I built myself; it spawns off two separate chess programs. These can be completely different programs, or the same program with different command-line options, so I can test any version going back quite some time. From time to time I test against a version several months old to see if I have made any progress.

I don't claim that this approach does not have some drawbacks. If I make an improvement that only affects the speed of the search and not the node counts, this method will not pick it up. But I always know exactly what I'm testing and why, so in this case I use a general timing test to measure my speedup and do not worry about the exact score this kind of speedup should give me. But all in all, I am very pleased with this methodology. It gives me wonderful consistency and saves me a whole lot of hassle in the long run. I try to apply common sense to tell me when this kind of testing is not appropriate.

When I test evaluation improvements, the first thing I do is run this self-test sequence to look for bugs and "evaluate" the algorithm itself. I do not worry about whether the algorithm is implemented efficiently or how much it slows the program down. I consider this an excellent practice, because I can test changes very quickly this way and learn a lot in the process. Since many algorithms do not pass this first test, I've saved myself a lot of time. But when an algorithm looks like a "keeper" I must evaluate its impact on search speed. Often it's clear right away whether to use the algorithm or not, but when it's a close call I have to look at it a little harder. My argument is that if it's a close call, it's very difficult to measure accurately with any method; I do not think 1000 timed games would help me much here. So yes, my usual method of testing has some drawbacks, but it has a lot of advantages for me.
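The score spreads in both posts line up with what binomial statistics predicts. Treating each game as an independent 50/50 result (ignoring draws, which would shrink the spread somewhat), the standard error of a match score over n games is sqrt(p(1-p)/n). A back-of-envelope check:

```python
import math

def match_score_sigma(n_games, p=0.5):
    """Standard error of the score fraction of an n-game match,
    modeling each game as an independent win/loss with win prob p."""
    return math.sqrt(p * (1 - p) / n_games)

for n in (20, 28, 400, 1000):
    print(f"{n:5d} games: one-sigma spread = +/-{match_score_sigma(n):.1%}")
# prints:
#    20 games: one-sigma spread = +/-11.2%
#    28 games: one-sigma spread = +/-9.4%
#   400 games: one-sigma spread = +/-2.5%
#  1000 games: one-sigma spread = +/-1.6%
```

So the 39%-58% range the quoted poster saw over 20 games is roughly a one-sigma band around 50%, the 40%/60% cutoff on 28 games is only about a one-sigma threshold, and even 1000 games resolves differences of just a percent or two — consistent with the remark above that 1000 timed games would not help much on a close call.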
I have access to many machines here at M.I.T., and they all run at different speeds or may have other jobs running on them, but it doesn't matter with this kind of testing. Sometimes when I'm at home I'll get a brilliant idea, implement it, and then log in to several computers at work and start tests running. When I get to the lab the next day I will have hundreds of games ready for me. Usually my brilliant idea was not so brilliant!

-- Don