Computer Chess Club Archives



Subject: Re: Junior's long lines: more data about this....

Author: Don Dailey

Date: 16:37:36 12/28/97



>Yes. Self-test is a strange thing. I've been doing it for a month, and
>discovered you have to take the results with care.
>
>The first experiments I did were self-test program A against A (that is
>EXACTLY the same program playing both sides). I use a set of predefined
>openings, each program playing white, then black in each opening. 10
>openings give 20 games for example.
>
>What a surprise: self-playing a program against itself in these
>conditions almost never gives the expected 50% result!
>
>What should have been a simple procedure then becomes a nightmare.
>
>First of all, why doesn't the match end in a 50% result??? Answer:
>because of timing errors. I use the PC internal timer to measure the
>thinking time, and the clock precision is 1/18.2 second (roughly). A
>search beginning right after a clock tick is different from the same
>search started just BEFORE a clock tick, the difference being the
>measured time. So a chess engine could sometimes decide to continue a
>search, and sometimes decide to stop it, randomly...
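That tick effect is easy to illustrate. Here is a small sketch (Python, purely for illustration; the start times are made up, only the 1/18.2 s tick size comes from the post above):

```python
import math

TICK = 1 / 18.2   # BIOS timer granularity: one tick is about 55 ms

def measured_time(start, elapsed):
    # Time as seen through a tick counter: whole ticks observed
    # between the start and the end of the search.
    ticks = math.floor((start + elapsed) / TICK) - math.floor(start / TICK)
    return ticks * TICK

# The same 0.10 s search, started just after vs. just before a tick:
a = measured_time(0.001, 0.10)   # begins right after a tick boundary
b = measured_time(0.054, 0.10)   # begins just before the next tick
print(round(a, 3), round(b, 3))  # 0.055 0.11 -- one tick vs. two ticks
```

The identical search is charged one tick in one case and two in the other, so a time-based stopping rule can go either way, and the game diverges from there.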
>
>Second point: OK, I understand why the result is not exactly 50%. But
>shouldn't it be close to 50%? And how close? Answer: make several
>matches between exactly the same programs, and write down the minimum
>and maximum score you get.
>
>Ok, good idea. Let's try... Well, the results range from 39% to 58% when
>I do 20-game matches. When I try 24-game matches, I get 46%-56%.
>
>In order to stop the nightmare, I now do 28-game matches (14 openings),
>and decide the result is significant if it is below or equal to 40%, or
>above or equal to 60%.
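Those ranges are just sampling noise. With n independent decisive games between two equally strong programs, the score fraction has a standard deviation of about 0.5/sqrt(n); draws shrink it further. A quick sketch of the numbers:

```python
import math

def match_sigma(n_games, draw_rate=0.0):
    # Standard deviation of the score fraction when both sides are
    # equally strong.  A decisive game scores 0 or 1 (variance 0.25);
    # a draw scores exactly 0.5, the mean, so it adds no variance.
    return math.sqrt(0.25 * (1 - draw_rate) / n_games)

for n in (20, 28, 100, 400):
    print(n, round(100 * match_sigma(n), 1))   # sigma in percentage points
```

For 20 games sigma is about 11 percentage points, which matches the 39%-58% spread above, and even the 60% cutoff on 28 games is barely one sigma when every game is decisive. That is why only fairly large improvements clear such a threshold reliably, and why the 400-game batches described below cut the noise to about 2.5 points.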
>
>The problem is that a change has to be quite good to win a match by 60%
>or more! If it is only a slight improvement, chances are you will never
>notice it by self-play! But I don't know any better way, because I still
>want my self-test sessions to be reasonably short, so I can test many
>ideas. A 28-game match at blitz rate usually takes more than 4 hours, so
>in 1 day I can only run 3 tests... I use a P100 computer to do self-test
>24h/day, and a K5-100 computer to program on. I use self-test to verify
>tactical ideas only. I haven't yet tried to test positional evaluation
>changes, and I'm afraid to see what could happen in that case.


Here is how I do SELF testing and some of my thoughts on it.

I have 200 very shallow openings and they go to exactly 5 moves for
each player, or 10 ply.  Larry Kaufman picked them so that they are
all relatively normal and span a lot of theory, but do not get you too
deeply into the game (so we test opening heuristics too).

200 openings gives us a total of 400 games, so we try to test in
batches of 400 games.
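That scheme can be sketched like this (Python, for illustration only; `play_game` and the engines are hypothetical stand-ins, not anyone's real tester):

```python
def run_batch(openings, engine_a, engine_b, play_game):
    # Every opening is played twice, with colors swapped, so the opening
    # set itself cannot favor either program.  `play_game` is a
    # hypothetical callable returning White's score: 1, 0.5, or 0.
    score_a = 0.0
    for opening in openings:
        score_a += play_game(opening, white=engine_a, black=engine_b)
        # Colors swapped: A's score is 1 minus White's score.
        score_a += 1.0 - play_game(opening, white=engine_b, black=engine_a)
    return score_a, 2 * len(openings)

# Stand-in where White always wins: each pair splits 1-1, so two
# identical deterministic engines come out at exactly 50%.
white_wins = lambda opening, white, black: 1.0
score, games = run_batch(range(200), "A", "A", white_wins)
print(score, games)   # 200.0 400
```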

There are 2 types of levels I use: fixed depth and fixed nodes.
Fixed depth is self-explanatory.  I rarely use this level, but it
can be useful for debugging and for proving the value of certain
algorithms.  For instance, if I try out a new extension idea, it
had better at least improve the play at fixed depths.

The other level is more interesting.  It is fixed nodes, and I
designed it specifically for self-testing.  The program searches
until exactly the specified number of nodes has been searched and
then stops.  This is very nice for self-testing because it gives
me platform independence.  I have a lot of systems I can, and do,
test on, and they vary greatly in the power of the hardware and
even the operating system.  But with fixed-node testing I can
directly compare any result from any platform without any worry.
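A fixed-node level amounts to counting every node visited and aborting the moment the budget runs out. A rough sketch on a toy game tree (Python; this is an assumed minimal structure, not the author's actual code):

```python
class OutOfNodes(Exception):
    pass

class Searcher:
    # Node-limited negamax on an abstract game: `moves_fn` lists legal
    # moves, `apply_fn` plays one, `eval_fn` scores a leaf.  The only
    # "clock" is the node counter, so the search behaves identically
    # on any hardware.
    def __init__(self, node_limit):
        self.node_limit = node_limit
        self.nodes = 0

    def negamax(self, state, depth, moves_fn, apply_fn, eval_fn):
        self.nodes += 1
        if self.nodes > self.node_limit:
            # Budget spent: abort (a real engine would fall back on the
            # best move found before the limit was hit).
            raise OutOfNodes
        moves = moves_fn(state)
        if depth == 0 or not moves:
            return eval_fn(state)
        return max(-self.negamax(apply_fn(state, m), depth - 1,
                                 moves_fn, apply_fn, eval_fn)
                   for m in moves)

# Toy game: a state is an int whose "moves" lead to 2s and 2s+1.
moves_fn = lambda s: [2 * s, 2 * s + 1] if s < 8 else []
run1, run2 = Searcher(10_000), Searcher(10_000)
v1 = run1.negamax(1, 3, moves_fn, lambda s, m: m, lambda s: s)
v2 = run2.negamax(1, 3, moves_fn, lambda s, m: m, lambda s: s)
print(v1 == v2, run1.nodes == run2.nodes)   # True True -- reproducible
```

Two runs visit exactly the same nodes and return exactly the same value, which is the property that makes results comparable across machines.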

When I test 2 identical versions I always get exactly a 50-50
result.  I designed my program to be completely deterministic
(in this kind of game), and this has also proven to be a good
debugging aid.

I never do thinking on the opponent's time with this kind of testing.
I believe it has no serious impact on the results (since both sides
would benefit from it equally), so it is a waste of machine resources.
Also, it would completely defeat all the effort I put into isolating
the results from the hardware performance.

I built the auto-tester I use myself; it spawns off two separate
chess programs.  The two can be completely different programs or
the same program with different command line options, so I can test
any version going back quite some time.  From time to time I test
against a version several months old to see if I have made any
progress.
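The spawning idea can be sketched with two child processes (Python, for illustration; the stand-in "engine" script and the `--new-eval` flag are made up so the sketch stays self-contained):

```python
import subprocess
import sys

# Hypothetical stand-in "engine": a separate process that reads a
# position on stdin and answers with a move on stdout.  In practice
# each side would be a real engine binary.
ENGINE_SRC = ("import sys\n"
              "for line in sys.stdin:\n"
              "    print('move:' + line.strip(), flush=True)\n")

def spawn_engine(extra_args=()):
    # Each side is its own process, so the tester can pair any two
    # builds, or one build against itself with different options.
    return subprocess.Popen(
        [sys.executable, "-c", ENGINE_SRC, *extra_args],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

white = spawn_engine()
black = spawn_engine(["--new-eval"])   # hypothetical option flag
white.stdin.write("startpos\n")
white.stdin.flush()
reply = white.stdout.readline().strip()
print(reply)   # move:startpos
for p in (white, black):
    p.terminate()
    p.wait()
```

The tester's only job after that is to relay moves between the two pipes and tally the results.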

I don't claim that this approach has no drawbacks.  If I make an
improvement that only affects the speed of the search and not the
node counts, this method will not pick it up.  But I always know
exactly what I'm testing and why, so in this case I use a general
timing test to measure my speedup and do not worry about the exact
score this kind of speedup should give me.

But all in all, I am very pleased with this methodology.  It gives
me wonderful consistency and saves me a whole lot of hassle in the
long run.  I try to apply common sense to tell me when this kind of
testing is not appropriate.

When I test evaluation improvements, the first thing I do is run
this self-test sequence to look for bugs and to "evaluate" the
algorithm itself.  I do not worry about whether the algorithm is
implemented efficiently or how much it slows the program down.  I
consider this an excellent practice, because I can test changes
very quickly this way and learn a lot in the process.  Since many
algorithms do not pass this first test, I've saved myself a lot of
time.

But when an algorithm looks like a "keeper" I must evaluate its
impact on search speed.  Often it's clear right away whether to use
the algorithm or not, but when it's a close call I have to look at
it a little harder.  My argument is that if it's a close call, it's
very difficult to measure accurately with any method.  I do not
think 1000 timed games would help me much here.

So yes, my usual method of testing has some drawbacks, but it has a
lot of advantages for me.  I have access to many machines here at
M.I.T., and they all run at different speeds or may have other jobs
running on them, but it doesn't matter with this kind of testing.

Sometimes when I'm at home I'll get a brilliant idea, implement it,
and then log in to several computers at work and start tests running.
When I get to the lab the next day I will have hundreds of games
ready for me.  Usually my brilliant idea was not so brilliant!


-- Don



Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.