Computer Chess Club Archives


Subject: Getting Testy

Author: Ricardo Gibert

Date: 13:14:47 08/07/05


It seems that testing is a significant problem for developing a chess program.
You would like to run a large number of test games at slow time controls, but
this takes far too long when you are trying to assess small improvements to your
program. A shortcut is needed that indirectly accomplishes the same thing,
despite any caveats you may have about such a shortcut.

An analogous situation occurs with the testing of pharmaceuticals. You would
like to test on humans, but this is not acceptable. Instead, they test on
animals with extra large doses. They use animals, because they can't risk human
life. They use extra large doses to simulate long term use. Both methods,
strictly speaking, are invalid, but they are accepted as good enough for culling
an experimental pharmaceutical from further consideration. It's a practical
compromise. Why not the same for chess programs?

Here's a hypothetical situation for you programmers. Let's suppose you make a
modification to your program and, to get a feel for how well the mod works, you
run a 1000-game test of 1 minute chess against a variety of other engines. The
result is, say, a 40% score, when 45% is what your program would normally have
scored without the mod, so you seem to have done worse. Now, instead of dumping
or tweaking the mod, you run another test of 1000 games, but with the somewhat
longer TC (time control) of 2 minute chess, and score 44%. An extrapolation
would suggest that at still longer TCs the modified program would do better
than the unmodified version.
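
To put the 40% and 44% figures in perspective, here is a rough sketch of the sampling noise on a 1000-game score. It assumes the games are independent and treats every game as a win or a loss; draws lower the per-game variance, so the figure is a conservative upper bound. The function name `score_margin` is made up for illustration.

```python
import math

def score_margin(score, games, z=1.96):
    """Approximate 95% margin of error for a match score (fraction of points).

    Assumes independent games and no draws; draws reduce the variance,
    so this is an upper bound rather than an exact figure.
    """
    return z * math.sqrt(score * (1.0 - score) / games)

# Scores from the hypothetical above, 1000 games each.
for label, score in [("1 min, with mod", 0.40), ("2 min, with mod", 0.44)]:
    margin = score_margin(score, 1000)
    print(f"{label}: {score:.0%} +/- {margin:.1%}")
```

Under these assumptions each 1000-game score carries a margin of roughly three percentage points, which is about the size of the differences being compared.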

Here are my questions: How reasonable is it to assume that your program will
increase its score with an even longer TC? Would you be willing to cull the mod
based on such testing? Accept it? Use such testing to gauge a mod for further
testing?

One major problem for this idea is that instead of measuring the effect of your
mod, you may be measuring the fact that your program's time management gets
relatively better at longer TCs. You can eliminate this possibility (and many
others) by testing at the 2 different fast TCs without the modification and
comparing those results with the results with the modification. Here is an
example:

    TC       1 min   2 min
    w/o mod:  45%     46%
    w   mod:  40%     44%

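Apparently an improvement with the mod as far as long TCs go: the deficit
against the unmodified version shrinks from 5 points at 1 minute to 2 points at
2 minutes.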
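
The comparison in the table above can be sketched as a difference of differences: how much more the modified version gains from the longer TC than the unmodified one does. The sketch assumes 1000 independent games per cell (the post states 1000 only for the runs with the mod) and ignores draws, which makes the standard error an upper bound. The names `variance` and `interaction` are hypothetical.

```python
import math

GAMES = 1000  # assumed games per cell; stated in the post only for the modified runs

# Scores from the table above, as fractions of available points.
scores = {
    ("base", "1min"): 0.45, ("base", "2min"): 0.46,
    ("mod", "1min"): 0.40, ("mod", "2min"): 0.44,
}

def variance(score, games):
    # Upper bound on the variance of a mean score: independent games, no draws.
    return score * (1.0 - score) / games

def interaction(scores, games):
    """Return (mod's gain from the longer TC minus the baseline's gain, standard error)."""
    gain_mod = scores[("mod", "2min")] - scores[("mod", "1min")]
    gain_base = scores[("base", "2min")] - scores[("base", "1min")]
    se = math.sqrt(sum(variance(s, games) for s in scores.values()))
    return gain_mod - gain_base, se

effect, se = interaction(scores, GAMES)
print(f"extra gain from mod at longer TC: {effect:+.1%}, standard error {se:.1%}")
```

With these numbers the extra gain is about 3 points and its standard error is also about 3 points, so under these assumptions 1000 games per cell would not clearly separate the effect from noise.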

More questions: Does this idea have any practical value, or would the results of
such testing be inconclusive too often to be useful? Has anybody already tried
this? Has anybody tried this and found it wanting? What is wrong with this idea?



Last modified: Thu, 15 Apr 21 08:11:13 -0700
