Computer Chess Club Archives



Subject: methods for testing changes

Author: Steffen Jakob

Date: 02:57:54 05/30/01


Hi all,

it seems as if the last modification I made caused Hossa to play
worse. What I need are some good methods for testing. At the moment my
testing is rather chaotic: I observe games on ICC, and if I see
something strange I try to fix it. Then I let Hossa play on ICC again
and look at how it does. "to see how it works" is surely not a good
way to test, and the ICC rating isn't reliable either.

I know from others that they do basically the same thing. I am not
happy with this at all. I would much rather have a well-defined
strategy for testing changes automatically. Here is some short
brainstorming; I would like to get some feedback.

- test suites (I don't like test suites very much for testing, because
  those positions are mostly very tactical and rather
  exceptional). Different test suites should be used which emphasize
  different themes (tactics, endgames, pawn endgames, exchange
  sacs, ...); see the first sketch below.
  data to compare: #solutions, #nodes, time

- static eval tests: I am thinking of a set of positions where I don't
  look for a best move but for a static eval value. An eval range would
  be assigned to each position, and if the engine's static eval is
  within this range, then it matches (second sketch below).
  data to compare: #solutions

- static eval order: this is similar to the point above. Here I want
  to specify a set of ordered positions, with the "best" position
  ordered first, and so on. This point is interesting: here you can
  test whether the engine prefers certain patterns over others (third
  sketch below).
  data to compare: #solutions

- effective branching factor test: given a set of "typical" positions
  for the opening, middlegame, and endgame, the engine searches each
  position for a fixed time and records the effective branching factor
  (fourth sketch below).
  data to compare: effective branching factors

- % fail highs on the first move: similar to the above. For the same
  set of "typical" positions, the percentage of first-move fail highs
  is measured (fifth sketch below).
  data to compare: %

- self games: a reasonable number of games is played against older
  versions of the same program. Which openings? Learning on or off?
  Maybe the openings should be fixed so that the results are more
  comparable (last sketch below).
  data to compare: score at different time controls

- matches vs other programs: similar to the above: a reasonable number
  of games is played against other programs; the same match sketch
  below applies.
  data to compare: score at different time controls


All of these tests can be run automatically and produce numbers which
can be compared directly between versions.
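
To make this concrete, here are some rough sketches in Python of how
the tests might be scripted. None of this is Hossa's actual code; the
engine hooks (search, evaluate, play_game) are hypothetical stand-ins
for whatever interface the engine exposes. First, a test suite runner
over EPD records with a "bm" (best move) opcode, tallying #solutions,
#nodes and time:

    import time

    def parse_epd(line):
        # Split an EPD record into the position and its "bm" operands.
        # Simplified: real EPD also allows "am", quoted operands, etc.
        fields = line.split()
        position = " ".join(fields[:4])  # placement, side, castling, ep
        best_moves = []
        for op in " ".join(fields[4:]).split(";"):
            op = op.strip()
            if op.startswith("bm "):
                best_moves = op[3:].split()
        return position, best_moves

    def run_suite(epd_lines, search, seconds=10):
        # search(position, seconds) -> (move, nodes) is a hypothetical
        # hook into the engine; the move must be in the same notation
        # as the suite's "bm" operands.
        solved, nodes_sum, time_sum = 0, 0, 0.0
        for line in epd_lines:
            position, best_moves = parse_epd(line)
            start = time.time()
            move, nodes = search(position, seconds)
            time_sum += time.time() - start
            nodes_sum += nodes
            if move in best_moves:
                solved += 1
        return solved, nodes_sum, time_sum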
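
The static eval range test is the easiest one to script: each position
carries an assigned interval, and it counts as solved when the
engine's static eval lands inside it. evaluate(position) is again a
hypothetical hook, returning, say, centipawns:

    def eval_range_test(cases, evaluate):
        # cases: (position, low, high) tuples, one eval range per position.
        return sum(1 for position, low, high in cases
                   if low <= evaluate(position) <= high)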
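
For the static eval order test, the positions come pre-sorted from
best to worst. One simple "#solutions" number is how many adjacent
pairs the engine ranks in the same order as the reference ordering:

    def eval_order_test(ordered_positions, evaluate):
        # ordered_positions is sorted best-first by hand; count the
        # adjacent pairs where the engine agrees, i.e. eval[i] >= eval[i+1].
        scores = [evaluate(p) for p in ordered_positions]
        return sum(1 for a, b in zip(scores, scores[1:]) if a >= b)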
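
For the effective branching factor, one common definition is the ratio
of total node counts between successive iterative-deepening
iterations; the per-position ratios can then be averaged and tracked
across versions. For example, node counts of 1000, 4000 and 15000 give
factors of 4.0 and 3.75:

    def effective_branching_factors(node_counts):
        # node_counts[d] = total nodes searched to complete iteration d,
        # as reported by the engine's iterative deepening loop.
        return [later / earlier
                for earlier, later in zip(node_counts, node_counts[1:])]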
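
The first-move fail-high percentage has to be counted inside the
search itself: at every beta cutoff, record whether the move that
caused it was the first one tried. A sketch of the bookkeeping (how it
might look, not how any particular engine does it):

    fail_highs = 0
    first_move_fail_highs = 0

    def record_fail_high(move_index):
        # Call at every beta cutoff; move_index is 0 for the first move.
        global fail_highs, first_move_fail_highs
        fail_highs += 1
        if move_index == 0:
            first_move_fail_highs += 1

    def fail_high_percentage():
        return 100.0 * first_move_fail_highs / max(fail_highs, 1)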
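
Finally, self-play and matches against other programs can share one
harness. Playing each fixed opening twice with colors swapped keeps
the results comparable between runs; play_game(white, black, opening)
is a hypothetical driver returning 1, 0.5 or 0 from White's point of
view:

    def run_match(engine_a, engine_b, openings, play_game):
        # Returns engine_a's score and the total number of games played.
        score_a = 0.0
        for opening in openings:
            score_a += play_game(engine_a, engine_b, opening)        # A white
            score_a += 1.0 - play_game(engine_b, engine_a, opening)  # A black
        return score_a, 2 * len(openings)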

Greetings,
Steffen.


