Computer Chess Club Archives


Subject: Re: Junior's long lines: more data about this....

Author: Don Dailey

Date: 20:38:12 12/28/97


On December 28, 1997 at 22:34:10, Fernando Villegas wrote:

>Don:
>Can you explain to me why it is not better (if it is not better) to test
>a program, or its different versions, just by testing them with a really
>big set of tactical and positional problems where the best and second
>best moves have been established?  I presume that the program or version
>that solves more, and/or in less time, should be the better one. It cannot
>be otherwise. On the contrary, testing each of them against the others
>leaves open the chance that a certain version that objectively, by chess
>standards, is not the best is at the same time precisely the kind that
>defeats the real best one. You know, as happens between human players,
>where sometimes there is a "black beast" (bête noire) even for the best,
>and nobody knows why. I mean, maybe you can thrash precisely the big
>winner because you faced it with its black beast. OK, I know you also use
>this method, but I was thinking of everything you said in the talk with
>Theron about this system of playing hundreds of games, and I recalled this
>weird phenomenon of "precisely" the black beast. I would be very afraid of
>losing something really good because of that kind of sudden-death method.
>Fernando, just a customer with a big mouth.

Hi Fernando,

I use problem sets for debugging and first-order testing of tactical
improvements, extensions, etc.   I have a few sets I understand well;
I know which algorithms each problem responds to and why.  So they
work well for debugging.

But so far I've never seen a great problem set that can accurately
measure improvements to a program.   It's almost as if a lot of
algorithms that help on problems hurt real play.

I do believe a good set could be constructed, but it would have to
contain a lot of positions of all different kinds.  There should be
many positional problems in it too.

But another problem is that problem sets are very difficult to
construct.  It often turns out that there are multiple solutions,
or there is some real question about whether a move really is
best.  When Larry Kaufman constructed sets for us he often threw
out problems after a while; in short, even his set evolved over
time.

Yet another problem with problem sets is that they don't measure
how well your program integrates everything into actual play.  Your
program may have many weaknesses that a problem set could uncover,
but it may also be very skillful at avoiding those weaknesses.
Humans play this way too.  If I cannot handle a particular opening
very well, I simply avoid it!   If someone tested me on this opening
they might conclude I'm even weaker than I really am.

The real issue with any kind of testing is how much information you
get back for your investment in time.   If version A wins 1 game
against version B, you know almost nothing about their relative
strength and have only a very weak hypothesis that A is stronger.  And
if they are closely matched, it could take a lot of games to determine
which one is really the favorite.

So not much information is contained in a single game.   But you can
think of a game as a problem set that is generated on the fly.  The
program that makes the best choices in this problem set will be the
winner.   But each position carries a different amount of information.
A completely obvious move tells you little about the program, since any
program will quickly find it.   But a really sophisticated move can
tell you much more.   How do you separate out these positions?

I do not know.

-- Don


I did a really interesting study several years ago.  I took a small
problem set and adjusted the weights of the problems so that the
weighted score predicted the Swedish ratings of several programs.  You
can use various methods to do this; I used a genetic algorithm.  I was
able to come up with a formula which was very accurate, within about
10 points, for ANY program that was involved in the test.
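A fit of this kind could look roughly like the sketch below.  It is not
the formula from that study: the model (a base rating plus one weight
per solved problem), the GA settings and all the data are assumptions,
and HYPOTHETICAL_SOLVED / HYPOTHETICAL_RATINGS are made-up numbers, not
Swedish list ratings.

```python
import random

# Hypothetical data for illustration only: 1 = program solved the problem.
HYPOTHETICAL_SOLVED = [
    [1, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
HYPOTHETICAL_RATINGS = [2250.0, 2300.0, 2380.0, 2360.0, 2440.0, 2500.0]


def predict(genome, solved_row):
    """Predicted rating: genome[0] is a base value, genome[1:] are per-problem weights."""
    return genome[0] + sum(w for w, s in zip(genome[1:], solved_row) if s)


def mean_abs_error(genome, solved, ratings):
    """Average absolute gap between predicted and actual ratings."""
    gaps = [abs(predict(genome, row) - r) for row, r in zip(solved, ratings)]
    return sum(gaps) / len(gaps)


def evolve(solved, ratings, generations=300, pop_size=60, step=10.0, seed=1):
    """Elitist GA: keep the best quarter each generation and refill the
    population with uniform-crossover children, each gene mutated with
    probability 0.2 by Gaussian noise of size `step`."""
    rng = random.Random(seed)
    n = len(solved[0])
    mean_rating = sum(ratings) / len(ratings)

    def error(genome):
        return mean_abs_error(genome, solved, ratings)

    # Seed with the "predict the average rating" genome; since the elite is
    # always kept, the final answer can never be worse than this baseline.
    pop = [[mean_rating] + [0.0] * n]
    pop += [[mean_rating] + [rng.gauss(0.0, 20.0) for _ in range(n)]
            for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=error)
        elite = pop[: pop_size // 4]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            child = [g + rng.gauss(0.0, step) if rng.random() < 0.2 else g
                     for g in child]
            children.append(child)
        pop = elite + children
    best = min(pop, key=error)
    return best, error(best)


if __name__ == "__main__":
    genome, err = evolve(HYPOTHETICAL_SOLVED, HYPOTHETICAL_RATINGS)
    print(f"base={genome[0]:.0f}  weights={[round(w) for w in genome[1:]]}")
    print(f"mean absolute error: {err:.1f} rating points")
```

With more free weights than rated programs, as in this toy data, even
a near-perfect fit says little; the more accurately rated programs go
into the fit, the more a small error actually means.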

This test should probably be repeated, with as many accurately rated
programs as possible.  The opening book hacks that some programmers
may be using could hurt the accuracy of this test, though, since there
is a possibility that the book is the main source of a program's
strength.

To be really accurate, I think it's a mistake to count only total
problems solved.  Time of solution should be a factor, because it is
meaningful information that should not be thrown away.

But counting solved problems turns out to be the simplest thing to do;
it's much harder to construct a good scoring function for problem sets
that takes time into account and also handles problems that are never
solved.
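One possible shape for such a function is sketched below, purely as an
illustration.  The 60-second limit, the 1-second floor and the choice
that any solution earns at least half credit are all arbitrary
assumptions, and problem_credit / suite_score are hypothetical names.

```python
import math


def problem_credit(solve_time, time_limit=60.0, floor=1.0):
    """Credit for one problem (assumed scheme, not an established one).

    Unsolved (None, or slower than time_limit) earns 0. Otherwise credit
    falls from 1.0 (solved within `floor` seconds) to 0.5 (solved right at
    the limit), linearly in the logarithm of the solution time.
    """
    if solve_time is None or solve_time > time_limit:
        return 0.0
    t = max(solve_time, floor)
    slowness = math.log(t / floor) / math.log(time_limit / floor)  # 0..1
    return 1.0 - 0.5 * slowness


def suite_score(solve_times, weights=None, time_limit=60.0):
    """Weighted sum of per-problem credits; weights default to 1 each."""
    if weights is None:
        weights = [1.0] * len(solve_times)
    return sum(w * problem_credit(t, time_limit)
               for w, t in zip(weights, solve_times))


if __name__ == "__main__":
    # Solved instantly, solved at the limit, never solved.
    print(suite_score([0.5, 60.0, None]))
```

The log scale makes going from 2 to 4 seconds cost as much as going
from 20 to 40 seconds, and the optional weights leave room for
per-problem weights like the ones fitted above.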

-- Don




Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.