Author: Ratko V Tomic
Date: 15:01:50 11/21/99
Go up one level in this thread
> Don't get me wrong, I'm not objecting to that there is a negative effect
> with increased depth, I was simply questioning the explanations you
> gave ("increased chance of choosing the incorrect branch" and "the
> uncertainty increases as the values back up the tree").  They didn't
> sound right to me.
Great deal of the apparent disagreement here is due to the
fuzzy conceptual boundaries in the conventional terminology
between the "true chess game" (TCG) and the model of the chess
game (MCG) that chess program creates. The TCG and MCG, while
similar, are not the same game.
Namely, the TCG is a structure which contains entire chess
tree, with exact value 1, 1/2 and 0 at each leaf node and the
minimax rule of value propagation (defining it thus as a 2
player game) which assigns to each inner node a minimaxed value
(1, 1/2 or 0) and a ply count to the nearest (in minimax
traversal sense) leaf node having the same value. The TCG
structure, representing a finite game, is therefore a static
tree, like a preinitialized array of constants in memory,
frozen in time, with no dynamics or events.
The MCG is a scaled model that a chess program builds
internally in an attempt to simulate TCG. The MCG structure
always contains a skeletal tree, which in terms of the nodes
(board positions) and the connectivity (legal moves), as far as
it goes, is identical to the corresponding skeletal tree
fragment of the TCG. But the similarity ends at that skeletal
identity. The node values, which are the essential defining
component of a full definition of a game, are different between
the TCG and MCG.
Of course, we can easily scale down the MCG extremal values to
match the TCG values 1, 1/2 and 0. But even after this scaling
to emphasize similarity, the MCG node values take not just the
3 values that TCG nodes do, but the entire spectrum of values
in between the three. One may try to get around this problem by
"quantizing" the spectrum of MCG values, and assign to each MCG
node the nearest of the 3 values 1, 1/2 or 0. But then, even
though the MCG tree will now have the 3 value numbers on each
node, just like the TCG, these numbers won't be same as those
of the TCG, i.e. even this form of the "quantized" MCG defines
a different complete game from the TCG. The MCG will also be
lacking the "minimum distances to the leaf" numbers (as defined
by the TCG minimax rule & its leaf values), since program's
algorithm can't produce these correctly (other than
exceptionally, in the tablebase case or partially, i.e. only
for some nodes reachable from the root, in the forced mate or
draw cases).
So we're dealing here with two different games, sharing the
same skeleton tree (as far as the MCG builds it), but quite
different in the remaining key aspects which make up the full
definition of a game. While MCG mimicks the minimax motions of
the TCG, the numbers (values) it is playing with are from a
fictitious game of its own creation.
Additionally, unlike the TCG, which is a static, frozen in
time, structure, the MCG is a dynamic structure in constant
creation and destruction of the node values and the tree
fragments which are being brought out into the program's
searchlight focus, only to sink back, microseconds later, into
the surrounding darkness. Unlike the fixed 3 values of TCG at
each node, the value pointer that MCG has on each node
oscillates as its computations continue. While the wild pointer
swings occur mainly in the initial iterations, the small
(0.1-0.2 pawns) oscillations persist throughout the
computation, coming from the positional terms which bring in
(report, notice) different things from different depths and
different branches being examined.
This dynamical system nature of MCG justifies the use of terms
such as forces or spreading or propagation or oscillations. The
additional judgment loaded attributes (positive/negative,
error, worsening, etc) we attach to those dynamical terms refer
to how close to the TCG values the MCG values come and whether
their change brings them closer of farther from the TCG values.
If a change in some dynamics defining parameter within the MCG
(say the D parameter to D+1) causes the MCG values to come
closer to the TCG values, we can characterize this as a
positive force (on the MCG value pointer) induced by the D->D+1
change within that fragment of the skeletal tree. But in
another fragment of the tree, the change of D->D+1 will drive
the MCG value pointer away from the corresponding TCG value.
Here we can say that the D->D+1 change induces negative forces
on the MCG value pointer. (Similarly, when we talk about the
tuning of the opening book to the program, we can look at it as
redirecting or restricting the MCG movements away from the tree
regions where the program has earlier encountered the negative
forces.)
Now, all that is meaningless if one is talking only about the
TCG (in fact, when talking about chess programs and their
properties, evaluations... etc, one is always talking about
MCG). There is no dynamics or change of anything there. But MCG
is a different game and such considerations are perfectly
meaningful in that game. Furthermore, the MCG(D) is a slightly
different game from the MCG(D+1). But since both are dynamical,
we can speak of different forces acting on each, and since they
do share even more common structure among themselves than
either does with TCG, we can compare these forces and their
directions (do they drive, and when, how and why, the MCG
pointers toward or away from the corresponding 3 TCG values) at
the same nodes in the shared skeleton. In that sense it is
perfectly meaningful to observe that MCG(D+1) will at some
significant fraction of the nodes experience forces (acting on
its value pointer) which are more negative than those of
MCG(D), and strongly enough so, to allow the MCG(D) to win 20%
of the games.
In some of those games won by the MCG(D), as you also
suggested, both programs may have experienced the same negative
force region (i.e. neither program knew they were blundering),
and by pure luck this joint sleepwalking lead to the decisive
gain by the MCG(D). But, it is also well known that in about 17
percent of cases, increasing depth changes the decision at the
root, not necessarily to the better one (as the observation of
the existence of the "joint sleepwalking" demonstrates).
Although I haven't seen experiments where the new "best" moves
(discovered by D+1 relative to D's "best" move) were classified
in the more objectively better/worse categories, I would expect
that in at least 10 percent of cases, (D+1)'s new choice would
be objectively worse than the D's choice. And if, in order to
separate better different causes, one discards the cases when
D+1 had made a better choice for obvious reason, i.e. by seeing
tactical gain not visible at depth< D+1, among the remaining
cases the D ply program can do only better since the tactical
vision gain is always in favor of the D+1 program.
Human chess players have, of course, each their own internal
model of the TCG, a personal MCG, different from programs' MCGs
and from the MCGs of other human players. Looking at the larger
picture, the internal scale model building and running forward
such models to examine consequences of different moves before
making a decision in the "real" world is a fundamental
algorithm, a modus operandi, of any adaptive system, from
bacteria (or even more primitive subsystems, the genetic
engines) through humans, or even larger systems (human
organizations, societies). It is cheaper and safer to try out
the responses first in the internal model world than it is to
blunder in the unforgiving (no take-back) "real" world. For
almost all of our moment to moment activity this modelling
occurs without our conscious attention. The models are so
ingrown into our thinking that we often forget (and some never
realize) that what any of us calls and thinks of as the "real"
world is an internal "toy" world, as it were, that each brain
makes up all on its own, by picking out and remembering the
innumerable patterns, regularities and correlations among the
electric pulses that reach it. It's therefore not surprising
that we make the chess programs in the same basic mold.
> On the other hand, I think you might be right about that
> there exists such  negative effects (although observing that a
> D-ply deep search occasionally  wins a D+1 search is not as
> such an indicator that there are).  I do agree with you, that
> the quality of the evaluation function can worsen with
> increased depth, and I think this is a far more plausible
> explanation.
Well the wins od D vs D+1 are an _indicator_ (i.e. a phenomenon
consistent with, even suggestive of existence) of negative
forces which appear with increased depth, not itself an
_explanation_. The score isn't an explanation or a cause but an
effect. As one of several possible explanations (or causes) I
suggested (as a generalization of the specific mechanism Bob
suggested) the nonlinearity of the positional evaluation
function, which makes the comparisons between positions which
are farther apart, be it in space/locale of action or time, (as
they would often be with greater depth) less reliable than
comparisons among only slightly different positions (as they
are at the shallower depths), since the linear approximation
(the adding up of terms) is generally more reliable for smaller
perturbations.
(In continuous domain one can expand analytical function as
f(x,y) = f(x0,y0) + A1*dx + B1*dy + A2*dx^2 + C11*dx*dy +
B2*dy^2 +...; for linear function A2, A3,.. B2,B3, C11,... are
all 0 so the value is exactly the sum of the [rescaled]
separate perturbations dx, dy; for "mildly" nonlinear f(x,y),
A2, B2,... are small and the first order sum works well, too,
provided dx, dy are "small", otherwise the higher order terms
with dx^2 etc start having an effect. The depth tuning that Bob
mentioned apparently would still keep the linear form (i.e. add
up the gains), but it would rescale A1 and B1 so that the plane
slope would match the slope of the nonlinear surface better.
While this apprach can fix one type of problems observed, it
can worsen others, making thus the programs play more
unstable.)
The Samuels' checkers learning algorithm took this into account
when he didn't merely add up various separate positional gains,
but had used lookup tables indexed by the multiple types of
positional values and giving the result which didn't have to be
a sum of the index values (the program tuned this non-linear
mapping via the self-play). Now, it is true that chess has much
larger evaluation parameter space then checkers, but todays
computers have much more memory (for tables) and speed (for
self-play tuning of these larger tables) than those of 1950s,
so I think it would be feasable to implement such scheme in the
present day chess programs (instead of hand-tuning that chess
programmers do).
This nonlinearity as a cause (of positional evaluation
worsening with depth) is in fact the same thing as your first
explanation. The second cause you mention, the tuning of the
evaluation functions to the non-representative sample of
positions (to the human-style positions), since at larger
depths the program will encounter more of the "obscure"
postions, is probably contributory, as well.
The error propagation is a phenomenon of wider scope than these
other specific ones. While those specific causes occur at the
leaf nodes (due to imperfections in the evaluation function or
approximations used), the error propagation is a property of
the overall decision algorithm (minimaxing with inaccurate leaf
node values, whatever the causes of the leaf innacuracies may
be). The core of this decision algorithm is effectively a
knockout style "tournament" among the candidate leaf nodes
(except that instead of a binary tree of the common knockout
tournament, this one has a tournament tree with whatever the
effective branching factor of the program is), where a winner
from each round rises up to the next level to compete with the
other winners from the previous rounds, while losers are
eliminated from the competition.
If there is a clear cut strongest candidate (such as early
termination node or a decisive material gain), this method
works well. But, like in human tournaments of this kind in any
sport, if you have roughly equally matched candidates, the
outcome in each step is a tossup. In such situations in human
knockout torunaments, repeating the tournament would produce
another winner among the top players. In a chess program, any
tiniest change in the evaluation parameters, or the move
generator ordering (which affects how the preliminary sorting
will order the search in case of several equal preliminary
scores) or just searching at the next depth, will similarly
produce different "best" leaf node (and a different "best" move
from the root) in each run.
The full decision algorithm in such situations is basically
working off the evaluation function noise, jumping from one
meaningless random blip to another. Unlike the nonlinearity
problem, where the program has difficulty in comparing apples
with oranges, and which can be somewhat aleviated (by adding a
hand tuned scaling to evaluation terms, i.e. the conversion
factor, as it were, between apples and oranges), here it is
comparing small bits of apples with small bits of oranges,
mixed with bits of everything else, the greater the depth the
smaller the bits and greater the mix. So this is an intrinsic
problem with minimaxing over the inaccurate leaf values (or
which are inherently hard to compare between different kinds of
values), affecting negatively the play whenever a decision is
made purely on tallying the small unrelated positional "gains"
from larger depths.
The effect on the program is a lack of coherence. To a human
player this appears as if the program is very indecisive,
jumping from one plan to another on every move. In fact the
program doesn't have a plan other than, in the absence of clear
tactical gain within its horizon, to maximize the sum of such
small gains, and wait for a tactical opportunity to somehow
happen. While a human player may pursue some positional gain as
a part of a specific kind of plan (as suitable to such type of
positional gain) aimed at converting this gain into a decisive
material gain, the program is trying to maximize the sum of
such gains, which due to noise in the small (positional)
evaluation differences, gives the appearance to the human
player as if the program is pursuing plans, but different ones,
often mutually exclusive or unrelated (unsupportive of each
other), undoing each other on each move. The lack of coherence
from move to move precludes program from adding these small
gains in the same direction, as it were, which would allow them
to amount to something convertable to a more tangible gain.
What program would benefit from in such situations is a method
for breaking this symmetry (the virtual equality between
different types of gains). In a more sophisticated algorithms,
with multilevel control, this would be achieved by actual plans
being constructed and the attractivness of one kind of
positional gain might increase over the others, depending on
how well it fits to (or supports) a given plan. With the
regular alpha-beta searchers, where the control logic is flat,
one may break symmetry by superficially dropping the terms for
all but the most significant types of gains (as discovered
during initial iterations), i.e. by stripping the positional
evaluation down, simplifying it (or reducing the number of
discrete values each term may produce, i.e. quantizing) as the
depth increases. Another good symmetry breaking device is
localization, i.e. the small positional gains may be
counted/scaled differently if occuring at different parts of
the board (the scale factors would be adjusted during the
search, as resalt of earlier iterations), introducing thus a
sense of mobilization and concentration of forces toward a
localized objectives.
This symmetry breaking in the regular (flat control) searchers
would act in a similar way that districting and hiararchy act
in the political voting systems, i.e. if a president (or a
party) were elected with a single voting added together over
entire country, the outcomes would often end up in close races,
thus giving a president a weak mandate. To get around this
problem (which like similar positional evaluations, occurs when
the two candidates are almost equally attractive to the
public), the whole electorate is subdivided into voting
districts and a winner in one district (by however small
majority) ends up carrying the entire district forward, as if
it had won the district by a 100% vote. That way even the
slightest preference for one candidate magnifies his victory
into a landslide. Adding more than two levels of "quantizing"
(districting) magnifies the victory even further.
In the decision algorithms with multilevel control, the "fit to
plan" criteria breaks the symmetry by initiating the
mini-searches from each higher level node under specific
constraints, thus containing a custom evaluation function made
for that node and that particular search only. The same high
level node may serve again as the root for other searches with
different evaluation function (produced by another plan being
examined). In this scheme the decisive (top level) comparisons
are made not between the leaf node values but between the whole
plans, each in turn being allowed to unfold in as single-minded
way as it "wishes" and to report what best it can do when given
the full resources (the custom evaluation function all to its
own needs).
The flat-level searchers could simulate somewhat this behaviour
by redoing the searches from the root with different evaluation
functions (e.g. weighed differently for different areas of the
board, say, a queenside, center, kingside, back ranks space
control, or pure tactical gain search etc.). While the search
may be shallower, one would get an effect as if several
different programs were set off to work on the same position
and the net results from each were compared. Since these
narrowed searches would be much more constrained, thus prune
more efficiently than the general "search for enything," they
need not necessarily run N times slower for N searches (for the
same reason that the constrained search for, say, a checkmate
can run several times faster than a general search could to the
same depth).
This page took 0.01 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.