Author: Ratko V Tomic
Date: 15:01:50 11/21/99
> Don't get me wrong, I'm not objecting to that there is a negative effect
> with increased depth, I was simply questioning the explanations you
> gave ("increased chance of choosing the incorrect branch" and "the
> uncertainty increases as the values back up the tree"). They didn't
> sound right to me.

A great deal of the apparent disagreement here is due to the fuzzy conceptual boundaries in the conventional terminology between the "true chess game" (TCG) and the model of the chess game (MCG) that a chess program creates. The TCG and MCG, while similar, are not the same game.

Namely, the TCG is a structure which contains the entire chess tree, with an exact value 1, 1/2 or 0 at each leaf node, and the minimax rule of value propagation (defining it thus as a 2-player game), which assigns to each inner node a minimaxed value (1, 1/2 or 0) and a ply count to the nearest (in the minimax traversal sense) leaf node having the same value. The TCG structure, representing a finite game, is therefore a static tree, like a preinitialized array of constants in memory, frozen in time, with no dynamics or events.

The MCG is a scaled model that a chess program builds internally in an attempt to simulate the TCG. The MCG structure always contains a skeletal tree which, in terms of the nodes (board positions) and the connectivity (legal moves), as far as it goes, is identical to the corresponding skeletal tree fragment of the TCG. But the similarity ends at that skeletal identity. The node values, which are the essential defining component of a full definition of a game, differ between the TCG and the MCG. Of course, we can easily scale down the MCG extremal values to match the TCG values 1, 1/2 and 0. But even after this scaling to emphasize similarity, the MCG node values take not just the 3 values that TCG nodes do, but the entire spectrum of values in between the three.

One may try to get around this problem by "quantizing" the spectrum of MCG values, assigning to each MCG node the nearest of the 3 values 1, 1/2 or 0. But then, even though the MCG tree will now have the 3 value numbers on each node, just like the TCG, these numbers won't be the same as those of the TCG, i.e. even this "quantized" form of the MCG defines a complete game different from the TCG. The MCG will also lack the "minimum distance to the leaf" numbers (as defined by the TCG minimax rule and its leaf values), since the program's algorithm can't produce these correctly (other than exceptionally, in the tablebase case, or partially, i.e. only for some nodes reachable from the root, in the forced mate or draw cases).

So we're dealing here with two different games, sharing the same skeleton tree (as far as the MCG builds it), but quite different in the remaining key aspects which make up the full definition of a game. While the MCG mimics the minimax motions of the TCG, the numbers (values) it is playing with are from a fictitious game of its own creation. Additionally, unlike the TCG, which is a static, frozen-in-time structure, the MCG is a dynamic structure in constant creation and destruction of the node values and the tree fragments which are brought out into the program's searchlight focus, only to sink back, microseconds later, into the surrounding darkness. Unlike the fixed 3 values of the TCG at each node, the value pointer that the MCG has on each node oscillates as its computations continue.
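As a concrete (toy, non-chess) illustration of the TCG propagation rule just described, here is a minimal Python sketch; the `Node` structure is an invented stand-in for the real chess tree. It backs up the exact leaf values {0, 1/2, 1} together with the ply distance to the nearest leaf carrying the backed-up value:

    from math import inf

    class Node:
        """Hypothetical tree node: leaves carry an exact result, inner nodes don't."""
        def __init__(self, result=None, children=()):
            self.result = result            # 1, 0.5 or 0 at leaves; None inside
            self.children = list(children)

    def tcg_value(node, white_to_move):
        """Return (minimaxed value, ply distance to nearest leaf with that value)."""
        if not node.children:               # leaf: exact value, distance 0
            return node.result, 0
        best_value = -inf if white_to_move else inf
        best_dist = inf
        for child in node.children:
            value, dist = tcg_value(child, not white_to_move)
            better = value > best_value if white_to_move else value < best_value
            if better:
                best_value, best_dist = value, dist + 1
            elif value == best_value:       # equal value: keep the shortest path
                best_dist = min(best_dist, dist + 1)
        return best_value, best_dist

    # A tiny tree: White to move can win at once or win in 3 plies;
    # the minimaxed value is 1, distance 1 to the nearest winning leaf.
    root = Node(children=[Node(result=1),
                          Node(children=[Node(children=[Node(result=1)])])])
    print(tcg_value(root, True))            # -> (1, 1)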
While the wild pointer swings occur mainly in the initial iterations, small (0.1-0.2 pawn) oscillations persist throughout the computation, coming from the positional terms which bring in (report, notice) different things from the different depths and different branches being examined. This dynamical-system nature of the MCG justifies the use of terms such as forces, spreading, propagation or oscillations. The additional judgment-loaded attributes (positive/negative, error, worsening, etc.) we attach to those dynamical terms refer to how close the MCG values come to the TCG values, and whether a change brings them closer to or farther from the TCG values.

If a change in some dynamics-defining parameter within the MCG (say the depth parameter, D to D+1) causes the MCG values to come closer to the TCG values, we can characterize this as a positive force (on the MCG value pointer) induced by the D->D+1 change within that fragment of the skeletal tree. But in another fragment of the tree, the change D->D+1 will drive the MCG value pointer away from the corresponding TCG value. Here we can say that the D->D+1 change induces negative forces on the MCG value pointer. (Similarly, when we talk about tuning the opening book to the program, we can look at it as redirecting or restricting the MCG movements away from the tree regions where the program has earlier encountered the negative forces.)

Now, all of that is meaningless if one is talking only about the TCG (in fact, when talking about chess programs and their properties, evaluations etc., one is always talking about the MCG). There is no dynamics or change of anything there. But the MCG is a different game, and such considerations are perfectly meaningful in that game. Furthermore, MCG(D) is a slightly different game from MCG(D+1). But since both are dynamical, we can speak of different forces acting on each, and since they share even more common structure between themselves than either does with the TCG, we can compare these forces and their directions (do they drive, and when, how and why, the MCG pointers toward or away from the corresponding 3 TCG values) at the same nodes in the shared skeleton.

In that sense it is perfectly meaningful to observe that MCG(D+1) will, at some significant fraction of the nodes, experience forces (acting on its value pointer) which are more negative than those of MCG(D), and strongly enough so to allow MCG(D) to win 20% of the games. In some of those games won by MCG(D), as you also suggested, both programs may have passed through the same negative-force region (i.e. neither program knew they were blundering), and by pure luck this joint sleepwalking led to a decisive gain by MCG(D). But it is also well known that in about 17 percent of cases, increasing the depth changes the decision at the root, and not necessarily to a better one (as the observed existence of the "joint sleepwalking" demonstrates). Although I haven't seen experiments where the new "best" moves (discovered by D+1 relative to D's "best" move) were classified into objectively better/worse categories, I would expect that in at least 10 percent of cases (D+1)'s new choice would be objectively worse than D's choice.
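(The 17-percent figure is the kind of statistic one would measure with a harness along these lines; `search(position, depth)` is a hypothetical stand-in for a program's fixed-depth root search, not any particular engine's API:)

    def root_change_rate(positions, search, depth):
        """Fraction of test positions where depth+1 changes the root choice."""
        changed = 0
        for pos in positions:
            if search(pos, depth + 1) != search(pos, depth):
                changed += 1
        return changed / len(positions)

    # e.g. root_change_rate(test_suite, engine_search, 8) would come out near
    # 0.17 according to the figure cited above; nothing here says whether the
    # changed choices are objectively better or worse -- that needs the
    # separate classification discussed in the text.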
And if, in order to better separate the different causes, one discards the cases where D+1 made a better choice for the obvious reason, i.e. by seeing a tactical gain not visible at depth < D+1, then among the remaining cases the D-ply program can only do better, since the tactical-vision gain is always in favor of the D+1 program.

Human chess players have, of course, each their own internal model of the TCG, a personal MCG, different from the programs' MCGs and from the MCGs of other human players. Looking at the larger picture, building internal scale models and running them forward to examine the consequences of different moves, before making a decision in the "real" world, is a fundamental algorithm, a modus operandi, of any adaptive system, from bacteria (or even more primitive subsystems, the genetic engines) through humans, up to even larger systems (human organizations, societies). It is cheaper and safer to try out the responses first in the internal model world than it is to blunder in the unforgiving (no take-back) "real" world. For almost all of our moment-to-moment activity this modelling occurs without our conscious attention. The models are so ingrown into our thinking that we often forget (and some never realize) that what any of us calls and thinks of as the "real" world is an internal "toy" world, as it were, that each brain makes up all on its own, by picking out and remembering the innumerable patterns, regularities and correlations among the electric pulses that reach it. It's therefore not surprising that we make chess programs in the same basic mold.

> On the other hand, I think you might be right about that
> there exists such negative effects (although observing that a
> D-ply deep search occasionally wins a D+1 search is not as
> such an indicator that there are). I do agree with you, that
> the quality of the evaluation function can worsen with
> increased depth, and I think this is a far more plausible
> explanation.

Well, the wins of D vs D+1 are an _indicator_ (i.e. a phenomenon consistent with, even suggestive of, the existence) of negative forces which appear with increased depth, not itself an _explanation_. The score isn't an explanation or a cause but an effect. As one of several possible explanations (or causes) I suggested (as a generalization of the specific mechanism Bob suggested) the nonlinearity of the positional evaluation function, which makes comparisons between positions which are farther apart, be it in space/locale of action or in time (as they would often be at greater depth), less reliable than comparisons among only slightly different positions (as they are at shallower depths), since the linear approximation (the adding up of terms) is generally more reliable for smaller perturbations.

(In the continuous domain one can expand an analytic function as f(x,y) = f(x0,y0) + A1*dx + B1*dy + A2*dx^2 + C11*dx*dy + B2*dy^2 + ...; for a linear function A2, A3, ..., B2, B3, ..., C11, ... are all 0, so the value is exactly the sum of the [rescaled] separate perturbations dx, dy; for a "mildly" nonlinear f(x,y), A2, B2, ... are small and the first-order sum works well too, provided dx, dy are "small"; otherwise the higher-order terms with dx^2 etc. start having an effect. The depth tuning that Bob mentioned apparently would still keep the linear form (i.e. add up the gains), but it would rescale A1 and B1 so that the plane's slope matches the slope of the nonlinear surface better. While this approach can fix one type of problem observed, it can worsen others, thus making the program's play more unstable.)
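To put numbers on the expansion argument, here is a small sketch with an invented "mildly nonlinear" surface f(x,y) (coefficients chosen arbitrarily): the sum of the separate first-order gains tracks f well for small perturbations and drifts quadratically for large ones:

    A1, B1 = 1.0, 0.8               # first-order coefficients
    A2, B2, C11 = 0.05, 0.04, 0.03  # small higher-order terms ("mild" nonlinearity)

    def f(dx, dy):                  # the "true" evaluation surface
        return A1*dx + B1*dy + A2*dx**2 + B2*dy**2 + C11*dx*dy

    def linear_sum(dx, dy):         # adding up the separate first-order gains
        return A1*dx + B1*dy

    for dx, dy in [(0.1, 0.1), (1.0, 1.0), (3.0, 3.0)]:
        print(dx, dy, round(f(dx, dy) - linear_sum(dx, dy), 4))
    # errors: 0.0012, 0.12, 1.08 -- comparisons between far-apart positions
    # (large dx, dy) are the least reliable, as argued above.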
Samuel's checkers learning algorithm took this into account: he didn't merely add up the various separate positional gains, but used lookup tables, indexed by the multiple types of positional values, giving a result which didn't have to be a sum of the index values (the program tuned this nonlinear mapping via self-play). Now, it is true that chess has a much larger evaluation parameter space than checkers, but today's computers have much more memory (for the tables) and speed (for self-play tuning of these larger tables) than those of the 1950s, so I think it would be feasible to implement such a scheme in present-day chess programs (instead of the hand-tuning that chess programmers do).

This nonlinearity as a cause (of positional evaluation worsening with depth) is in fact the same thing as your first explanation. The second cause you mention, the tuning of the evaluation functions to a non-representative sample of positions (to the human-style positions), since at larger depths the program will encounter more of the "obscure" positions, is probably contributory as well.

The error propagation is a phenomenon of wider scope than these other, specific ones. While those specific causes occur at the leaf nodes (due to imperfections in the evaluation function or the approximations used), the error propagation is a property of the overall decision algorithm (minimaxing with inaccurate leaf node values, whatever the causes of the leaf inaccuracies may be). The core of this decision algorithm is effectively a knockout-style "tournament" among the candidate leaf nodes (except that instead of the binary tree of the common knockout tournament, this one has a tournament tree with whatever the effective branching factor of the program is), where the winner of each round rises up to the next level to compete with the other winners from the previous rounds, while the losers are eliminated from the competition. If there is a clear-cut strongest candidate (such as an early termination node or a decisive material gain), this method works well. But, as in human tournaments of this kind in any sport, if you have roughly equally matched candidates, the outcome of each step is a tossup. In such situations in human knockout tournaments, repeating the tournament would produce another winner from among the top players. In a chess program, the tiniest change in the evaluation parameters, or in the move generator ordering (which affects how the preliminary sorting will order the search in case of several equal preliminary scores), or just searching at the next depth, will similarly produce a different "best" leaf node (and a different "best" move at the root) in each run. The full decision algorithm in such situations is basically working off the evaluation function noise, jumping from one meaningless random blip to another. Unlike the nonlinearity problem, where the program has difficulty comparing apples with oranges, and which can be somewhat alleviated (by adding a hand-tuned scaling to the evaluation terms, i.e. a conversion factor, as it were, between apples and oranges), here it is comparing small bits of apples with small bits of oranges, mixed with bits of everything else; the greater the depth, the smaller the bits and the greater the mix.
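The "tossup" claim is easy to demonstrate in simulation. In the sketch below (all numbers invented), three root moves have true values within 0.02 pawns of each other, but each leaf evaluation carries Gaussian noise of 0.15 pawns; rerunning the minimax, the "best" move jumps around rather than settling on the truly best one:

    import random

    def noisy_minimax(depth, maximizing, true_value, noise):
        """Minimax over a uniform tree whose leaves return truth + noise."""
        if depth == 0:
            return true_value + random.gauss(0, noise)
        vals = [noisy_minimax(depth - 1, not maximizing, true_value, noise)
                for _ in range(3)]            # branching factor 3
        return max(vals) if maximizing else min(vals)

    def chosen_root_move(depth, noise):
        # three root moves whose true values differ by far less than the noise
        scores = [noisy_minimax(depth - 1, False, tv, noise)
                  for tv in (0.00, 0.01, 0.02)]
        return scores.index(max(scores))

    random.seed(1)
    picks = [chosen_root_move(depth=6, noise=0.15) for _ in range(300)]
    print([picks.count(m) for m in range(3)])  # choices spread over all 3 moves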
So this is an intrinsic problem with minimaxing over leaf values which are inaccurate (or which are inherently hard to compare, being of different kinds), affecting the play negatively whenever a decision is made purely by tallying the small unrelated positional "gains" from larger depths. The effect on the program is a lack of coherence. To a human player this appears as if the program is very indecisive, jumping from one plan to another on every move. In fact the program doesn't have a plan other than, in the absence of a clear tactical gain within its horizon, to maximize the sum of such small gains, and wait for a tactical opportunity to somehow happen. While a human player may pursue some positional gain as part of a specific kind of plan (suited to that type of positional gain) aimed at converting the gain into a decisive material gain, the program is trying to maximize the sum of such gains, which, due to the noise in the small (positional) evaluation differences, gives a human player the appearance that the program is pursuing plans, but different ones, often mutually exclusive or unrelated (unsupportive of each other), undoing each other on each move. The lack of coherence from move to move precludes the program from adding these small gains in the same direction, as it were, which would allow them to amount to something convertible into a more tangible gain.

What a program would benefit from in such situations is a method for breaking this symmetry (the virtual equality between different types of gains). In more sophisticated algorithms, with multilevel control, this would be achieved by actual plans being constructed, whereby the attractiveness of one kind of positional gain might increase over the others, depending on how well it fits (or supports) a given plan. With the regular alpha-beta searchers, where the control logic is flat, one may break the symmetry by dropping the terms for all but the most significant types of gains (as discovered during the initial iterations), i.e. by stripping the positional evaluation down, simplifying it (or reducing the number of discrete values each term may produce, i.e. quantizing) as the depth increases (a small sketch of this quantizing appears below, after the districting analogy). Another good symmetry-breaking device is localization, i.e. the small positional gains may be counted/scaled differently when occurring at different parts of the board (the scale factors being adjusted during the search, as a result of the earlier iterations), introducing thus a sense of mobilization and concentration of forces toward a localized objective.

This symmetry breaking in the regular (flat-control) searchers would act in a similar way that districting and hierarchy act in political voting systems: if a president (or a party) were elected by a single vote count added up over the entire country, the outcomes would often end up as close races, giving the president a weak mandate. To get around this problem (which, like similar positional evaluations, occurs when the two candidates are almost equally attractive to the public), the whole electorate is subdivided into voting districts, and the winner in a district (by however small a majority) carries the entire district forward, as if he had won the district by a 100% vote. That way even the slightest preference for one candidate magnifies his victory into a landslide. Adding more than two levels of "quantizing" (districting) magnifies the victory even further.
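The depth-dependent quantizing mentioned above could look like this minimal sketch (the grain schedule and the numbers are purely hypothetical): each positional term is snapped to a grid that coarsens with depth, so that only the dominant kinds of gain survive deep in the tree:

    def quantize(score, grain):
        """Snap a positional score (in pawns) to the nearest multiple of grain."""
        return round(score / grain) * grain

    def positional_eval(terms, depth):
        grain = 0.01 * (1 + depth // 4)  # hypothetical schedule: coarser with depth
        return sum(quantize(t, grain) for t in terms)

    # A 0.017-pawn term survives at depth 2 (grain 0.01) but rounds away to 0
    # at depth 12 (grain 0.04), leaving the decision to the larger gains:
    print(positional_eval([0.017, 0.25, -0.12], depth=2))   # ~0.15
    print(positional_eval([0.017, 0.25, -0.12], depth=12))  # ~0.12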
In decision algorithms with multilevel control, the "fit to plan" criterion breaks the symmetry by initiating the mini-searches from each higher-level node under specific constraints, thus with a custom evaluation function made for that node and that particular search only. The same high-level node may serve again as the root of other searches with a different evaluation function (produced by another plan being examined). In this scheme the decisive (top-level) comparisons are made not between leaf node values but between whole plans, each in turn being allowed to unfold in as single-minded a way as it "wishes" and to report the best it can do when given the full resources (a custom evaluation function all to its own needs).

Flat-level searchers could somewhat simulate this behaviour by redoing the searches from the root with different evaluation functions (e.g. weighted differently for different areas of the board, say queenside, center, kingside, or back-rank space control, or a pure tactical-gain search, etc.). While each search may be shallower, one would get the effect of several different programs being set to work on the same position, with the net results from each compared. Since these narrowed searches would be much more constrained, and thus prune more efficiently than the general "search for anything," they need not necessarily run N times slower for N searches (for the same reason that a constrained search for, say, a checkmate can run several times faster than a general search to the same depth).
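A flat searcher could organize this "several programs on one position" idea along the following lines; `search_root` and the profile weights are hypothetical stand-ins, not any real engine's interface:

    # Each profile is a single-minded weighting of evaluation terms by objective;
    # one constrained search is run per profile, and the most promising plan wins.
    PROFILES = {
        "queenside": {"qside": 3.0, "center": 1.0, "kside": 0.3, "tactics": 1.0},
        "center":    {"qside": 0.3, "center": 3.0, "kside": 0.3, "tactics": 1.0},
        "kingside":  {"qside": 0.3, "center": 1.0, "kside": 3.0, "tactics": 1.0},
        "tactical":  {"qside": 0.0, "center": 0.0, "kside": 0.0, "tactics": 1.0},
    }

    def best_plan(position, search_root, depth):
        """search_root(position, depth, weights) -> (score, move); stand-in API."""
        results = {name: search_root(position, depth, weights)
                   for name, weights in PROFILES.items()}
        name, (score, move) = max(results.items(), key=lambda kv: kv[1][0])
        return name, score, move

Because each profile zeroes or shrinks most terms, each constrained search prunes harder than a general one, which is why the N searches need not cost N times a single general search, per the point above.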