Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: High performance move generation

Author: Ratko V Tomic

Date: 08:37:50 11/19/99

Go up one level in this thread


>> instruction (which are more expensive than straight fall through).
>> In assembly language one can easily code this so that
>> (piece,source_square) pair of values retreives a pointer to a code
>> label which has the right number of directions and steps each direction
>> coded to execute without jumps or loops and to change the direction
>> (on default end of direction), all in a single block of a fall through
> >code.
>
> I think you can as well do this in C with the good old goto?

C doesn't have a FORTRAN style "computed goto." The closest thing it has is a
switch statement, but its shortcomings vs the assembly labguage are:

1) You have to code each "case" as separate statement. What do you do if you
have, say, 12*64=768 labels (or even more than 12 piece types if you add castle
capable & ep subtypes)? Type several thousands by hand? In assembler you make an
iterating (by count) macros of the handful basic types (12 or so) and the
dispatch table which allows jumping into the right place inside the repeated
macro block which ahs no explicit labels.

2) The switch statement will at best (even if you somehow type in [using C
macros and editor's cut&paste] few hundreds or thousands cases) perform doubly
indirect jump to the case statement, i.e. a (square,piece) pair will first
retrieve a number which would be used as an index into the array (built
implicitly by the compiler, without guarantee it will do so at all, and without
any control from you) of code addresses for different case statements (provided
it doesn't overflow its "switch" statement compiling buffers on switch with few
thousands case statements). Our assembly language table is indexed directly by
the piece type and square (fetched as two adjecent bytes without shifting the
bit patterns togehter) addressing thus a table of 12*256 = 3072 entries, each
being a direct address of a place in the repeated macro block where to jump; the
whole switchover (including termination at the end of the list) is then:

  LODSW            ;get next (square,piece) pair of bytes
  JMP  [EBX+EAX*4] ;dispatch to "this piece on this square" routine
                   ;ebx has an address of the 3072 dword table

If square can be represented (which it can) in the piece-square array as a
4*square (i.e. as 0, 4, 8, 12,..) then the dispatch table can be reduced to the
3072 byte table (instead of 3072 dwords) and the jump above could be of the
quicker form JMP [EBX+EAX] (i.e. without index scaling opcodes).

3) This large C switch statement, will produce code which checks the range of
the switch argument, to verify it falls within the dispatch array boundaries
(since this argument is a variable, picked out of an array, via computed index
itself picked from another array) and compiler doesn't know its inherent limits.
While on some compilers (of some versions) you can use pragmas to help it avoid
some checking, the code loses portability (e.g. on the same platform, say
Intel's processor, an assembly routine above would be portable between compilers
or versions of compilers, as a linked OBJ module or DLL, while the C code with
version specific pragmas would not be.)

4) The register variable allocation across such large switch statement, with
fall through and branching subcases, would suffer in the compiled code. The
assembly code could have a well planned policy on register usage (especially the
byte registers, which are more numerous) based on knowing which data matter,
which says the same and which can change.

5) If one needs to separate captures from non-captures (as one normally does),
the move generator is the most natural and economical place to do it. In that
case the assembly code wouldn't need the two pointers for the two output lists
(thus wasting an extra register variable, assuming it can even get to the point
where it would pick these pointers for register variables), but could push one
type of the moves on the stack (using single instruction PUSH ECX) and the other
type into an array (via single isntruction STOSW or STOSD). They could be
processed there (on the stack), if stack was temporarily redirected into the
right array, or in the worst case, they could be bulk-copied (which is quick) to
the final destinaton, before the stack gets reset.

6) Since the entire "dispatch to the next piece" assembly code is quite short, 2
 assembly instructions, one could place dispatch to next at the end of each
(piece,square) routine, reducing the extra jump to common dispatch point for
each (piece,square) item. C switch can't do that at all (i.e. one would need to
replicate the entire switch inside each "case" statement just to avoid this
extra jump for one (piece,square) item from the list, then after the 2nd item it
would have to take the extra jump to the top level switch).


Overall, for this kind of task, a good assembly code would be 4-5 times faster
than the best C code/compiler could do, it would be much easier to code (less
repetitive and error prone, due to the hundreds manually entered "case" labels
in C) and it would be more portable across the compilers and languages on the
same CPU type (like the dominant Intel CPUs, with the dominant Win32 DLL API).


> Anyway I finally became convinced that the importancy of the speed of
> the movegenerator is modest, unless you want to make a Fritz.

Well, it could benefit in another extreme, too, at the opposite end of the
spectrum from Fritz. E.g. a design where a program uses two levels of search
tree (similar to Botvinik's Pioneer scheme), one high level tree guided by the
chess knowledge and plans, which is a sparse & deep human-like tree, and the
numerous other low level trees produced by a fast alpha-beta searcher to "hop,"
as it were, between the nodes of the high level tree (the low level seracher is
additionally constrained in each such "bridging" search by the specific
goals/targets set by the higher level code for that high level node serving as
the root of that low level search).





This page took 0.01 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.