Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: High performance move generation

Author: Bas Hamstra

Date: 18:09:13 11/19/99

Go up one level in this thread


On November 19, 1999 at 11:37:50, Ratko V Tomic wrote:

>>> instruction (which are more expensive than straight fall through).
>>> In assembly language one can easily code this so that
>>> (piece,source_square) pair of values retreives a pointer to a code
>>> label which has the right number of directions and steps each direction
>>> coded to execute without jumps or loops and to change the direction
>>> (on default end of direction), all in a single block of a fall through
>> >code.
>>
>> I think you can as well do this in C with the good old goto?
>
>C doesn't have a FORTRAN style "computed goto." The closest thing it has is a
>switch statement, but its shortcomings vs the assembly labguage are:

>1) You have to code each "case" as separate statement. What do you do if you
>have, say, 12*64=768 labels (or even more than 12 piece types if you add castle
>capable & ep subtypes)? Type several thousands by hand? In assembler you make an
>iterating (by count) macros of the handful basic types (12 or so) and the
>dispatch table which allows jumping into the right place inside the repeated
>macro block which ahs no explicit labels.

Good point. Like Dan I wrote a metaprogram :) IE a codegenerator. But it took a
lot of cutting and pasting too and the results were somewhat disappointing. My
incremental pseudo-attacks are updated like this:

XORQueen()

switch(Square)
{ case 11:
  asm XOR [ESI+1], EBX
  asm XOR [ESI+2], EBX
  .
  .
  .
  case 12:
}

It is the fastest way I can think of, but you seem more like an assembly expert
than me.

>2) The switch statement will at best (even if you somehow type in [using C
>macros and editor's cut&paste] few hundreds or thousands cases) perform doubly
>indirect jump to the case statement, i.e. a (square,piece) pair will first
>retrieve a number which would be used as an index into the array (built
>implicitly by the compiler, without guarantee it will do so at all, and without
>any control from you) of code addresses for different case statements (provided
>it doesn't overflow its "switch" statement compiling buffers on switch with few
>thousands case statements). Our assembly language table is indexed directly by
>the piece type and square (fetched as two adjecent bytes without shifting the
>bit patterns togehter) addressing thus a table of 12*256 = 3072 entries, each
>being a direct address of a place in the repeated macro block where to jump; the
>whole switchover (including termination at the end of the list) is then:
>
>  LODSW            ;get next (square,piece) pair of bytes
>  JMP  [EBX+EAX*4] ;dispatch to "this piece on this square" routine
>                   ;ebx has an address of the 3072 dword table
>
>If square can be represented (which it can) in the piece-square array as a
>4*square (i.e. as 0, 4, 8, 12,..) then the dispatch table can be reduced to the
>3072 byte table (instead of 3072 dwords) and the jump above could be of the
>quicker form JMP [EBX+EAX] (i.e. without index scaling opcodes).

>3) This large C switch statement, will produce code which checks the range of
>the switch argument, to verify it falls within the dispatch array boundaries
>(since this argument is a variable, picked out of an array, via computed index
>itself picked from another array) and compiler doesn't know its inherent limits.
>While on some compilers (of some versions) you can use pragmas to help it avoid
>some checking, the code loses portability (e.g. on the same platform, say
>Intel's processor, an assembly routine above would be portable between compilers
>or versions of compilers, as a linked OBJ module or DLL, while the C code with
>version specific pragmas would not be.)
>
>4) The register variable allocation across such large switch statement, with
>fall through and branching subcases, would suffer in the compiled code. The
>assembly code could have a well planned policy on register usage (especially the
>byte registers, which are more numerous) based on knowing which data matter,
>which says the same and which can change.
>
>5) If one needs to separate captures from non-captures (as one normally does),
>the move generator is the most natural and economical place to do it. In that
>case the assembly code wouldn't need the two pointers for the two output lists
>(thus wasting an extra register variable, assuming it can even get to the point
>where it would pick these pointers for register variables), but could push one
>type of the moves on the stack (using single instruction PUSH ECX) and the other
>type into an array (via single isntruction STOSW or STOSD). They could be
>processed there (on the stack), if stack was temporarily redirected into the
>right array, or in the worst case, they could be bulk-copied (which is quick) to
>the final destinaton, before the stack gets reset.

Again like Dan, I use 2 separate functions to generate caps and noncaps. So this
argument isn't entirely convincing to me.

>6) Since the entire "dispatch to the next piece" assembly code is quite short, 2
> assembly instructions, one could place dispatch to next at the end of each
>(piece,square) routine, reducing the extra jump to common dispatch point for
>each (piece,square) item. C switch can't do that at all (i.e. one would need to
>replicate the entire switch inside each "case" statement just to avoid this
>extra jump for one (piece,square) item from the list, then after the 2nd item it
>would have to take the extra jump to the top level switch).
>
>
>Overall, for this kind of task, a good assembly code would be 4-5 times faster
>than the best C code/compiler could do, it would be much easier to code (less
>repetitive and error prone, due to the hundreds manually entered "case" labels
>in C) and it would be more portable across the compilers and languages on the
>same CPU type (like the dominant Intel CPUs, with the dominant Win32 DLL API).

On the whole very instructive and convincing. I thought there wasn't much you
could do in asm that you couldn't in C. But apparently there are some neat
things. Not that I am going to rewrite much in asm. I'd like to but argh, no
time...

>> Anyway I finally became convinced that the importancy of the speed of
>> the movegenerator is modest, unless you want to make a Fritz.
>
>Well, it could benefit in another extreme, too, at the opposite end of the
>spectrum from Fritz. E.g. a design where a program uses two levels of search
>tree (similar to Botvinik's Pioneer scheme), one high level tree guided by the
>chess knowledge and plans, which is a sparse & deep human-like tree, and the
>numerous other low level trees produced by a fast alpha-beta searcher to "hop,"
>as it were, between the nodes of the high level tree (the low level seracher is
>additionally constrained in each such "bridging" search by the specific
>goals/targets set by the higher level code for that high level node serving as
>the root of that low level search).

Nah :) Don't believe in that. Let's face it, brute force has won the race. And
what you are describing has holes. Still interesting, of course.


Regards,
Bas Hamstra.






This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.