Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: 0x88 findings

Author: h.g.muller

Date: 15:16:04 02/22/06

Go up one level in this thread


By the way, this discussion brings to my attention that it is possible to make a
very fast move generation that is virtually branchless, on the philosophy:
"branches that are not in your code can't be mis-predicted".

The donside of this philosophy is that every iteration of the move-generator
loop would always (conditionally) have to do all things that normally the three
loops of your move generator (over pieces, directions and squares along the ray)
do: get the next piece, take the new direction step-vector from a table,
initialize the to-square to the position of the piece, step the to-square, look
hat is on it. But if you realize that every mispredicted branch costs you 11
cycles, and that you can do 3 intructions per cycle, this is a very small price
indeed.

The principal data structure would in addition to a board, which contains piece
numbers, and a piece list, which contains square numbers, would be a list of
rays. For every ray of every piece there would be a separate entry in this list,
i.e. a full complement of pieces would have 8x3 rays for the Pawns, 4x8 rays for
KQNN and 4x4 rays for RRBB, 72 in total for each side. The single loop of the
move generator would step through this list, thus servicing all direction of all
pieces. The ray list would be a linked list, every entry specifying its
successor, so that captured pieces can be easily removed.

For ray number i, piece[i] would contain the piece (its number) to which the ray
belongs, vec[i] the step on the board for the direction the ray runs in, and
mask[i] would contain status bits that tell things like if this is a ray of a
sliding piece, if captures along this ray are allowed, or if normal moves are
allowed. All this is initialized once and for all during startup. next[i] would
contain the link to the next ray, initialied as ray+1, and updated by DO_MOVE /
UNDO_MOVE on captures to remove the rays of the no-longer-existing piece.

The piece numbers would be chosen such that the color of the piece can be easily
deduced from the number, like 0x10-0x1f for white, and 0x20-0x2f for black.
Empty squares on the board are indicated by 0x40, inaccessable guard-band
squares by 0x80. All variables are single bytes (unsigned char).

The move generation relies heavily on the exitence of conditional move
istructions CMOV, that allow you to do a conditional assignment 'if(A)B=C;'
without using a branch instruction. These instructions perform the assignment or
not, based on the condition, but don't alter the flow of control, and thus don't
involve branch prediction.

The move generator would look like this:

ray = 0;                       /* first ray */
new = 1;                       /* is new one */
do{
    p = piece[ray];            /* piece to which this ray belongs */
    from[msp] = pos[p];        /* put move on move stack */
    if(new) y = pos[p];        /* from-square */
    step = vec[ray];           /* board step for this ray */
    y += step;                 /* step to-square along ray */
    to[msp] = y;               /* put move on move stack */
    invalid = (side_to_move ^ board[y]) & mask[ray];
                               /* check if contents of to-square compatible with
this direction */
    if(!invalid) msp++;        /* accept move */
    new = invalid | mask[ray]&NON_SLIDING;
                               /* end of old ray reached if move was already
invalid, or non-sliding piece */
    if(new) ray = next[ray];   /* conditionally go to next ray */
}while(ray);                   /* until all rays done */
/* All normal moves now stacked */

The loop needs 5 loads from memory tables (3 ray list 1 board, and 1 piece
list), 2 stores to the move stack, and 3 condtional moves. If I count right
there are 10 ALU instructions if we can fit everything in the registers. At the
end there will be a 'take-always' branch, which is only mispredicted once, at
the end of the loop. Most current machines can do 1 or 2 memory ops per clock,
and 2 or 3 ALU ops. So they should be able to run an assembler-optimized version
of this loop in 5-7 CPU clocks. At 2 GHz that is 3,5 nsec per move, or ~300
million moves per second. Ray list, piece list and board will aways reside in
cache, together they occupy about 1KB.



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.