# Computer Chess Club Archives

## Messages

### Subject: A SIMD idea, eg. Piece/Gain of a capture target

Author: Gerd Isenberg

Date: 10:20:18 01/21/04

```Hi, all interested in bitboards and SIMD-instructions ;-)

With bitboards the usual, obvious way to get the piece/gain of a (single)
capture target, is bitscan and two lookups:

square = bitScan(t);
piece  = board[square];
gain   = pieceValue[piece];

If you (like me) intend to avoid square-centric at all (bitscan and incremental
updated board array), you may either traverse disjoint target sets, or you may
do something like this:

// precondition: t is single member of opposite pieces (but no king)

if      ( t & Pawns )  {piece = pawn;    gain = ValPawn;}
else if ( t & Rooks )  {piece = rook;    gain = ValRook;}
else if ( t & Knights ){piece = knight;  gain = ValKnight;}
else if ( t & Bishops ){piece = bishop;  gain = ValBishop;}
else /* queen */       {piece = queen;   gain = ValQueen;}

For this purpose or similar branch structures, i figured out following
SSE2-"algo" based on "parallel compare", "parallel and" in conjunction with
"parallel Sum of Absolute Differences", i like to introduce here:

----------------------------------------------------------------------------
For all targets (eg. once per capture generation):

To perform a single 32-bit compare between masked high and low-board,
it is favorably to have four piece-bitboards
in a high/low separated (xmm-register) representation:

xmm6 ::= <ql,rl:bl,kl>  -  LowBoard of queens,rooks,bishops and knights
xmm7 ::= <qh,rh:bh,kh>  - HighBoard of queens,rooks,bishops and knights

Some piece related 16-bit constants (see below):

xmm4 ::= <MQ16,MR16,MB16,MK16: 4,3,2,1>  // 16-bit word wise
xmm5 ::= <SQ16,SR16,SB16,SK16: 4,3,2,1>

----------------------------------------------------------------------------
For each capture target:

xmm0 ::= <0:target> == <0,0:th,tl> - single target bit

// unpack the target
PUNPCKLDQ  xmm0, xmm0 ; <th,th:tl,tl>
MOVDQA     xmm1, xmm0
PUNPCKLQDQ xmm0, xmm0 ; <tl,tl:tl,tl>
PUNPCKHQDQ xmm1, xmm1 ; <th,th:th,th>

// perform the ands with piece sets
PAND       xmm0, xmm6 ; <tl,tl:tl,tl> &= <ql,rl:bl,kl>
PAND       xmm1, xmm7 ; <th,th:th,th> &= <qh,rh:bh,kh>

// compare high-low, which is equal for (three or four) empty sets
PCMPEQD    xmm0, xmm1 ; <!Q32,!R32:!B32,!K32>
PACKSSDW   xmm0, xmm0 ; <!Q16,!R16,!B16,!K16:!Q16,!R16,!B16,!K16>

// mask and sum absolute byte difference to get the final values
PAND       xmm0, xmm4 ; piece specific mask
PSADBW     xmm0, xmm5 ; < gain/2 : pieceCode >

----------------------------------------------------------------------------

The <4,3,2,1> constants in the lower quad word of xmm4,xmm5
leave the following codes:

0 - Pawn, 1 - Knight, 2 - Bishop, 3 - Rook, 4 - Queen

To get the (half of) value of a piece, it is necessary to have some other
constants, to performe the "and" and "Packed Sum of Absolute Differences of
Bytes" trick:

Mi16 = 256*Mi8 + Mi8; a sequence of two bytes
Si16 = 256*Si8 + Si8; a sequence of two bytes

Mi8 = (2*ValuePiece[i] -   ValuePawn + 3) div 8
Si8 = (4*ValuePiece[i] - 3*ValuePawn + 7) div 16

Because the max of absolute byte difference is 255, and "only" two byte
differences are added per piece, the maximum value is 510, therefore gain/2 is
computed, with value of a queen <= 1020.

Similar code is possible to consider SEE-like values for that target,

Some further comments on the instructions: Unfortunately there is one 4-cycle
vector path instruction on AMD64, PACKSSDW. I found no fast replacement for
that. All other instructions are "double direct path" instructions, four bytes
each. PSADBW takes 4 cycles, all other 2 cycles.

The unpack sequence to build the <th,th:th,th> <tl,tl:tl,tl> vector may be
computed with two PSHUFD instead of four unpack/mov-instructions, but PSHUFD is
4-cycle vector path too.

Comments, suggestions and improvements welcome of course.

Cheers,
Gerd

```