Author: Gerd Isenberg
Date: 10:20:18 01/21/04
Hi, all interested in bitboards and SIMD-instructions ;-) With bitboards the usual, obvious way to get the piece/gain of a (single) capture target, is bitscan and two lookups: square = bitScan(t); piece = board[square]; gain = pieceValue[piece]; If you (like me) intend to avoid square-centric at all (bitscan and incremental updated board array), you may either traverse disjoint target sets, or you may do something like this: // precondition: t is single member of opposite pieces (but no king) if ( t & Pawns ) {piece = pawn; gain = ValPawn;} else if ( t & Rooks ) {piece = rook; gain = ValRook;} else if ( t & Knights ){piece = knight; gain = ValKnight;} else if ( t & Bishops ){piece = bishop; gain = ValBishop;} else /* queen */ {piece = queen; gain = ValQueen;} For this purpose or similar branch structures, i figured out following SSE2-"algo" based on "parallel compare", "parallel and" in conjunction with "parallel Sum of Absolute Differences", i like to introduce here: ---------------------------------------------------------------------------- For all targets (eg. once per capture generation): To perform a single 32-bit compare between masked high and low-board, it is favorably to have four piece-bitboards in a high/low separated (xmm-register) representation: xmm6 ::= <ql,rl:bl,kl> - LowBoard of queens,rooks,bishops and knights xmm7 ::= <qh,rh:bh,kh> - HighBoard of queens,rooks,bishops and knights Some piece related 16-bit constants (see below): xmm4 ::= <MQ16,MR16,MB16,MK16: 4,3,2,1> // 16-bit word wise xmm5 ::= <SQ16,SR16,SB16,SK16: 4,3,2,1> ---------------------------------------------------------------------------- For each capture target: xmm0 ::= <0:target> == <0,0:th,tl> - single target bit // unpack the target PUNPCKLDQ xmm0, xmm0 ; <th,th:tl,tl> MOVDQA xmm1, xmm0 PUNPCKLQDQ xmm0, xmm0 ; <tl,tl:tl,tl> PUNPCKHQDQ xmm1, xmm1 ; <th,th:th,th> // perform the ands with piece sets PAND xmm0, xmm6 ; <tl,tl:tl,tl> &= <ql,rl:bl,kl> PAND xmm1, xmm7 ; <th,th:th,th> &= <qh,rh:bh,kh> // compare high-low, which is equal for (three or four) empty sets PCMPEQD xmm0, xmm1 ; <!Q32,!R32:!B32,!K32> PACKSSDW xmm0, xmm0 ; <!Q16,!R16,!B16,!K16:!Q16,!R16,!B16,!K16> // mask and sum absolute byte difference to get the final values PAND xmm0, xmm4 ; piece specific mask PSADBW xmm0, xmm5 ; < gain/2 : pieceCode > ---------------------------------------------------------------------------- The <4,3,2,1> constants in the lower quad word of xmm4,xmm5 leave the following codes: 0 - Pawn, 1 - Knight, 2 - Bishop, 3 - Rook, 4 - Queen To get the (half of) value of a piece, it is necessary to have some other constants, to performe the "and" and "Packed Sum of Absolute Differences of Bytes" trick: Mi16 = 256*Mi8 + Mi8; a sequence of two bytes Si16 = 256*Si8 + Si8; a sequence of two bytes Mi8 = (2*ValuePiece[i] - ValuePawn + 3) div 8 Si8 = (4*ValuePiece[i] - 3*ValuePawn + 7) div 16 Because the max of absolute byte difference is 255, and "only" two byte differences are added per piece, the maximum value is 510, therefore gain/2 is computed, with value of a queen <= 1020. Similar code is possible to consider SEE-like values for that target, Some further comments on the instructions: Unfortunately there is one 4-cycle vector path instruction on AMD64, PACKSSDW. I found no fast replacement for that. All other instructions are "double direct path" instructions, four bytes each. PSADBW takes 4 cycles, all other 2 cycles. The unpack sequence to build the <th,th:th,th> <tl,tl:tl,tl> vector may be computed with two PSHUFD instead of four unpack/mov-instructions, but PSHUFD is 4-cycle vector path too. Comments, suggestions and improvements welcome of course. Cheers, Gerd
This page took 0.01 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.