Author: Anthony Cozzie
Date: 17:59:41 07/17/04
On July 17, 2004 at 16:14:10, Gerd Isenberg wrote:

>Got it - but count the cycles by yourself.
>
>With some 90-degree rotation trick of the 64-byte weight array, the task of a
>dot product is rather easy with AMD64 SSE2 instructions: 31 instructions and
>only 134 bytes for the routine in 32-bit mode, most of the time in four
>independent instruction chains.
>
>Considering the bitboard inside the lower half of a 128-bit xmm register, a
>punpcklqdq xmm,xmm instruction (Unpack and Interleave Low Quadwords) copies it
>into the upper half.
>
>With three further 128-bit movdqa we now have eight copies of the bitboard in
>four xmm registers. The eight bitboard copies are anded with:
>
>  0x0202020202020202:0x0101010101010101
>  0x0808080808080808:0x0404040404040404
>  0x2020202020202020:0x1010101010101010
>  0x8080808080808080:0x4040404040404040
>
>to mask each bit bytewise, that is, only one bit per byte.
>
>The resulting byte vector, and therefore the weight array too, is rotated by
>90 degrees, because 0x0101010101010101 masks a1,a2,a3...a8 while
>0x8080808080808080 masks h1..h8.
>
>The next step is to use the "Packed Compare Equal Bytes" instruction against
>zero to get 0 and -1 byte masks. Since an equal-zero byte produces 0xff and a
>nonzero byte produces 0x00, the comparison result is the complement of what we
>want. Fortunately there is a pandn instruction: complementing the bitboard
>copies before masking makes pcmpeqb yield 0xff exactly where the original bit
>was set. The result is a 64-byte {0|-1} vector ordered a1,a2,a3...a8,b1...h8.
>
>The rest was already mentioned: four further ands with the appropriately
>mapped weight vectors, then adding all the bytes together into one 16-bit
>word.
>
>Cheers,
>Gerd
>
>
>//=========================================================
>// some sample source, tried with MSC6 and inline assembly
>//=========================================================
>
>#include <stdio.h>
>
>typedef unsigned __int64 BitBoard;
>typedef unsigned char WType;
>
>#define CACHE_LINE  64
>#define CACHE_ALIGN __declspec(align(CACHE_LINE))
>
>
>struct SMaskConsts
>{
>    BitBoard C01; BitBoard C02;
>    BitBoard C04; BitBoard C08;
>    BitBoard C10; BitBoard C20;
>    BitBoard C40; BitBoard C80;
>};
>
>const SMaskConsts CACHE_ALIGN MaskConsts =
>{
>    0x0101010101010101, 0x0202020202020202,
>    0x0404040404040404, 0x0808080808080808,
>    0x1010101010101010, 0x2020202020202020,
>    0x4040404040404040, 0x8080808080808080,
>};
>
>int dotProduct(BitBoard bb, WType* weights)
>{
>    __asm
>    {
>        movq       xmm0, [bb]      ; get the bb in the lower half ...
>        lea        eax, [MaskConsts]
>        punpcklqdq xmm0, xmm0      ; ... and in the upper half
>        pxor       xmm7, xmm7      ; zero for compare and byte diff
>        movdqa     xmm1, xmm0
>        movdqa     xmm2, xmm0
>        movdqa     xmm3, xmm0
>        pandn      xmm0, [eax]     ; mask the consecutive bits
>        pandn      xmm1, [eax+16]
>        pandn      xmm2, [eax+32]
>        pandn      xmm3, [eax+48]
>        mov        eax, [weights]
>        pcmpeqb    xmm0, xmm7      ; build the -1|0 byte masks
>        pcmpeqb    xmm1, xmm7
>        pcmpeqb    xmm2, xmm7
>        pcmpeqb    xmm3, xmm7
>
>        pand       xmm0, [eax]     ; and them with the weights
>        pand       xmm1, [eax+16]
>        pand       xmm2, [eax+32]
>        pand       xmm3, [eax+48]
>        psadbw     xmm0, xmm7      ; horizontal adds, bytes to word
>        psadbw     xmm1, xmm7
>        psadbw     xmm2, xmm7
>        psadbw     xmm3, xmm7
>        paddd      xmm0, xmm1      ; vertical adds
>        paddd      xmm2, xmm3
>        paddd      xmm2, xmm0
>        movdqa     xmm0, xmm2      ; finally add high and low quadwords
>        punpckhqdq xmm0, xmm2      ; like a shift right
>        paddd      xmm0, xmm2
>        movd       eax, xmm0       ; expensive to move into a gp register!
>    }                              ; result is returned in eax
>}
>
>int main(int argc, char* argv[])
>{
>    WType CACHE_ALIGN weights[64] =
>    {
>        1, 2, 3, 4, 4, 3, 2, 1,  // a1..a8
>        2, 3, 4, 5, 5, 4, 3, 2,  // b1..b8
>        3, 4, 5, 6, 6, 5, 4, 3,  // c1..c8
>        4, 5, 6, 7, 7, 6, 5, 4,  // d1..d8
>        4, 5, 6, 7, 7, 6, 5, 4,  // e1..e8
>        3, 4, 5, 6, 6, 5, 4, 3,  // f1..f8
>        2, 3, 4, 5, 5, 4, 3, 2,  // g1..g8
>        1, 2, 3, 4, 4, 3, 2, 1,  // h1..h8
>    };
>
>    int bonus = dotProduct(0xfedcba9876543210, weights);
>    return bonus;
>}
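For reference, the same dot product in plain scalar C. This is a sketch of my
own, not part of Gerd's post; it assumes the square mapping he describes
(bit 0 = a1, bit 8 = a2, so a square's bit index is rank*8 + file, while the
rotated weight array is indexed by file*8 + rank):

/* Scalar reference for the SSE2 routine above (hypothetical helper).
   Bitboard bit index = rank*8 + file; weight index = file*8 + rank. */
int dotProductRef(BitBoard bb, const WType* weights)
{
    int sum = 0;
    for (int file = 0; file < 8; ++file)
        for (int rank = 0; rank < 8; ++rank)
            if (bb & ((BitBoard)1 << (rank * 8 + file)))
                sum += weights[file * 8 + rank];
    return sum;
}

Handy for asserting dotProduct(bb, weights) == dotProductRef(bb, weights) over
random bitboards.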
I am guessing something like 50 cycles? Really not that bad . . . probably
close to the speed of a scan over attack tables.

anthony
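For anyone without MSC6-style inline assembly, the same instruction sequence
maps onto SSE2 compiler intrinsics roughly as below. This is an untested
sketch of mine (the function and mask-array names are made up), assuming
<emmintrin.h> and a compiler newer than MSC6:

#include <emmintrin.h>   /* SSE2 intrinsics */

typedef unsigned __int64 BitBoard;   /* unsigned long long outside MSVC */
typedef unsigned char    WType;

static const BitBoard masks[8] = {
    0x0101010101010101, 0x0202020202020202,
    0x0404040404040404, 0x0808080808080808,
    0x1010101010101010, 0x2020202020202020,
    0x4040404040404040, 0x8080808080808080,
};

int dotProductIntrin(BitBoard bb, const WType* weights)
{
    __m128i zero = _mm_setzero_si128();
    __m128i acc  = zero;
    __m128i v    = _mm_loadl_epi64((const __m128i*)&bb); /* movq: bb in low quad */
    v = _mm_unpacklo_epi64(v, v);                        /* punpcklqdq: duplicate */

    for (int i = 0; i < 4; ++i) {
        /* pandn: ~bb & mask, one bit per byte */
        __m128i t = _mm_andnot_si128(v,
                        _mm_loadu_si128((const __m128i*)&masks[2 * i]));
        t = _mm_cmpeq_epi8(t, zero);          /* pcmpeqb: 0xff where bit was set */
        t = _mm_and_si128(t,                  /* pand: select the weights */
                _mm_loadu_si128((const __m128i*)&weights[16 * i]));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(t, zero)); /* psadbw, then add */
    }
    /* finally add the high and low quadword partial sums */
    acc = _mm_add_epi64(acc, _mm_unpackhi_epi64(acc, acc));
    return _mm_cvtsi128_si32(acc);
}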