Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: SSE2 bit[64] * byte[64] dot product

Author: Anthony Cozzie

Date: 17:59:41 07/17/04

Go up one level in this thread


On July 17, 2004 at 16:14:10, Gerd Isenberg wrote:

>Got it - but count the cycles by yourself.
>
>With some 90-degree rotation trick of the 64-byte weight array, the task of a
>dot product is rather easy with AMD64 SSE2 instructions. 31 instructions, only
>134 bytes for that routine in 32-bit mode. Most of the time four independent
>instruction chains.
>
>Considering the bitboard inside the lower half of a 128-byte xmm-register, a
>punpcklqdq xmm,xmm instruction (Unpack and Interleave Low Quadwords) copies it
>into the upper half.
>
>With three further 128-bit movdqa we have now eight copies of the bitboard in
>four xmm-registers. Now the eight bitboards are anded with:
>
>	0x0202020202020202:0x0101010101010101
>	0x0808080808080808:0x0404040404040404
>	0x2020202020202020:0x1010101010101010
>	0x8080808080808080:0x4040404040404040
>
>to mask each bit bytewise, that is only one bit per byte.
>
>The 90-degree rotation of the resulting byte vector and therefore the weight
>array too, is because 0x0101010101010101 masks a1,a2,a3...a8, while
>0x8080808080808080 masks h1..h8.
>
>The next step is to use the "Packed Compare Equal Bytes" instruction with zero
>to get 0 and -1 masks. Since equal zero produces 0xff and unequal zero produces
>0x00, the result is complemented for our purpose. Fortunately there is a
>pandn-instruction. The result is an 64-byte {0|-1}-vector with
>a1,a2,a3...a8,b1...h8.
>
>The rest was already mentioned, four further ands with the appropriate mapped
>weight vector, then adding all bytes together to one 16 bit word.
>
>Cheers,
>Gerd
>
>
>//=========================================================
>// some sample source, i tried with MSC6 and inline assemby
>//=========================================================
>
>#include <stdio.h>
>
>typedef unsigned __int64 BitBoard;
>typedef unsigned char WType;
>
>#define CACHE_LINE  64
>#define CACHE_ALIGN __declspec(align(CACHE_LINE))
>
>
>struct SMaskConsts
>{
>    BitBoard C01;    BitBoard C02;
>    BitBoard C04;    BitBoard C08;
>    BitBoard C10;    BitBoard C20;
>    BitBoard C40;    BitBoard C80;
>};
>
>const SMaskConsts CACHE_ALIGN MaskConsts =
>{
>    0x0101010101010101,	0x0202020202020202,
>    0x0404040404040404,	0x0808080808080808,
>    0x1010101010101010,	0x2020202020202020,
>    0x4040404040404040,	0x8080808080808080,
>};
>
>int dotProduct(BitBoard bb, WType* weights)
>{
>    __asm
>    {
>        movq	xmm0, [bb]      ; get the bb in lower
>        lea	eax,  [MaskConsts]
>        punpcklqdq xmm0, xmm0   ; .. and upper half
>        pxor	xmm7, xmm7	; zero for compare and byte diff
>        movdqa	xmm1, xmm0
>        movdqa	xmm2, xmm0
>        movdqa	xmm3, xmm0
>        pandn	xmm0, [eax]     ; mask the consecutive bits
>        pandn	xmm1, [eax+16]
>        pandn	xmm2, [eax+32]
>        pandn	xmm3, [eax+48]
>        mov	eax,  [weights]
>        pcmpeqb	xmm0, xmm7      ; build the -1|0 byte masks
>        pcmpeqb	xmm1, xmm7
>        pcmpeqb	xmm2, xmm7
>        pcmpeqb	xmm3, xmm7
>
>        pand	xmm0, [eax]     ; and them with the weights
>        pand	xmm1, [eax+16]
>        pand	xmm2, [eax+32]
>        pand	xmm3, [eax+48]
>        psadbw	xmm0, xmm7      ; horizontal adds
>        psadbw	xmm1, xmm7      ;  bytes to word
>        psadbw	xmm2, xmm7
>        psadbw	xmm3, xmm7
>        paddd	xmm0, xmm1      ; vertical add
>        paddd	xmm2, xmm3
>        paddd	xmm2, xmm0
>        movdqa	xmm0, xmm2      ; finally add high and low quad words
>        punpckhqdq xmm0, xmm2   ; like shift right
>        paddd	xmm0, xmm2
>        movd	eax, xmm0       ; expensive to move in gp-register!
>    }
>}
>
>int main(int argc, char* argv[])
>{
>  WType CACHE_ALIGN weights[64] =
>  {
>    1, 2, 3, 4, 4, 3, 2, 1, // a1..a8
>    2, 3, 4, 5, 5, 4, 3, 2, // b1..b8
>    3, 4, 5, 6, 6, 5, 4, 3, // c1..c8
>    4, 5, 6, 7, 7, 6, 5, 4, // d1..d8
>    4, 5, 6, 7, 7, 6, 5, 4, // e1..e8
>    3, 4, 5, 6, 6, 5, 4, 3, // f1..f8
>    2, 3, 4, 5, 5, 4, 3, 2, // g1..g8
>    1, 2, 3, 4, 4, 3, 2, 1, // h1..h8
>  };
>
>  int bonus = dotProduct(0xfedcba9876543210, weights);
>  return bonus;
>}

I am guessing something like 50 cycles?  Really not that bad . . . probably
close to the speed of a scan over attack tables.

anthony



This page took 0.03 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.