Computer Chess Club Archives

Subject: Re: SSE2 bit[64] * byte[64] dot product

Author: Anthony Cozzie

Date: 17:59:41 07/17/04

On July 17, 2004 at 16:14:10, Gerd Isenberg wrote:

>Got it - but count the cycles by yourself.
>
>With some 90-degree rotation trick of the 64-byte weight array, the task of a
>dot product is rather easy with AMD64 SSE2 instructions. 31 instructions, only
>134 bytes for that routine in 32-bit mode. Most of the time four independent
>instruction chains.
>
>Considering the bitboard inside the lower half of a 128-bit xmm-register, a
>punpcklqdq xmm,xmm instruction (Unpack and Interleave Low Quadwords) copies it
>into the upper half.
>
>With three further 128-bit movdqa we have now eight copies of the bitboard in
>four xmm-registers. Now the eight bitboards are anded with:
>
>	0x0202020202020202:0x0101010101010101
>	0x0808080808080808:0x0404040404040404
>	0x2020202020202020:0x1010101010101010
>	0x8080808080808080:0x4040404040404040
>
>to mask each bit bytewise, that is only one bit per byte.
>
>The 90-degree rotation of the resulting byte vector and therefore the weight
>array too, is because 0x0101010101010101 masks a1,a2,a3...a8, while
>0x8080808080808080 masks h1..h8.
>
>The next step is to use the "Packed Compare Equal Bytes" instruction with zero
>to get 0 and -1 masks. Since equal zero produces 0xff and unequal zero produces
>0x00, the result has to be complemented for our purpose. Fortunately there is a
>pandn-instruction: masking the complemented bitboard makes a set bit compare
>equal to zero. The result is a 64-byte {0|-1}-vector with
>a1,a2,a3...a8,b1...h8.
>
>The rest was already mentioned, four further ands with the appropriate mapped
>weight vector, then adding all bytes together to one 16 bit word.
>
>Cheers,
>Gerd
>
>
>//=========================================================
>// some sample source, I tried with MSC6 and inline assembly
>//=========================================================
>
>#include <stdio.h>
>
>typedef unsigned __int64 BitBoard;
>typedef unsigned char WType;
>
>#define CACHE_LINE  64
>#define CACHE_ALIGN __declspec(align(CACHE_LINE))
>
>
>struct SMaskConsts
>{
>    BitBoard C01;    BitBoard C02;
>    BitBoard C04;    BitBoard C08;
>    BitBoard C10;    BitBoard C20;
>    BitBoard C40;    BitBoard C80;
>};
>
>const SMaskConsts CACHE_ALIGN MaskConsts =
>{
>    0x0101010101010101,	0x0202020202020202,
>    0x0404040404040404,	0x0808080808080808,
>    0x1010101010101010,	0x2020202020202020,
>    0x4040404040404040,	0x8080808080808080,
>};
>
>int dotProduct(BitBoard bb, WType* weights)
>{
>    __asm
>    {
>        movq	xmm0, [bb]      ; get the bb in lower
>        lea	eax,  [MaskConsts]
>        punpcklqdq xmm0, xmm0   ; .. and upper half
>        pxor	xmm7, xmm7	; zero for compare and byte diff
>        movdqa	xmm1, xmm0
>        movdqa	xmm2, xmm0
>        movdqa	xmm3, xmm0
>        pandn	xmm0, [eax]     ; ~bb & mask: zero byte where the bit is set
>        pandn	xmm1, [eax+16]
>        pandn	xmm2, [eax+32]
>        pandn	xmm3, [eax+48]
>        mov	eax,  [weights]
>        pcmpeqb	xmm0, xmm7      ; build the -1|0 byte masks
>        pcmpeqb	xmm1, xmm7
>        pcmpeqb	xmm2, xmm7
>        pcmpeqb	xmm3, xmm7
>
>        pand	xmm0, [eax]     ; and them with the weights
>        pand	xmm1, [eax+16]
>        pand	xmm2, [eax+32]
>        pand	xmm3, [eax+48]
>        psadbw	xmm0, xmm7      ; horizontal adds
>        psadbw	xmm1, xmm7      ;  bytes to word
>        psadbw	xmm2, xmm7
>        psadbw	xmm3, xmm7
>        paddd	xmm0, xmm1      ; vertical add
>        paddd	xmm2, xmm3
>        paddd	xmm2, xmm0
>        movdqa	xmm0, xmm2      ; finally add high and low quad words
>        punpckhqdq xmm0, xmm2   ; like shift right
>        paddd	xmm0, xmm2
>        movd	eax, xmm0       ; expensive to move in gp-register!
>    }   // no return statement: MSVC returns the value left in eax
>}
>
>int main(int argc, char* argv[])
>{
>  WType CACHE_ALIGN weights[64] =
>  {
>    1, 2, 3, 4, 4, 3, 2, 1, // a1..a8
>    2, 3, 4, 5, 5, 4, 3, 2, // b1..b8
>    3, 4, 5, 6, 6, 5, 4, 3, // c1..c8
>    4, 5, 6, 7, 7, 6, 5, 4, // d1..d8
>    4, 5, 6, 7, 7, 6, 5, 4, // e1..e8
>    3, 4, 5, 6, 6, 5, 4, 3, // f1..f8
>    2, 3, 4, 5, 5, 4, 3, 2, // g1..g8
>    1, 2, 3, 4, 4, 3, 2, 1, // h1..h8
>  };
>
>  int bonus = dotProduct(0xfedcba9876543210, weights);
>  return bonus;
>}
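
The byte order that the rotation produces can be checked against a plain C loop. The sketch below is not part of the quoted routine. It assumes the usual square numbering (a1 = bit 0, b1 = bit 1, ..., h8 = bit 63), and the name dot_product_ref is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable reference for the rotated bit[64] * byte[64] dot product.
   Assumption: bit index = 8*rank + file, so a1 = bit 0 and h8 = bit 63.
   weights[] is laid out a1,a2,...,a8,b1,...,h8 as in the quoted post. */
int dot_product_ref(uint64_t bb, const unsigned char *weights)
{
    int sum = 0;
    assert(weights != NULL);
    for (int i = 0; i < 64; ++i) {
        int file = i / 8;   /* selects the mask: 0x01.. (a) up to 0x80.. (h) */
        int rank = i % 8;   /* selects the byte of the bitboard */
        if ((bb >> (8 * rank + file)) & 1u)
            sum += weights[i];
    }
    return sum;
}
```

For the sample's weight table and 0xfedcba9876543210 this loop returns 140. That table is symmetric, so it would give the same value without the rotation; an asymmetric table is needed to catch a wrong byte order.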

I am guessing something like 50 cycles?  Really not that bad . . . probably
close to the speed of a scan over attack tables.

anthony
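
To count cycles with a current compiler, the same sequence can be written with SSE2 intrinsics instead of MSVC inline assembly. This is a sketch under assumptions, not Gerd's code: the name dot_product_sse2 is hypothetical, the masks are built in a loop rather than loaded from MaskConsts, the weights are loaded unaligned, and a scalar loop stands in when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* weights[] is laid out a1,a2,...,a8,b1,...,h8 (rotated), 64 bytes. */
int dot_product_sse2(uint64_t bb, const unsigned char *weights)
{
    assert(weights != NULL);
#ifdef HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    /* bitboard in both 64-bit halves (movq + punpcklqdq in the asm) */
    const __m128i x = _mm_set1_epi64x((long long)bb);
    __m128i sum = zero;
    for (int k = 0; k < 4; ++k) {
        /* low half tests bit 2k of every byte, high half bit 2k+1 */
        __m128i mask = _mm_set_epi64x(
            (long long)(0x0202020202020202ULL << (2 * k)),
            (long long)(0x0101010101010101ULL << (2 * k)));
        __m128i t = _mm_andnot_si128(x, mask);   /* pandn: ~bb & mask */
        t = _mm_cmpeq_epi8(t, zero);             /* 0xFF where the bit is set */
        __m128i w = _mm_loadu_si128((const __m128i *)(weights + 16 * k));
        t = _mm_and_si128(t, w);                 /* keep weights of set bits */
        sum = _mm_add_epi32(sum, _mm_sad_epu8(t, zero)); /* psadbw + paddd */
    }
    /* add the high quadword sum to the low one */
    sum = _mm_add_epi32(sum, _mm_unpackhi_epi64(sum, sum));
    return _mm_cvtsi128_si32(sum);
#else
    /* scalar stand-in with the same byte order */
    int total = 0;
    for (int i = 0; i < 64; ++i)
        if ((bb >> (8 * (i % 8) + i / 8)) & 1u)
            total += weights[i];
    return total;
#endif
}
```

The loop keeps one accumulator for brevity; the assembly above keeps four independent chains, so a compiler may or may not schedule this version as well.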


Last modified: Thu, 15 Apr 21 08:11:13 -0700
