Computer Chess Club Archives




Subject: Re: SSE2 bit[64] * byte[64] dot product

Author: Gerd Isenberg

Date: 10:21:23 07/19/04


On July 19, 2004 at 10:59:05, Anthony Cozzie wrote:

>On July 18, 2004 at 15:33:33, Gerd Isenberg wrote:
>>>I am guessing something like 50 cycles?  Really not that bad . . . probably
>>>close to the speed of a scan over attack tables.
>>14.45ns on a 2.2GHz Athlon64, ~32 cycles now.
>>Some minor changes, byte vector values (weights) 0..63, therefore only one
>>psadbw, no movd but two pextrw, final add with gp. Computed bit masks in two
>>xmm-registers (0x02:0x01). Some better instruction scheduling.
>If you would ship me the new code I would be much obliged.
>I am concentrating on parallel code right now, but once that is done I am going
>to do some serious work on my eval.  I want to prove Vincent wrong that a good
>eval cannot be done with bitboards :)
>32 cycles is _really_ good.  I think that on average rotated bitboard attack
>generation is 20 cycles, so that is 50 cycles / piece / mobility = 500 cycles
>(~250 ns on my computer) for all pieces, which is really not bad.  In fact, 32
>cycles is not that much slower than popcount!

Contrary to what I said before, the pandn with constant data seems
faster. To precompute the constants one may try:

    pxor    xmm4, xmm4  ; 0
    pcmpeqb xmm5, xmm5  ; -1
    psubb   xmm4, xmm5  ; 1
    pslldq  xmm5, 8     ; -1:0
    psubb   xmm4, xmm5  ; 0x02:0x01
    pslld   xmm4, 2     ; 0x08:0x04
    pslld   xmm4, 2     ; 0x20:0x10
    pslld   xmm4, 2     ; 0x80:0x40
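
For readers who prefer intrinsics, the same constant generation can be sketched in C. This is only an illustration under assumptions: the helper name makeMaskConsts and the output array are hypothetical and not part of the posted routine, and the register values are stored to memory purely so they can be inspected. Note that the assembly sequence above shifts xmm4 in place, so each pair has to be used or copied before the next pslld.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: writes the four mask pairs 0x02:0x01, 0x08:0x04,
   0x20:0x10 and 0x80:0x40 to out[0..7], lower qword first. */
static void makeMaskConsts(uint64_t out[8])
{
    int i;
    __m128i zero = _mm_setzero_si128();          /* pxor:    0 */
    __m128i ones = _mm_cmpeq_epi8(zero, zero);   /* pcmpeqb: -1 in every byte */
    __m128i m    = _mm_sub_epi8(zero, ones);     /* psubb:   0 - (-1) = 1 in every byte */
    __m128i hi   = _mm_slli_si128(ones, 8);      /* pslldq:  -1 in upper qword, 0 in lower */
    m = _mm_sub_epi8(m, hi);                     /* psubb:   0x02 upper, 0x01 lower */
    for (i = 0; i < 4; ++i) {
        _mm_storeu_si128((__m128i *)(out + 2 * i), m);
        m = _mm_slli_epi32(m, 2);                /* pslld:   move on to the next bit pair */
    }
}
```

The dword shift by two never carries a set bit across a byte boundary for the pairs that get stored, which is why pslld can be used although SSE2 has no byte-wise shift.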

Here is the routine: 24 SSE2 instructions, one direct path, 23 double:
12 * 4 cycles  = 48
12 * 2 cycles  = 24


typedef unsigned __int64 BitBoard;
typedef unsigned char WType;

#define CACHE_LINE  64
#define CACHE_ALIGN __declspec(align(CACHE_LINE))

/*
  bit[64]*byte[64] dotProduct for AMD64 in 32-bit mode
  (C) CCC 2004

  implemented by Gerd Isenberg
  initiated by a post from Tony Werten -
  and a reply from Bob Hyatt about Cray's vector instructions

   bb, a 64-bit word as a boolean vector of bits
   weights63, a vector of 64 byte values (weights) in the range of 0..63

  Note! the weight array should be rotated by 90 degrees:
   if both vectors are considered as two-dimensional 8*8 arrays,
   the rotation takes place by exchanging row and column.
  For all eight rows, columns:
      dotProduct = sum (bb[row][col] ? weights63[col][row] : 0)
  return: dot product
*/

int dotProduct(BitBoard bb, WType* weights63)
{
    static const BitBoard CACHE_ALIGN MaskConsts[8] =
    {
        0x0101010101010101, 0x0202020202020202,
        0x0404040404040404, 0x0808080808080808,
        0x1010101010101010, 0x2020202020202020,
        0x4040404040404040, 0x8080808080808080,
    };
    __asm
    {
        movq    xmm0, [bb]     ; get the bb in lower
        lea     eax, [MaskConsts]
        punpcklqdq xmm0, xmm0  ; .. and upper half
        pxor    xmm7, xmm7     ; zero
        movdqa  xmm1, xmm0
        movdqa  xmm2, xmm0
        movdqa  xmm3, xmm0
        pandn   xmm0, [eax]    ; mask the consecutive bits
        pandn   xmm1, [eax+16]
        pcmpeqb xmm0, xmm7     ; build the -1|0 byte masks
        pcmpeqb xmm1, xmm7
        pandn   xmm2, [eax+32]
        pandn   xmm3, [eax+48]
        mov     eax, [weights63]
        pcmpeqb xmm2, xmm7
        pcmpeqb xmm3, xmm7
        pand    xmm0, [eax]    ; and them with the weights (16-byte aligned)
        pand    xmm1, [eax+16]
        pand    xmm2, [eax+32]
        pand    xmm3, [eax+48]
        paddb   xmm0, xmm1     ; add all 64 bytes, take care of overflow
        paddb   xmm0, xmm2     ;  therefore max weight of 63 is safe
        paddb   xmm0, xmm3
        psadbw  xmm0, xmm7     ; horizontal add 2 * 8 byte
        pextrw  edx, xmm0, 4   ; extract both intermediate sums to gp
        pextrw  eax, xmm0, 0
        add     eax, edx       ; final add in gp
    }
    /* the result is returned in eax */
}
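
To check the index mapping and the overflow bound, the routine can be compared against a plain scalar loop. The following is a minimal sketch, not the posted code: dotProductRef and dotProductSSE2 are hypothetical names, the second function is an intrinsics port of the assembly above, and it uses unaligned loads so that the weight array needs no special alignment here. It assumes, as stated above, that all weights are in 0..63, so four bytes added with paddb stay at or below 4 * 63 = 252.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference: bit (8*row + col) of bb selects weights63[8*col + row],
   i.e. the weight array is indexed with row and column exchanged. */
static int dotProductRef(uint64_t bb, const unsigned char *weights63)
{
    int sum = 0, row, col;
    for (row = 0; row < 8; ++row)
        for (col = 0; col < 8; ++col)
            if ((bb >> (8 * row + col)) & 1)
                sum += weights63[8 * col + row];
    return sum;
}

/* Intrinsics port of the assembly routine (hypothetical name). */
static int dotProductSSE2(uint64_t bb, const unsigned char *weights63)
{
    static const uint64_t maskConsts[8] = {
        0x0101010101010101ULL, 0x0202020202020202ULL,
        0x0404040404040404ULL, 0x0808080808080808ULL,
        0x1010101010101010ULL, 0x2020202020202020ULL,
        0x4040404040404040ULL, 0x8080808080808080ULL
    };
    int k;
    __m128i zero = _mm_setzero_si128();
    __m128i b    = _mm_loadl_epi64((const __m128i *)&bb);  /* movq: bb in lower half */
    __m128i sum  = zero;
    b = _mm_unpacklo_epi64(b, b);                           /* punpcklqdq: and upper half */
    for (k = 0; k < 4; ++k) {
        __m128i mask = _mm_loadu_si128((const __m128i *)(maskConsts + 2 * k));
        __m128i w    = _mm_loadu_si128((const __m128i *)(weights63 + 16 * k));
        /* pandn + pcmpeqb: byte is 0xFF where the masked bit of bb is set */
        __m128i sel  = _mm_cmpeq_epi8(_mm_andnot_si128(b, mask), zero);
        /* pand + paddb: at most 4 * 63 = 252 per byte, so no overflow */
        sum = _mm_add_epi8(sum, _mm_and_si128(sel, w));
    }
    sum = _mm_sad_epu8(sum, zero);                          /* psadbw: two 8-byte sums */
    return _mm_extract_epi16(sum, 0) + _mm_extract_epi16(sum, 4);  /* pextrw + add */
}
```

With weights63[i] = i, a bitboard with only bit 1 set (row 0, column 1) yields weights63[8], which makes the exchange of row and column visible.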


Last modified: Thu, 07 Jul 11 08:48:38 -0700
