Computer Chess Club Archives

Subject: Re: SSE2 bit[64] * byte[64] dot product

Author: Gerd Isenberg

Date: 10:21:23 07/19/04

On July 19, 2004 at 10:59:05, Anthony Cozzie wrote:

>On July 18, 2004 at 15:33:33, Gerd Isenberg wrote:
>
>>
>>>I am guessing something like 50 cycles?  Really not that bad . . . probably
>>>close to the speed of a scan over attack tables.
>>>
>>>anthony
>>
>>14.45ns on a 2.2GHz Athlon64, ~32 cycles now.
>>
>>Some minor changes, byte vector values (weights) 0..63, therefore only one
>>psadbw, no movd but two pextrw, final add with gp. Computed bit masks in two
>>xmm-registers (0x02:0x01). Some better instruction scheduling.
>>
>>Gerd
>
>If you would ship me the new code I would be much obliged (acozzie@verizon.net).
>I am concentrating on parallel code right now, but once that is done I am going
>to do some serious work on my eval.  I want to prove Vincent wrong that a good
>eval cannot be done with bitboards :)
>
>32 cycles is _really_ good.  I think that on average rotated bitboard attack
>generation is 20 cycles, so that is 50 cycles / piece / mobility = 500 cycles
>(~250 ns on my computer) for all pieces, which is really not bad.  In fact, 32
>cycles is not that much slower than popcount!
>
>anthony

Contrary to what I said before, the pandn with constant data from memory seems
faster. To compute the constants in registers instead, one may try

    pxor    xmm4, xmm4  ; 0
    pcmpeqb xmm5, xmm5  ; -1
    psubb   xmm4, xmm5  ; 1
    pslldq  xmm5, 8     ; -1:0
    psubb   xmm4, xmm5  ; 0x02:0x01
    pslld   xmm4, 2     ; 0x08:0x04
    pslld   xmm4, 2     ; 0x20:0x10
    pslld   xmm4, 2     ; 0x80:0x40
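
For readers without an assembler at hand, the same sequence can be sketched with SSE2 intrinsics. This is only an illustration, not the code that was timed: the function name compute_bit_masks and the idea of storing every intermediate pair to an array are assumptions made here so the values can be inspected. Each pair is stored low half first, so the eight entries should line up with a table of 0x01..0x80 byte masks.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper (not from the post): computes the four 128-bit
   mask pairs in registers and stores them as eight 64-bit words,
   low half first: masks[0] = 0x0101..., masks[1] = 0x0202..., etc. */
void compute_bit_masks(uint64_t masks[8])
{
    __m128i zero = _mm_setzero_si128();         /* pxor    : 0         */
    __m128i ones = _mm_cmpeq_epi8(zero, zero);  /* pcmpeqb : -1        */
    __m128i m    = _mm_sub_epi8(zero, ones);    /* psubb   : 0x01:0x01 */
    ones = _mm_slli_si128(ones, 8);             /* pslldq  : -1:0      */
    m    = _mm_sub_epi8(m, ones);               /* psubb   : 0x02:0x01 */

    _mm_storeu_si128((__m128i *)masks, m);
    for (int i = 1; i < 4; ++i) {
        /* pslld 2: no set bit crosses a byte boundary for these values */
        m = _mm_slli_epi32(m, 2);               /* 0x08:0x04, 0x20:0x10, 0x80:0x40 */
        _mm_storeu_si128((__m128i *)(masks + 2 * i), m);
    }
}
```

An optimizing compiler may fold all of this into constants loaded from memory, so the sketch shows the arithmetic rather than any particular instruction sequence.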

Here is the routine: 24 SSE2 instructions, one direct path, 23 double:
12 * 4 cycles  = 48
12 * 2 cycles  = 24

Cheers,
Gerd


typedef unsigned __int64 BitBoard;
typedef unsigned char WType;

#define CACHE_LINE  64
#define CACHE_ALIGN __declspec(align(CACHE_LINE))


/*
  bit[64]*byte[64] dotProduct for AMD64 in 32-bit mode
  (C) CCC 2004

  implemented by Gerd Isenberg
  initiated by a post from Tony Werten -
  and a reply from Bob Hyatt about Cray's vector instructions

  parameters:
   bb, a 64-bit word as a boolean vector of bits
   weights63, a vector of 64 byte values (weights) in the range of 0..63

  Note! The weight array must be rotated by 90 degrees:
   if both vectors are considered as two-dimensional 8*8 arrays,
   the rotation takes place by exchanging row and column
   (i.e. the weights are transposed).
  For all eight rows,columns:
                   _
      dotProduct = >  (bb[row][col] ? weights63[col][row] : 0)
                   -
  return: dot product

*/
int dotProduct(BitBoard bb, WType* weights63)
{
    static const BitBoard CACHE_ALIGN MaskConsts[8] =
    {
        0x0101010101010101, 0x0202020202020202,
        0x0404040404040404, 0x0808080808080808,
        0x1010101010101010, 0x2020202020202020,
        0x4040404040404040, 0x8080808080808080,
    };
    __asm
    {
        movq	xmm0, [bb]     ; get the bb in lower
        lea	eax, [MaskConsts]
        punpcklqdq xmm0, xmm0  ; .. and upper half
        pxor	xmm7, xmm7     ; zero
        movdqa	xmm1, xmm0
        movdqa	xmm2, xmm0
        movdqa	xmm3, xmm0
        pandn	xmm0, [eax]    ; mask the consecutive bits
        pandn	xmm1, [eax+16]
        pcmpeqb	xmm0, xmm7     ; build the -1|0 byte masks
        pcmpeqb	xmm1, xmm7
        pandn	xmm2, [eax+32]
        pandn	xmm3, [eax+48]
        mov	eax, [weights63]
        pcmpeqb	xmm2, xmm7
        pcmpeqb	xmm3, xmm7
        pand	xmm0, [eax]    ; and them with the weights
        pand	xmm1, [eax+16]
        pand	xmm2, [eax+32]
        pand	xmm3, [eax+48]
        paddb	xmm0, xmm1     ; add all 64 bytes, take care of overflow
        paddb	xmm0, xmm2     ;  therefore max weight of 63 is safe
        paddb	xmm0, xmm3
        psadbw	xmm0, xmm7     ; horizontal add 2 * 8 byte
        pextrw	edx, xmm0, 4   ; extract both intermediate sums to gp
        pextrw	eax, xmm0, 0
        add	eax, edx       ; final add in gp, result is returned in eax
    }
}
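
Since the inline assembly above only builds with a 32-bit Microsoft compiler, here is a rough sketch of the same idea in plain C and in SSE2 intrinsics. The names dot_product_ref and dot_product_sse2 are made up for this illustration. It assumes bit (8*row + col) of bb pairs with weights[8*col + row], which is how I read the layout note in the comment, and it uses unaligned loads, so it does not reproduce the timings quoted in this thread.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference (hypothetical name). Assumption: bit (8*row + col)
   of bb is paired with weights[8*col + row], i.e. row and column of
   the weight array are exchanged. */
int dot_product_ref(uint64_t bb, const unsigned char weights[64])
{
    int sum = 0;
    for (int row = 0; row < 8; ++row)
        for (int col = 0; col < 8; ++col)
            if ((bb >> (8 * row + col)) & 1)
                sum += weights[8 * col + row];
    return sum;
}

/* SSE2 sketch (hypothetical name) following the assembly step by step.
   Weights must be in the range 0..63 so the byte sums cannot overflow. */
int dot_product_sse2(uint64_t bb, const unsigned char weights[64])
{
    const __m128i zero = _mm_setzero_si128();
    /* bb in lower and upper half (movq + punpcklqdq) */
    const __m128i x = _mm_set1_epi64x((long long)bb);
    __m128i sum = zero;

    for (int i = 0; i < 4; ++i) {
        /* low half tests bit 2*i of each byte, high half bit 2*i+1 */
        const __m128i mask = _mm_set_epi64x(
            (long long)(0x0202020202020202ULL << (2 * i)),
            (long long)(0x0101010101010101ULL << (2 * i)));
        /* pandn + pcmpeqb: 0xFF in every byte whose tested bit is set */
        const __m128i sel =
            _mm_cmpeq_epi8(_mm_andnot_si128(x, mask), zero);
        /* unaligned load, so weights needs no 16-byte alignment */
        const __m128i w =
            _mm_loadu_si128((const __m128i *)(weights + 16 * i));
        /* pand + paddb: at most 4 * 63 = 252 per byte */
        sum = _mm_add_epi8(sum, _mm_and_si128(sel, w));
    }
    /* psadbw against zero: one 16-bit sum per 8-byte half */
    sum = _mm_sad_epu8(sum, zero);
    /* pextrw of words 0 and 4, final add in a general-purpose register */
    return _mm_extract_epi16(sum, 0) + _mm_extract_epi16(sum, 4);
}
```

Comparing dot_product_sse2 against dot_product_ref on a few random bitboards is a cheap way to check that a weight table really has been laid out with rows and columns exchanged.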



Last modified: Thu, 15 Apr 21 08:11:13 -0700
