Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: SSE2 bit[64] * byte[64] dot product

Author: Anthony Cozzie

Date: 08:54:44 07/20/04

Go up one level in this thread


On July 19, 2004 at 13:21:23, Gerd Isenberg wrote:

>On July 19, 2004 at 10:59:05, Anthony Cozzie wrote:
>
>>On July 18, 2004 at 15:33:33, Gerd Isenberg wrote:
>>
>>>
>>>>I am guessing something like 50 cycles?  Really not that bad . . . probably
>>>>close to the speed of a scan over attack tables.
>>>>
>>>>anthony
>>>
>>>14.45ns on a 2.2GHz Athlon64, ~32 cycles now.
>>>
>>>Some minor changes, byte vector values (weights) 0..63, therefore only one
>>>psadbw, no movd but two pextrw, final add with gp. Computed bit masks in two
>>>xmm-registers (0x02:0x01). Some better instruction scheduling.
>>>
>>>Gerd
>>
>>If you would ship me the new code I would be much obliged (acozzie@verizon.net).
>> I am concentrating on parallel code right now, but once that is done I am going
>>to do some serious work on my eval.  I want to prove Vincent wrong that a good
>>eval cannot be done with bitboards :)
>>
>>32 cycles is _really_ good.  I think that on average rotated bitboard attack
>>generation is 20 cycles, so that is 50 cycles / piece / mobility = 500 cycles
>>(~250 ns on my computer) for all pieces, which is really not bad.  In fact, 32
>>cycles is not that much slower than popcount!
>>
>>anthony
>
>In contradiction to what a said before, the pandn with constant data seems
>faster. To precompute the constants one may try
>
>    pxor    xmm4, xmm4  ; 0
>    pcmpeqb xmm5, xmm5  ; -1
>    psubb   xmm4, xmm5  ; 1
>    pslldq  xmm5, 8     ; -1:0
>    psubb   xmm4, xmm5  ; 0x02:0x01
>    pslld   xmm4, 2     ; 0x08:0x04
>    pslld   xmm4, 2     ; 0x20:0x10
>    pslld   xmm4, 2     ; 0x80:0x40
>
>Here the routine, 24 SSE2 instructions, one direct path, 23 double:
>12 * 4 cycles  = 48
>12 * 2 cycles  = 24
>
>Cheers,
>Gerd
>
>
>typedef unsigned __int64 BitBoard;
>typedef unsigned char WType;
>
>#define CACHE_LINE  64
>#define CACHE_ALIGN __declspec(align(CACHE_LINE))
>
>
>/*
>  bit[64]*byte[64] dotProduct for AMD64 in 32-bit mode
>  (C) CCC 2004
>
>  implemented by Gerd Isenberg
>  initiated by a post from Tony Werten -
>  and a reply from Bob Hyatt about Cray's vector instructions
>
>  parameters:
>   bb, a 64-bit word as a boolean vector of bits
>   weights63, a vector of 64 byte values (weights) in the range of 0..63
>
>  Note! the weight array should be rotated by 90 degree
>   if both vectors are consideres as two dimensional 8*8 arrays,
>   the rotation takes place by exchanging row and column.
>  For all eight rows,columns:
>                   _
>      dotProduct = >  (bb[row][col] ? weights63[col][row] : 0)
>                   -
>  return: dot product
>
>*/

OK, I have gone through the opteron optimization manual.  From what I can tell:

The opteron can do 2 64-bit memory accesses each cycle (or 1 SSE2 memory
access).  The cache has a latency of 3 cycles.

Most SSE2 instructions are split into 2 64-bit operations and thus occupy the
full pipeline (throughput is 1, not 2).  They have a latency of 2.

cycle estimate
  1      movq	xmm0, [bb]     ; get the bb in lower
  2      lea	eax, [MaskConsts]
  4      punpcklqdq xmm0, xmm0  ; .. and upper half
  3      pxor	xmm7, xmm7     ; zero
  6      movdqa	xmm1, xmm0
  7      movdqa	xmm2, xmm0
  8      movdqa	xmm3, xmm0
  9      pandn	xmm0, [eax]    ; mask the consecutive bits
 10      pandn	xmm1, [eax+16]
 11      pandn	xmm2, [eax+32]
 12      pandn	xmm3, [eax+48]
 13      mov	eax, [weights63]
 13      pcmpeqb	xmm0, xmm7     ; build the -1|0 byte masks
 14      pcmpeqb	xmm1, xmm7
 15      pcmpeqb	xmm2, xmm7
 16      pcmpeqb	xmm3, xmm7
 17      pand	xmm0, [eax]    ; and them with the weights
 18      pand	xmm1, [eax+16]
 19      pand	xmm2, [eax+32]
 20      pand	xmm3, [eax+48]
 22      paddb	xmm0, xmm1     ; add all 64 bytes, take care of overflow
 24      paddb	xmm2, xmm3     ;  therefore max weight of 63 is safe
 26      paddb	xmm0, xmm2
 28      psadbw	xmm0, xmm7     ; horizontal add 2 * 8 byte
 32      pextrw	edx, xmm0, 4   ; extract both intermediate sums to gp
 33      pextrw	eax, xmm0, 0
 37      add	eax, edx       ; final add in gp

So my estimate is 37 cycles.  Of course, it might be possible to get a little
bit of overlap near the end if you had other work to do.

What do you think of the following?

cycle estimate
  1  movl       xmm0, [bb]
  1  movl       xmm2, [bb+1]
  4  punpcklbw  xmm0, xmm0
  5  punpcklbw  xmm2, xmm2
  6  punpcklbw  xmm0, xmm0
  7  punpcklbw  xmm2, xmm2
  8  movdqa     xmm1, xmm0
  9  movdqa     xmm3, xmm2
 10  punpcklbw  xmm0, xmm0
 11  punpcklbw  xmm2, xmm2
 12  punpckhbw  xmm1, xmm1
 13  punpckhbw  xmm3, xmm3

This costs 5 more cycles, but the array is no longer rotated (I think).  This
means that it is much easier to create the weight array dynamically.  I don't
know whether it is a win or not.

Lastly, are you sure this wouldn't all be faster in the integer pipes?  From
what I can see, the integer pipes have as much or more computational power as
the "vector" units.  If the Opteron could execute 4 SSE2 instructions in
parallel it would be different, but . . .

anthony



This page took 0.01 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.