Computer Chess Club Archives



Subject: Re: SSE2 bit[64] * byte[64] dot product

Author: Gerd Isenberg

Date: 11:23:30 07/20/04



On July 20, 2004 at 11:54:44, Anthony Cozzie wrote:

>On July 19, 2004 at 13:21:23, Gerd Isenberg wrote:
>
>>On July 19, 2004 at 10:59:05, Anthony Cozzie wrote:
>>
>>>On July 18, 2004 at 15:33:33, Gerd Isenberg wrote:
>>>
>>>>
>>>>>I am guessing something like 50 cycles?  Really not that bad . . . probably
>>>>>close to the speed of a scan over attack tables.
>>>>>
>>>>>anthony
>>>>
>>>>14.45ns on a 2.2GHz Athlon64, ~32 cycles now.
>>>>
>>>>Some minor changes, byte vector values (weights) 0..63, therefore only one
>>>>psadbw, no movd but two pextrw, final add with gp. Computed bit masks in two
>>>>xmm-registers (0x02:0x01). Some better instruction scheduling.
>>>>
>>>>Gerd
>>>
>>>If you would ship me the new code I would be much obliged (acozzie@verizon.net).
>>> I am concentrating on parallel code right now, but once that is done I am going
>>>to do some serious work on my eval.  I want to prove Vincent wrong that a good
>>>eval cannot be done with bitboards :)
>>>
>>>32 cycles is _really_ good.  I think that on average rotated bitboard attack
>>>generation is 20 cycles, so that is 50 cycles / piece / mobility = 500 cycles
>>>(~250 ns on my computer) for all pieces, which is really not bad.  In fact, 32
>>>cycles is not that much slower than popcount!
>>>
>>>anthony
>>
>>In contradiction to what I said before, the pandn with constant data seems
>>faster. To precompute the constants one may try
>>
>>    pxor    xmm4, xmm4  ; 0
>>    pcmpeqb xmm5, xmm5  ; -1
>>    psubb   xmm4, xmm5  ; 1
>>    pslldq  xmm5, 8     ; -1:0
>>    psubb   xmm4, xmm5  ; 0x02:0x01
>>    pslld   xmm4, 2     ; 0x08:0x04
>>    pslld   xmm4, 2     ; 0x20:0x10
>>    pslld   xmm4, 2     ; 0x80:0x40
>>
>>Here the routine, 24 SSE2 instructions, one direct path, 23 double:
>>12 * 4 cycles  = 48
>>12 * 2 cycles  = 24
>>
>>Cheers,
>>Gerd
>>
>>
>>typedef unsigned __int64 BitBoard;
>>typedef unsigned char WType;
>>
>>#define CACHE_LINE  64
>>#define CACHE_ALIGN __declspec(align(CACHE_LINE))
>>
>>
>>/*
>>  bit[64]*byte[64] dotProduct for AMD64 in 32-bit mode
>>  (C) CCC 2004
>>
>>  implemented by Gerd Isenberg
>>  initiated by a post from Tony Werten -
>>  and a reply from Bob Hyatt about Cray's vector instructions
>>
>>  parameters:
>>   bb, a 64-bit word as a boolean vector of bits
>>   weights63, a vector of 64 byte values (weights) in the range of 0..63
>>
>>  Note! the weight array should be rotated by 90 degrees:
>>   if both vectors are considered as two-dimensional 8*8 arrays,
>>   the rotation takes place by exchanging row and column.
>>  For all eight rows and columns:
>>                   _
>>      dotProduct = >  (bb[row][col] ? weights63[col][row] : 0)
>>                   -
>>  return: dot product
>>
>>*/
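
As a plain-C cross-check of the comment above - a scalar sketch only, with
dotProductRef an illustrative name:

    typedef unsigned __int64 BitBoard;

    /* scalar reference: both vectors viewed as 8*8 arrays, the weight
       array transposed - rows and columns exchanged, as noted above */
    unsigned int dotProductRef(BitBoard bb, const unsigned char weights63[64])
    {
        unsigned int sum = 0;
        int row, col;
        for (row = 0; row < 8; ++row)
            for (col = 0; col < 8; ++col)
                if (bb & ((BitBoard)1 << (row * 8 + col)))
                    sum += weights63[col * 8 + row];
        return sum;
    }
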
>
>OK, I have gone through the opteron optimization manual.  From what I can tell:
>
>The opteron can do 2 64-bit memory accesses each cycle (or 1 SSE2 memory
>access).  The cache has a latency of 3 cycles.
>
>Most SSE2 instructions are split into 2 64-bit operations and thus occupy the
>full pipeline (throughput is 1, not 2).  They have a latency of 2.
>
>cycle estimate
>  1      movq	xmm0, [bb]     ; get the bb in lower
>  2      lea	eax, [MaskConsts]
>  4      punpcklqdq xmm0, xmm0  ; .. and upper half
>  3      pxor	xmm7, xmm7     ; zero
>  6      movdqa	xmm1, xmm0
>  7      movdqa	xmm2, xmm0
>  8      movdqa	xmm3, xmm0
>  9      pandn	xmm0, [eax]    ; mask the consecutive bits
> 10      pandn	xmm1, [eax+16]
> 11      pandn	xmm2, [eax+32]
> 12      pandn	xmm3, [eax+48]
> 13      mov	eax, [weights63]
> 13      pcmpeqb	xmm0, xmm7     ; build the -1|0 byte masks
> 14      pcmpeqb	xmm1, xmm7
> 15      pcmpeqb	xmm2, xmm7
> 16      pcmpeqb	xmm3, xmm7
> 17      pand	xmm0, [eax]    ; and them with the weights
> 18      pand	xmm1, [eax+16]
> 19      pand	xmm2, [eax+32]
> 20      pand	xmm3, [eax+48]
> 22      paddb	xmm0, xmm1     ; add all 64 bytes, take care of overflow
> 24      paddb	xmm0, xmm2     ;  therefore max weight of 63 is safe
> 26      paddb	xmm0, xmm3
> 28      psadbw	xmm0, xmm7     ; horizontal add 2 * 8 byte
> 32      pextrw	edx, xmm0, 4   ; extract both intermediate sums to gp
> 33      pextrw	eax, xmm0, 0
> 37      add	eax, edx       ; final add in gp
>
>So my estimate is 37 cycles.  Of course, it might be possible to get a little
>bit of overlap near the end if you had other work to do.

Thanks for pointing out the cycle arithmetic.

For instance, does an independent trailing instruction start at cycle 32 too,
or even earlier? And what about the parallel use of the "three" 64-bit ALUs,
fadd, fmul and load/store?

A dumb loop test of 10**9 runs, inlined with a constant parameter, finishes in
less than 11 seconds (11ns/run at 2.2GHz, ~24 cycles); the asm output looks OK.
Your version takes about 13.75 seconds (~30 cycles).

Btw, I found accumulating three times into one target register slightly faster
than two independent paddbs plus one dependent one. The intermediate result may
stay near the ALUs...
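
For reference, roughly the same routine with SSE2 intrinsics - a sketch under
my assumptions about the mask layout; weights63 must be 16-byte aligned, as
with CACHE_ALIGN above:

    #include <emmintrin.h>

    typedef unsigned __int64 BitBoard;

    /* the four bit-select masks from the listing: register k tests bit 2k
       of every board byte in its lower half, bit 2k+1 in its upper half */
    static const __declspec(align(16)) unsigned char maskConsts[4][16] = {
        { 1, 1, 1, 1, 1, 1, 1, 1,   2,  2,  2,  2,  2,  2,  2,  2},
        { 4, 4, 4, 4, 4, 4, 4, 4,   8,  8,  8,  8,  8,  8,  8,  8},
        {16,16,16,16,16,16,16,16,  32, 32, 32, 32, 32, 32, 32, 32},
        {64,64,64,64,64,64,64,64, 128,128,128,128,128,128,128,128}
    };

    int dotProduct(BitBoard bb, const unsigned char weights63[64])
    {
        const __m128i *mask = (const __m128i *)maskConsts;
        const __m128i *wgt  = (const __m128i *)weights63;
        __m128i zero = _mm_setzero_si128();
        __m128i b    = _mm_loadl_epi64((const __m128i *)&bb);
        __m128i x0, x1, x2, x3;
        b  = _mm_unpacklo_epi64(b, b);          /* bb in both halves   */
        x0 = _mm_andnot_si128(b, mask[0]);      /* ~bb & mask          */
        x1 = _mm_andnot_si128(b, mask[1]);
        x2 = _mm_andnot_si128(b, mask[2]);
        x3 = _mm_andnot_si128(b, mask[3]);
        x0 = _mm_cmpeq_epi8(x0, zero);          /* 0xFF where bit set  */
        x1 = _mm_cmpeq_epi8(x1, zero);
        x2 = _mm_cmpeq_epi8(x2, zero);
        x3 = _mm_cmpeq_epi8(x3, zero);
        x0 = _mm_and_si128(x0, wgt[0]);         /* select the weights  */
        x1 = _mm_and_si128(x1, wgt[1]);
        x2 = _mm_and_si128(x2, wgt[2]);
        x3 = _mm_and_si128(x3, wgt[3]);
        x0 = _mm_add_epi8(x0, x1);              /* 4 * 63 = 252 < 256, */
        x0 = _mm_add_epi8(x0, x2);              /*  no byte overflow   */
        x0 = _mm_add_epi8(x0, x3);
        x0 = _mm_sad_epu8(x0, zero);            /* two 8-byte sums     */
        return _mm_extract_epi16(x0, 0) + _mm_extract_epi16(x0, 4);
    }
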

>
>What do you think of the following?
>
>cycle estimate
>  1  movd       xmm0, [bb]
>  1  movd       xmm2, [bb+4]
>  4  punpcklbw  xmm0, xmm0
>  5  punpcklbw  xmm2, xmm2
>  6  punpcklbw  xmm0, xmm0
>  7  punpcklbw  xmm2, xmm2
>  8  movdqa     xmm1, xmm0
>  9  movdqa     xmm3, xmm2
> 10  punpcklbw  xmm0, xmm0
> 11  punpcklbw  xmm2, xmm2
> 12  punpckhbw  xmm1, xmm1
> 13  punpckhbw  xmm3, xmm3
>
>This costs 5 more cycles, but the array is no longer rotated (I think).  This
>means that it is much easier to create the weight array dynamically.  I don't
>know whether it is a win or not.


What a fine shuffling sequence, looks great!
Another advantage might be the use of only one mask constant:

0x8040201008040201:0x8040201008040201

which may be preloaded into one xmm-register.
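
A quick intrinsics transcription of that idea - an untested sketch, dotProduct2
an illustrative name, weights 16-byte aligned and in plain board order:

    #include <emmintrin.h>

    typedef unsigned __int64 BitBoard;

    /* unrotated variant: replicate every board byte into eight bytes, then
       test its bits 0..7 against 0x8040201008040201 per 64-bit half;
       max weight 63 keeps the three paddbs overflow-free as before */
    int dotProduct2(BitBoard bb, const unsigned char weights[64])
    {
        const __m128i *wgt = (const __m128i *)weights;
        __m128i zero = _mm_setzero_si128();
        __m128i mask = _mm_set_epi8((char)0x80,0x40,0x20,0x10,0x08,0x04,0x02,0x01,
                                    (char)0x80,0x40,0x20,0x10,0x08,0x04,0x02,0x01);
        __m128i lo = _mm_cvtsi32_si128((int)(bb & 0xffffffff)); /* bytes 0..3 */
        __m128i hi = _mm_cvtsi32_si128((int)(bb >> 32));        /* bytes 4..7 */
        __m128i x0, x1, x2, x3;
        lo = _mm_unpacklo_epi8(lo, lo);         /* b0 b0 b1 b1 b2 b2 b3 b3 */
        hi = _mm_unpacklo_epi8(hi, hi);
        lo = _mm_unpacklo_epi8(lo, lo);         /* b0 x4 .. b3 x4          */
        hi = _mm_unpacklo_epi8(hi, hi);
        x0 = _mm_unpacklo_epi8(lo, lo);         /* b0 x8 : b1 x8           */
        x1 = _mm_unpackhi_epi8(lo, lo);         /* b2 x8 : b3 x8           */
        x2 = _mm_unpacklo_epi8(hi, hi);         /* b4 x8 : b5 x8           */
        x3 = _mm_unpackhi_epi8(hi, hi);         /* b6 x8 : b7 x8           */
        x0 = _mm_cmpeq_epi8(_mm_and_si128(x0, mask), mask); /* 0xFF if set */
        x1 = _mm_cmpeq_epi8(_mm_and_si128(x1, mask), mask);
        x2 = _mm_cmpeq_epi8(_mm_and_si128(x2, mask), mask);
        x3 = _mm_cmpeq_epi8(_mm_and_si128(x3, mask), mask);
        x0 = _mm_and_si128(x0, wgt[0]);         /* weights in board order  */
        x1 = _mm_and_si128(x1, wgt[1]);
        x2 = _mm_and_si128(x2, wgt[2]);
        x3 = _mm_and_si128(x3, wgt[3]);
        x0 = _mm_add_epi8(x0, x1);
        x0 = _mm_add_epi8(x0, x2);
        x0 = _mm_add_epi8(x0, x3);
        x0 = _mm_sad_epu8(x0, zero);
        return _mm_extract_epi16(x0, 0) + _mm_extract_epi16(x0, 4);
    }
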

For dynamic weight arrays, I would "paddb" some precalculated arrays, e.g.
indexed by the opponent's and/or own (king) square, already inside four
xmm-registers. Or incrementally updated weight tables for w/b pieces of all
kinds?
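
Something along these lines perhaps - hypothetical table names and layout,
just to show the idea:

    #include <emmintrin.h>

    /* hypothetical: build the four weight registers on the fly by adding a
       base table and a king-square-indexed table; the per-square sums must
       stay <= 63 so the paddb accumulation in the dot product cannot wrap */
    static void buildWeights(__m128i out[4], const __m128i base[4],
                             const __m128i kingTbl[64][4], int oppKingSq)
    {
        out[0] = _mm_add_epi8(base[0], kingTbl[oppKingSq][0]);
        out[1] = _mm_add_epi8(base[1], kingTbl[oppKingSq][1]);
        out[2] = _mm_add_epi8(base[2], kingTbl[oppKingSq][2]);
        out[3] = _mm_add_epi8(base[3], kingTbl[oppKingSq][3]);
    }
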

>
>Lastly, are you sure this wouldn't all be faster in the integer pipes?  From
>what I can see, the integer pipes have as much or more computational power as
>the "vector" units.


I guess you have to work hard with 64-bit gp-registers to get better
performance than with the slow SSE2 instructions, but teach me the lesson -
I just saw your code - maybe some unrolling?

With the newest MSC compiler for AMD64 and SSE2 intrinsics, the compiler may
schedule some other, possibly independent gp-instructions in between. This, and
the hope for better SSE2 performance (four times?) in the future, makes me
consider this dotProduct routine for my next approach, which I am already
working on.


>If the Opteron could execute 4 SSE2 instructions in
>parallel it would be different, but . . .
>
>anthony


Hope 128-bit xmm-ALUs come true soon, and maybe some more...
At least before I buy my next box ;-)

Cheers,
Gerd


