Computer Chess Club Archives




Subject: Re: SSE2 bit[64] * byte[64] dot product

Author: Gerd Isenberg

Date: 05:38:48 07/22/04


>>  1  movd       xmm0, [bb]
>>  1  movd       xmm2, [bb+4]
>>  2  mov        rax, weights63
>>  3  pxor       xmm5, xmm5
>>  3  movdqa     xmm4, [and_constant]
>>  4  punpcklbw  xmm0, xmm0
>>  4  movdqa     xmm6, [rax] ; prefetch the line
>>  5  punpcklbw  xmm2, xmm2
>>  6  punpcklbw  xmm0, xmm0
>>  7  punpcklbw  xmm2, xmm2
>>  8  movdqa     xmm1, xmm0
>>  9  movdqa     xmm3, xmm2
>> 10  punpcklbw  xmm0, xmm0
>> 11  punpcklbw  xmm2, xmm2
>> 12  punpckhbw  xmm1, xmm1
>> 13  punpckhbw  xmm3, xmm3
>> 14  pandn      xmm0, xmm4           ; select bit out of each byte
>> 15  pandn      xmm1, xmm4
>> 16  pandn      xmm2, xmm4
>> 17  pandn      xmm3, xmm4
>> 18  pcmpeqb    xmm0, xmm5           ; convert to 0 | 0xFF
>> 19  pcmpeqb    xmm1, xmm5
>> 20  pcmpeqb    xmm2, xmm5
>> 21  pcmpeqb    xmm3, xmm5
>> 22  pand       xmm0, xmm6           ; and with weights
>> 23  pand       xmm1, [rax+16]
>> 24  pand       xmm2, [rax+32]
>> 25  pand       xmm3, [rax+48]
>> 26  paddusb    xmm0, xmm1           ; using paddusb allows us to take risks
>> 27  paddusb    xmm0, xmm2
>> 28  paddusb    xmm0, xmm3
>> 30  psadbw     xmm0, xmm7           ; horizontal add 2 * 8 byte
>> 34  pextrw     edx, xmm0, 4         ; extract both intermediate sums to gp
>> 35  pextrw     eax, xmm0, 0
>> 40  add        eax, edx             ; final add in gp
>>>>My guess is that in a tight loop this would execute in 30 cycles/iteration.  Of
>>>>course if you have any improvements on my improvements I would love to hear them
>>>>:)  I am almost certainly going to use this code once I finish going parallel.
>>>My guess is that the pre-moves don't pay off, using four additional registers
>>>xmm6..xmm9 only once. OK, if you pass a second weight pointer, one may paddb
>>>them already here for some dynamic weights with saturation too. So
>>If prefetching at all, I suggest only one early
>>    movdqa    xmm6, [rax+48]; // or [rax] see above
>>to prefetch the weight cacheline, and later to use
>>    pand       xmm0, [rax]           ; and with weights
>>    pand       xmm1, [rax+16]
>>    pand       xmm2, [rax+32]
>>    pand       xmm3, xmm6
>>Three instructions and registers fewer, about 12+4 bytes smaller, since xmm8 and
>>xmm9 require additional prefix bytes. And it is still 32-bit compatible.
>>Btw. finally using 32-bit registers (still the default for int) may save another
>>three opcode bytes, since 32-bit ops implicitly zero-extend to 64-bit.
>>    pextrw     edx, xmm0, 4         ; extract both intermediate sums to gp
>>    pextrw     eax, xmm0, 0
>>    add        eax, edx             ; final add in gp
>Makes good sense.  We don't save anything by prefetching the first 3 anyway.
>When I get home I'll code this up and see how long it takes in a loop.
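For readers without an assembler at hand, the quoted routine can be sketched with SSE2 intrinsics. Names (dot64, sel) and the weight layout are illustrative assumptions, not from the post: weights[i] is taken to be the byte weight of bit i, and the bit-select constant repeats 0x01..0x80 to match the replication pattern where each source byte ends up repeated eight times.

```c
#include <emmintrin.h>
#include <stdint.h>

static unsigned dot64(uint64_t bb, const uint8_t weights[64])
{
    /* and_constant: byte k of each 8-byte group keeps bit k */
    static const uint8_t sel[16] =
        { 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };
    const __m128i mask = _mm_loadu_si128((const __m128i *)sel);
    const __m128i zero = _mm_setzero_si128();
    const __m128i *w   = (const __m128i *)weights;

    __m128i x0 = _mm_cvtsi32_si128((int)(uint32_t)bb);          /* movd [bb]   */
    __m128i x2 = _mm_cvtsi32_si128((int)(uint32_t)(bb >> 32));  /* movd [bb+4] */

    x0 = _mm_unpacklo_epi8(x0, x0);            /* replicate each byte twice ...  */
    x2 = _mm_unpacklo_epi8(x2, x2);
    x0 = _mm_unpacklo_epi8(x0, x0);            /* ... four times ...             */
    x2 = _mm_unpacklo_epi8(x2, x2);
    __m128i x1 = x0;
    __m128i x3 = x2;
    x0 = _mm_unpacklo_epi8(x0, x0);            /* ... eight times                */
    x1 = _mm_unpackhi_epi8(x1, x1);
    x2 = _mm_unpacklo_epi8(x2, x2);
    x3 = _mm_unpackhi_epi8(x3, x3);

    /* pandn + pcmpeqb: byte becomes 0xFF where its bit is set, else 0x00 */
    x0 = _mm_cmpeq_epi8(_mm_andnot_si128(x0, mask), zero);
    x1 = _mm_cmpeq_epi8(_mm_andnot_si128(x1, mask), zero);
    x2 = _mm_cmpeq_epi8(_mm_andnot_si128(x2, mask), zero);
    x3 = _mm_cmpeq_epi8(_mm_andnot_si128(x3, mask), zero);

    x0 = _mm_and_si128(x0, _mm_loadu_si128(w + 0));  /* and with weights */
    x1 = _mm_and_si128(x1, _mm_loadu_si128(w + 1));
    x2 = _mm_and_si128(x2, _mm_loadu_si128(w + 2));
    x3 = _mm_and_si128(x3, _mm_loadu_si128(w + 3));

    x0 = _mm_adds_epu8(x0, x1);                /* paddusb, saturating */
    x0 = _mm_adds_epu8(x0, x2);
    x0 = _mm_adds_epu8(x0, x3);

    __m128i sad = _mm_sad_epu8(x0, zero);      /* psadbw: two partial sums */
    return (unsigned)(_mm_extract_epi16(sad, 0) + _mm_extract_epi16(sad, 4));
}
```

Note the paddusb caveat from the quote carries over: byte position k accumulates the weights of bits k, 16+k, 32+k and 48+k, so sums above 255 saturate rather than wrap.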

Your original shuffling sequence makes sense
if bb on the stack is not 8- but only 4-byte aligned.

  movd       xmm0, [bb]
  movd       xmm2, [bb+4]
  punpcklbw  xmm0, xmm0
  punpcklbw  xmm2, xmm2
  punpcklbw  xmm0, xmm0
  punpcklbw  xmm2, xmm2
  movdqa     xmm1, xmm0
  movdqa     xmm3, xmm2
  punpcklbw  xmm0, xmm0
  punpcklbw  xmm2, xmm2
  punpckhbw  xmm1, xmm1
  punpckhbw  xmm3, xmm3

Another, similar shuffling sequence: one instruction less, and therefore slightly
faster with appropriate scheduling of independent initialization instructions.
It may be used with the bitboard already passed in xmm0:

  movq       xmm0, [bb]  ; 0x0000000000000000:0xf0e1d2c3b4a59687
  punpcklbw  xmm0, xmm0  ; 0xf0f0e1e1d2d2c3c3:0xb4b4a5a596968787
  movdqa     xmm2, xmm0
  punpcklwd  xmm0, xmm0  ; 0xb4b4b4b4a5a5a5a5:0x9696969687878787
  punpckhwd  xmm2, xmm2  ; 0xf0f0f0f0e1e1e1e1:0xd2d2d2d2c3c3c3c3
  movdqa     xmm1, xmm0
  movdqa     xmm3, xmm2
  punpckldq  xmm0, xmm0  ; 0x9696969696969696:0x8787878787878787
  punpckhdq  xmm1, xmm1  ; 0xb4b4b4b4b4b4b4b4:0xa5a5a5a5a5a5a5a5
  punpckldq  xmm2, xmm2  ; 0xd2d2d2d2d2d2d2d2:0xc3c3c3c3c3c3c3c3
  punpckhdq  xmm3, xmm3  ; 0xf0f0f0f0f0f0f0f0:0xe1e1e1e1e1e1e1e1

The Unpack and Interleave instructions became familiar now ;-)
Anyway, I'm not quite sure whether to take this additional overhead or to stay
with the rotated weights.

  movq       xmm0, [bb]  ; 0x0000000000000000:0xf0e1d2c3b4a59687
  punpcklqdq xmm0, xmm0  ; 0xf0e1d2c3b4a59687:0xf0e1d2c3b4a59687
  movdqa     xmm1, xmm0  ; 0xf0e1d2c3b4a59687:0xf0e1d2c3b4a59687
  movdqa     xmm2, xmm0  ; 0xf0e1d2c3b4a59687:0xf0e1d2c3b4a59687
  movdqa     xmm3, xmm0  ; 0xf0e1d2c3b4a59687:0xf0e1d2c3b4a59687
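One way the broadcast variant could work in C intrinsics: since every register then holds the full bitboard twice, each register selects a different pair of bit positions of every byte, and the weight table is reordered ("rotated") to match. The exact layout and the names (rotate_weights, dot64_rot) are my assumption of what rotated weights means here, not taken from the post:

```c
#include <emmintrin.h>
#include <stdint.h>

/* rw[16*r + j] holds the weight of bit (2*r + (j>>3)) of byte (j&7) of bb */
static void rotate_weights(const uint8_t w[64], uint8_t rw[64])
{
    for (int r = 0; r < 4; ++r)
        for (int j = 0; j < 16; ++j)
            rw[16 * r + j] = w[8 * (j & 7) + 2 * r + (j >> 3)];
}

static unsigned dot64_rot(uint64_t bb, const uint8_t rw[64])
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i b = _mm_set_epi64x((int64_t)bb, (int64_t)bb); /* punpcklqdq */
    __m128i acc = zero;
    for (int r = 0; r < 4; ++r) {
        uint8_t m[16];                      /* bit-select mask for register r */
        for (int j = 0; j < 16; ++j)
            m[j] = (uint8_t)(1u << (2 * r + (j >> 3)));
        __m128i t = _mm_andnot_si128(b, _mm_loadu_si128((const __m128i *)m));
        t = _mm_cmpeq_epi8(t, zero);        /* 0xFF where the bit is set */
        t = _mm_and_si128(t, _mm_loadu_si128((const __m128i *)(rw + 16 * r)));
        acc = _mm_adds_epu8(acc, t);        /* paddusb */
    }
    __m128i sad = _mm_sad_epu8(acc, zero);  /* psadbw */
    return (unsigned)(_mm_extract_epi16(sad, 0) + _mm_extract_epi16(sad, 4));
}
```

In real code the four bit-select masks and the rotated table would of course be precomputed constants; the loop is only for compactness here.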

Rotating a square index here and there is not that expensive...
I plan to have most weights already precomputed, probably indexed by some
(king) squares and side.



Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.