Computer Chess Club Archives




Subject: Re: SSE2 bit[64] * byte[64] dot product

Author: Gerd Isenberg

Date: 02:37:22 07/21/04


On July 21, 2004 at 02:03:46, Gerd Isenberg wrote:

>On July 20, 2004 at 15:57:36, Anthony Cozzie wrote:
>>Two more tricks:
>>First, if I understand the opteron correctly it can execute 3 SSE2 instructions
>>every 2 cycles _if_ there are 2 arithmetic and 1 cache instruction in this
>>bundle.  Therefore, I pushed all the loads farther up (this also has the
>>advantage of mitigating cache misses), and reduced the mask to a single 16 byte
>>constant.  This routine will only execute in 64-bit mode because it requires 10
>>XMM registers.  Also, I used unsigned saturated addition, which means that in
>>practice the values can exceed 63 (although you take some risks).  There is a
>>big hole at the end of the routine: it takes 10 cycles to get the results out of
>>the XMM register back into the integer pipe.  Since all the cache accesses are
>>near the front of the routine, it should be possible for the processor to
>>interleave some integer code at the end.
>Ahh, 64 bit mode already:
  1  movd       xmm0, [bb]
  1  movd       xmm2, [bb+4]
  2  mov        rax, weights63
  3  pxor       xmm5, xmm5
  3  movdqa     xmm4, [and_constant]
  4  punpcklbw  xmm0, xmm0
  4  movdqa     xmm6, [rax] ; prefetch the line
  5  punpcklbw  xmm2, xmm2
  6  punpcklbw  xmm0, xmm0
  7  punpcklbw  xmm2, xmm2
  8  movdqa     xmm1, xmm0
  9  movdqa     xmm3, xmm2
 10  punpcklbw  xmm0, xmm0
 11  punpcklbw  xmm2, xmm2
 12  punpckhbw  xmm1, xmm1
 13  punpckhbw  xmm3, xmm3
 14  pandn      xmm0, xmm4           ; select bit out of each byte
 15  pandn      xmm1, xmm4
 16  pandn      xmm2, xmm4
 17  pandn      xmm3, xmm4
 18  pcmpeqb    xmm0, xmm5           ; convert to 0 | 0xFF
 19  pcmpeqb    xmm1, xmm5
 20  pcmpeqb    xmm2, xmm5
 21  pcmpeqb    xmm3, xmm5
 22  pand       xmm0, xmm6           ; and with weights
 23  pand       xmm1, [rax+16]
 24  pand       xmm2, [rax+32]
 25  pand       xmm3, [rax+48]
 26  paddusb    xmm0, xmm1           ; using paddusb allows us to take risks
 27  paddusb    xmm0, xmm2
 28  paddusb    xmm0, xmm3
 30  psadbw     xmm0, xmm5           ; horizontal add 2 * 8 bytes (xmm5 is zero)
 34  pextrw     edx, xmm0, 4         ; extract both intermediate sums to gp
 35  pextrw     eax, xmm0, 0
 40  add        eax, edx             ; final add in gp
>>My guess is that in a tight loop this would execute in 30 cycles/iteration.  Of
>>course if you have any improvements on my improvements I would love to hear them
>>:)  I am almost certainly going to use this code once I finish going parallel.
>My guess is that the pre moves don't pay off using four additional registers
>xmm6..9 only one time. Ok, if you pass a second weight pointer one may paddb
>them already here for some dynamic weights with saturation too. So

If prefetching at all, I suggest only one early

    movdqa     xmm6, [rax+48]        ; or [rax], see above

to prefetch the weight cache line, and later use

    pand       xmm0, [rax]           ; and with weights
    pand       xmm1, [rax+16]
    pand       xmm2, [rax+32]
    pand       xmm3, xmm6

That saves three instructions and three registers, and about 12+4 bytes, since
xmm8 and xmm9 require additional prefix bytes. It also stays 32-bit compatible.
Btw., using 32-bit registers for the final steps (still the default for int) may
save another three opcode bytes. 32-bit ops implicitly zero-extend to 64 bits.

    pextrw     edx, xmm0, 4         ; extract both intermediate sums to gp
    pextrw     eax, xmm0, 0
    add        eax, edx             ; final add in gp
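
For anyone who wants to check the routine against a compiler build, here is a minimal C sketch of the same instruction sequence with SSE2 intrinsics, plus a scalar model of what it computes. The names dot64_sse2, dot64_model, dot64_example and bit_mask_bytes are hypothetical and not from this thread. The contents of and_constant are not shown above, so the 1, 2, 4, ..., 128 pattern is an assumption. The sketch uses unaligned loads, so the weights need no 16-byte alignment here, unlike the movdqa/pand memory operands in the assembly.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Assumed contents of and_constant: one selector bit per byte, repeated twice. */
static const uint8_t bit_mask_bytes[16] = {
    1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
};

/* Scalar model: byte lane j accumulates weights j, j+16, j+32 and j+48 with
   unsigned saturation (paddusb); the 16 lane sums are then added exactly
   (psadbw plus the final integer add). */
static unsigned dot64_model(uint64_t bb, const uint8_t w[64])
{
    unsigned total = 0;
    for (int lane = 0; lane < 16; lane++) {
        unsigned sum = 0;
        for (int group = 0; group < 4; group++) {
            int i = 16 * group + lane;
            if ((bb >> i) & 1)
                sum += w[i];
            if (sum > 255)
                sum = 255;              /* saturate like paddusb */
        }
        total += sum;
    }
    return total;
}

static unsigned dot64_sse2(uint64_t bb, const uint8_t w[64])
{
    const __m128i mask = _mm_loadu_si128((const __m128i *)bit_mask_bytes);
    const __m128i zero = _mm_setzero_si128();

    /* movd: low and high 32 bits of the bitboard */
    __m128i lo = _mm_cvtsi32_si128((int)(uint32_t)bb);
    __m128i hi = _mm_cvtsi32_si128((int)(uint32_t)(bb >> 32));

    /* punpcklbw twice: every source byte is now repeated four times */
    lo = _mm_unpacklo_epi8(lo, lo);
    hi = _mm_unpacklo_epi8(hi, hi);
    lo = _mm_unpacklo_epi8(lo, lo);
    hi = _mm_unpacklo_epi8(hi, hi);

    /* third unpack: each register holds two source bytes, eight copies each */
    __m128i x0 = _mm_unpacklo_epi8(lo, lo);
    __m128i x1 = _mm_unpackhi_epi8(lo, lo);
    __m128i x2 = _mm_unpacklo_epi8(hi, hi);
    __m128i x3 = _mm_unpackhi_epi8(hi, hi);

    /* pandn + pcmpeqb: 0xFF where the bit is set, 0 otherwise */
    x0 = _mm_cmpeq_epi8(_mm_andnot_si128(x0, mask), zero);
    x1 = _mm_cmpeq_epi8(_mm_andnot_si128(x1, mask), zero);
    x2 = _mm_cmpeq_epi8(_mm_andnot_si128(x2, mask), zero);
    x3 = _mm_cmpeq_epi8(_mm_andnot_si128(x3, mask), zero);

    /* pand with the weights, taken straight from memory */
    x0 = _mm_and_si128(x0, _mm_loadu_si128((const __m128i *)(w + 0)));
    x1 = _mm_and_si128(x1, _mm_loadu_si128((const __m128i *)(w + 16)));
    x2 = _mm_and_si128(x2, _mm_loadu_si128((const __m128i *)(w + 32)));
    x3 = _mm_and_si128(x3, _mm_loadu_si128((const __m128i *)(w + 48)));

    /* paddusb three times, then psadbw against zero */
    x0 = _mm_adds_epu8(_mm_adds_epu8(x0, x1), _mm_adds_epu8(x2, x3));
    x0 = _mm_sad_epu8(x0, zero);

    /* pextrw words 0 and 4, final add in a 32-bit register */
    return (unsigned)_mm_extract_epi16(x0, 0)
         + (unsigned)_mm_extract_epi16(x0, 4);
}

/* Usage example: with small weights nothing saturates. */
static void dot64_example(void)
{
    uint8_t w[64];
    for (int i = 0; i < 64; i++)
        w[i] = (uint8_t)i;
    assert(dot64_sse2(0x8000000000000001ULL, w) == 63);   /* bits 0 and 63 */
    assert(dot64_sse2(~0ULL, w) == dot64_model(~0ULL, w));
}
```

The scalar model also shows the risk of paddusb: with all 64 weights at 100 and all bits set, each of the 16 lanes clamps to 255, so both functions return 16 * 255 = 4080 instead of the exact 6400.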




Last modified: Thu, 07 Jul 11 08:48:38 -0700
