Computer Chess Club Archives




Subject: Re: SSE2 bit[64] * byte[64] dot product

Author: Anthony Cozzie

Date: 12:57:36 07/20/04


Two more tricks:

First, if I understand the Opteron correctly, it can execute 3 SSE2 instructions
every 2 cycles _if_ the bundle contains 2 arithmetic instructions and 1 cache
(load) instruction.  Therefore, I pushed all the loads farther up (this also
has the advantage of mitigating cache misses) and reduced the mask to a single
16-byte constant.  This routine will only run in 64-bit mode because it needs
10 XMM registers (xmm0-xmm9), and xmm8 and up only exist in 64-bit mode.  Also,
I used unsigned saturated addition (paddusb), which means that in practice the
weights can exceed 63, although you take some risks: if the weights landing in
one byte lane sum past 255, the result saturates at 255 and the total comes out
too low.  There is a big hole at the end of the routine: it takes about 10
cycles to get the results out of the XMM register and back into the integer
pipe.  Since all the cache accesses are near the front of the routine, it
should be possible for the processor to interleave some integer code at the
end.

  1  movd       xmm0, [bb]
  1  movd       xmm2, [bb+4]
  2  lea        rax, [weights63]     ; address of the 64-byte weight table
  3  pxor       xmm5, xmm5
  3  movdqa     xmm4, [and_constant]
  4  punpcklbw  xmm0, xmm0
  4  movdqa     xmm6, [rax]
  5  punpcklbw  xmm2, xmm2
  6  punpcklbw  xmm0, xmm0
  6  movdqa     xmm7, [rax+16]
  7  punpcklbw  xmm2, xmm2
  8  movdqa     xmm8, [rax+32]
  8  movdqa     xmm1, xmm0
  9  movdqa     xmm3, xmm2
 10  punpcklbw  xmm0, xmm0
 10  movdqa     xmm9, [rax+48]
 11  punpcklbw  xmm2, xmm2
 12  punpckhbw  xmm1, xmm1
 13  punpckhbw  xmm3, xmm3
 14  pandn      xmm0, xmm4           ; select bit out of each byte
 15  pandn      xmm1, xmm4
 16  pandn      xmm2, xmm4
 17  pandn      xmm3, xmm4
 18  pcmpeqb    xmm0, xmm5           ; convert to 0 | 0xFF
 19  pcmpeqb    xmm1, xmm5
 20  pcmpeqb    xmm2, xmm5
 21  pcmpeqb    xmm3, xmm5
 22  pand       xmm0, xmm6           ; and with weights
 23  pand       xmm1, xmm7
 24  pand       xmm2, xmm8
 25  pand       xmm3, xmm9
 26  paddusb    xmm0, xmm1           ; using paddusb allows us to take risks
 27  paddusb    xmm0, xmm2
 28  paddusb    xmm0, xmm3
 30  psadbw     xmm0, xmm5           ; horizontal add 2 * 8 byte (xmm5 is still zero)
 34  pextrw     rdx, xmm0, 4         ; extract both intermediate sums to gp
 35  pextrw     rax, xmm0, 0
 40  add        rax, rdx             ; final add in gp
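
For anyone who would rather experiment from C, here is a rough sketch of the same sequence written with SSE2 intrinsics.  It is not the code from this post, just a line-by-line transcription under a few assumptions: the function names dot64_scalar and dot64_sse2 are made up for illustration, the mask bytes are taken to be 1,2,4,...,128 repeated twice (which is what the "select bit out of each byte" step needs), and unaligned loads are used so the weight table does not have to be 16-byte aligned the way movdqa requires.  The compiler will choose its own schedule, so the cycle counts above do not carry over.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference: add weights[i] for every set bit i of bb. */
static unsigned dot64_scalar(uint64_t bb, const uint8_t weights[64])
{
    unsigned sum = 0;
    for (int i = 0; i < 64; i++)
        if ((bb >> i) & 1)
            sum += weights[i];
    return sum;
}

/* SSE2 sketch following the listing.  Assumes the four weights that share
   a byte lane (i, i+16, i+32, i+48) never sum past 255. */
static unsigned dot64_sse2(uint64_t bb, const uint8_t weights[64])
{
    /* Assumed contents of and_constant: one bit per byte, both halves. */
    static const uint8_t and_constant[16] = {
        1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
    };
    const __m128i zero = _mm_setzero_si128();                           /* pxor   */
    const __m128i mask = _mm_loadu_si128((const __m128i *)and_constant);

    __m128i x0 = _mm_cvtsi32_si128((int)(uint32_t)bb);                  /* movd   */
    __m128i x2 = _mm_cvtsi32_si128((int)(uint32_t)(bb >> 32));

    /* Two unpacks replicate each of the 4 source bytes 4 times. */
    x0 = _mm_unpacklo_epi8(x0, x0);
    x2 = _mm_unpacklo_epi8(x2, x2);
    x0 = _mm_unpacklo_epi8(x0, x0);
    x2 = _mm_unpacklo_epi8(x2, x2);

    /* Third unpack: low half gives bytes 0,1 x8; high half bytes 2,3 x8. */
    __m128i x1 = _mm_unpackhi_epi8(x0, x0);
    __m128i x3 = _mm_unpackhi_epi8(x2, x2);
    x0 = _mm_unpacklo_epi8(x0, x0);
    x2 = _mm_unpacklo_epi8(x2, x2);

    /* pandn: (~x) & mask is zero exactly where the bit is set. */
    x0 = _mm_andnot_si128(x0, mask);
    x1 = _mm_andnot_si128(x1, mask);
    x2 = _mm_andnot_si128(x2, mask);
    x3 = _mm_andnot_si128(x3, mask);

    /* pcmpeqb against zero: 0xFF where the bit is set, 0 elsewhere. */
    x0 = _mm_cmpeq_epi8(x0, zero);
    x1 = _mm_cmpeq_epi8(x1, zero);
    x2 = _mm_cmpeq_epi8(x2, zero);
    x3 = _mm_cmpeq_epi8(x3, zero);

    /* pand with the weights, 16 at a time. */
    x0 = _mm_and_si128(x0, _mm_loadu_si128((const __m128i *)(weights +  0)));
    x1 = _mm_and_si128(x1, _mm_loadu_si128((const __m128i *)(weights + 16)));
    x2 = _mm_and_si128(x2, _mm_loadu_si128((const __m128i *)(weights + 32)));
    x3 = _mm_and_si128(x3, _mm_loadu_si128((const __m128i *)(weights + 48)));

    /* paddusb: lane-wise unsigned saturating adds. */
    x0 = _mm_adds_epu8(x0, x1);
    x0 = _mm_adds_epu8(x0, x2);
    x0 = _mm_adds_epu8(x0, x3);

    /* psadbw against zero: two 8-byte horizontal sums in words 0 and 4. */
    x0 = _mm_sad_epu8(x0, zero);
    return (unsigned)_mm_extract_epi16(x0, 0)
         + (unsigned)_mm_extract_epi16(x0, 4);
}
```

Comparing dot64_sse2 against dot64_scalar over a batch of random bitboards is a cheap way to catch a swapped register or a wrong mask byte before trusting the hand-scheduled version.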

My guess is that in a tight loop this would execute in 30 cycles/iteration.  Of
course if you have any improvements on my improvements I would love to hear them
:)  I am almost certainly going to use this code once I finish going parallel.
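
On the paddusb risk: the listing adds the four weight vectors lane by lane, so byte lane i of the accumulator receives at most weights[i] + weights[i+16] + weights[i+32] + weights[i+48].  Weights up to 63 are always safe (4 * 63 = 252), and larger ones are fine as long as no lane can pass 255.  A small sketch of a check one could run once over a weight table follows; the helper name weights_are_safe is hypothetical and not part of the routine.

```c
#include <stdint.h>

/* Returns 1 if no byte lane can exceed 255 even when all 64 bits are set,
   i.e. paddusb can never saturate for this weight table; 0 otherwise.
   Lane i collects weights[i], weights[i+16], weights[i+32], weights[i+48]. */
static int weights_are_safe(const uint8_t weights[64])
{
    for (int lane = 0; lane < 16; lane++) {
        unsigned worst = (unsigned)weights[lane]
                       + (unsigned)weights[lane + 16]
                       + (unsigned)weights[lane + 32]
                       + (unsigned)weights[lane + 48];
        if (worst > 255)
            return 0;   /* this lane could saturate */
    }
    return 1;
}
```

If the check fails, the result can still be exact for bitboards that never set all four bits of the offending lane at once, which is presumably the kind of risk meant above.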

My statement about the integer pipes is pretty simple.  The Opteron can execute
1 XMM instruction per clock -> 128 bits of math.  However, it can execute 3
64-bit integer instructions per clock -> 192 bits of math.  In a future core
that can handle 4 vector instructions at once, the balance shifts, of course :)



Last modified: Thu, 07 Jul 11 08:48:38 -0700
