Computer Chess Club Archives


Subject: Re: SSE2 bit[64] * byte[64] dot product

Author: Gerd Isenberg

Date: 23:03:46 07/20/04


On July 20, 2004 at 15:57:36, Anthony Cozzie wrote:

>Two more tricks:
>
>First, if I understand the opteron correctly it can execute 3 SSE2 instructions
>every 2 cycles _if_ there are 2 arithmetic and 1 cache instruction in this
>bundle.  Therefore, I pushed all the loads farther up (this also has the
>advantage of mitigating cache misses), and reduced the mask to a single 16 byte
>constant.  This routine will only execute in 64-bit mode because it requires 10
>XMM registers.  Also, I used unsigned saturated addition, which means that in
>practice the values can exceed 63 (although you take some risks).  There is a
>big hole at the end of the routine: it takes 10 cycles to get the results out of
>the XMM register back into the integer pipe.  Since all the cache accesses are
>near the front of the routine, it should be possible for the processor to
>interleave some integer code at the end.
>

Ahh, 64 bit mode already:


>  1  movd       xmm0, [bb]
>  1  movd       xmm2, [bb+4]
>  2  movq        rax, weights63
>  3  pxor       xmm5, xmm5
>  3  movdqa     xmm4, [and_constant]
>  4  punpcklbw  xmm0, xmm0
>  4  movdqa     xmm6, [rax]
>  5  punpcklbw  xmm2, xmm2
>  6  punpcklbw  xmm0, xmm0
>  6  movdqa     xmm7, [rax+16]
>  7  punpcklbw  xmm2, xmm2
>  8  movdqa     xmm8, [rax+32]
>  8  movdqa     xmm1, xmm0
>  9  movdqa     xmm3, xmm2
> 10  punpcklbw  xmm0, xmm0
> 10  movdqa     xmm9, [rax+48]
> 11  punpcklbw  xmm2, xmm2
> 12  punpckhbw  xmm1, xmm1
> 13  punpckhbw  xmm3, xmm3
> 14  pandn      xmm0, xmm4           ; select bit out of each byte
> 15  pandn      xmm1, xmm4
> 16  pandn      xmm2, xmm4
> 17  pandn      xmm3, xmm4
> 18  pcmpeqb    xmm0, xmm5           ; convert to 0 | 0xFF
> 19  pcmpeqb    xmm1, xmm5
> 20  pcmpeqb    xmm2, xmm5
> 21  pcmpeqb    xmm3, xmm5
> 22  pand       xmm0, xmm6           ; and with weights
> 23  pand       xmm1, xmm7
> 24  pand       xmm2, xmm8
> 25  pand       xmm3, xmm9
> 26  paddusb    xmm0, xmm1           ; using paddusb allows us to take risks
> 27  paddusb    xmm0, xmm2
> 28  paddusb    xmm0, xmm3

Nice trick with Packed Add Unsigned with Saturation Bytes!
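To make the difference concrete, here is a minimal C sketch using SSE2 intrinsics, where `_mm_add_epi8` maps to paddb and `_mm_adds_epu8` maps to paddusb. The helper names `add_wrap_u8` and `add_sat_u8` are made up for illustration and are not part of the routine above. With plain paddb an overflowing byte wraps around, while paddusb clamps it at 255, so an occasional weight above 63 degrades the sum gracefully instead of corrupting it.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: byte add with wraparound (paddb), result read from lane 0. */
static uint8_t add_wrap_u8(uint8_t a, uint8_t b) {
    __m128i r = _mm_add_epi8(_mm_set1_epi8((char)a), _mm_set1_epi8((char)b));
    return (uint8_t)_mm_cvtsi128_si32(r);   /* low byte of lane 0 */
}

/* Hypothetical helper: byte add with unsigned saturation (paddusb). */
static uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    __m128i r = _mm_adds_epu8(_mm_set1_epi8((char)a), _mm_set1_epi8((char)b));
    return (uint8_t)_mm_cvtsi128_si32(r);   /* low byte of lane 0 */
}
```

For example, 200 + 100 wraps to 44 with paddb but clamps to 255 with paddusb.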

> 30  psadbw     xmm0, xmm5           ; horizontal add 2 * 8 byte (xmm5 is zero)
> 34  pextrw     rdx, xmm0, 4         ; extract both intermediate sums to gp
> 35  pextrw     rax, xmm0, 0
> 40  add        rax, rdx             ; final add in gp
>
>My guess is that in a tight loop this would execute in 30 cycles/iteration.  Of
>course if you have any improvements on my improvements I would love to hear them
>:)  I am almost certainly going to use this code once I finish going parallel.

Yes.
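For readers who prefer intrinsics, the quoted routine can be sketched in C roughly as follows. This is only an illustration under some assumptions: bit i of the bitboard corresponds to `weights[i]` (little-endian layout), unaligned loads are used instead of movdqa, the compiler rather than the programmer schedules the instructions, and the names `dot64_sse2`, `dot64_scalar`, `spread4` and `select16` are invented here. The result is exact as long as the four weights that share a byte lane sum to at most 255, which always holds for weights up to 63.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference: sum of weights[i] over all set bits i of bb. */
static unsigned dot64_scalar(uint64_t bb, const uint8_t weights[64]) {
    unsigned sum = 0;
    for (int i = 0; i < 64; ++i)
        if ((bb >> i) & 1) sum += weights[i];
    return sum;
}

/* Replicate each of the 4 bytes of x four times (two punpcklbw steps). */
static __m128i spread4(uint32_t x) {
    __m128i v = _mm_cvtsi32_si128((int)x);   /* movd */
    v = _mm_unpacklo_epi8(v, v);
    return _mm_unpacklo_epi8(v, v);
}

/* rep holds two bitboard bytes, each replicated 8 times. Build a 0x00/0xFF
   byte mask from their 16 bits and AND it with 16 weights. */
static __m128i select16(__m128i rep, __m128i bitmask, const uint8_t *w) {
    __m128i miss = _mm_andnot_si128(rep, bitmask);             /* pandn: zero where the bit is set */
    __m128i sel  = _mm_cmpeq_epi8(miss, _mm_setzero_si128());  /* pcmpeqb: 0xFF where the bit is set */
    return _mm_and_si128(sel, _mm_loadu_si128((const __m128i *)w));
}

static unsigned dot64_sse2(uint64_t bb, const uint8_t weights[64]) {
    /* One bit per byte: 0x01, 0x02, ..., 0x80, repeated in both halves. */
    const __m128i bitmask = _mm_set1_epi64x((long long)0x8040201008040201ULL);
    __m128i lo = spread4((uint32_t)bb);          /* bitboard bytes 0..3 */
    __m128i hi = spread4((uint32_t)(bb >> 32));  /* bitboard bytes 4..7 */

    /* Third unpack step: each register now covers 16 squares. */
    __m128i s0 = select16(_mm_unpacklo_epi8(lo, lo), bitmask, weights);
    __m128i s1 = select16(_mm_unpackhi_epi8(lo, lo), bitmask, weights + 16);
    __m128i s2 = select16(_mm_unpacklo_epi8(hi, hi), bitmask, weights + 32);
    __m128i s3 = select16(_mm_unpackhi_epi8(hi, hi), bitmask, weights + 48);

    /* paddusb: vertical add with unsigned saturation. */
    __m128i acc = _mm_adds_epu8(_mm_adds_epu8(_mm_adds_epu8(s0, s1), s2), s3);

    /* psadbw against zero: horizontal add of each 8-byte half. */
    __m128i sad = _mm_sad_epu8(acc, _mm_setzero_si128());
    return (unsigned)_mm_extract_epi16(sad, 0) + (unsigned)_mm_extract_epi16(sad, 4);
}
```

Comparing `dot64_sse2` against `dot64_scalar` on a few bitboards is a cheap way to check the byte ordering of the weight table.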

My guess is that the early loads don't pay off, since the four additional
registers xmm6..xmm9 are each used only once. OK, if you pass a second weight
pointer, one may already add the two weight sets here, with saturation too, to
get some dynamic weights. So
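The second-weight-pointer idea might look like the following C sketch: two 64-byte weight tables are merged with a saturating byte add before the bit mask is applied. The function name `combine_weights64` and the store to an output buffer are assumptions made to keep the example self-contained; in the actual routine the sums would presumably stay in xmm6..xmm9 rather than go back to memory.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: out[i] = min(255, a[i] + b[i]) for 64 byte weights,
   using four paddusb operations on 16 bytes each. */
static void combine_weights64(const uint8_t a[64], const uint8_t b[64],
                              uint8_t out[64]) {
    for (int i = 0; i < 64; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_adds_epu8(va, vb));
    }
}
```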

>
>My statement about the integer pipes is pretty simple.  The opteron can execute
>1 XMM instruction per clock -> 128 bits of math.  However, it can execute 3
>integer instructions per clock -> 192 bits of math.  In a future core that can
>handle 4 vector instructions at once, the balance shifts, of course :)
>
>anthony

Yes, but doing SWAgpR, some additional instructions are needed as well.
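As an illustration of that overhead, assuming SWAgpR means SIMD within a general-purpose register: a byte-wise add of eight packed bytes in one 64-bit register needs extra masking so that carries do not spill into the neighbouring byte, where paddb is a single instruction. The function name `swar_add_bytes` is made up for this sketch, and it implements a wrapping add only; saturation would cost still more instructions.

```c
#include <stdint.h>

/* Add eight packed bytes held in 64-bit integers, wrapping within each byte.
   The top bit of every byte is handled separately so that no carry crosses
   a byte boundary. */
static uint64_t swar_add_bytes(uint64_t x, uint64_t y) {
    const uint64_t H = 0x8080808080808080ULL;  /* top bit of every byte */
    uint64_t low7 = (x & ~H) + (y & ~H);       /* add the low 7 bits of each byte */
    return low7 ^ ((x ^ y) & H);               /* fold in the top bits without carry */
}
```

That is two ANDs, one add and two XORs, plus loading the constant, for 64 bits of work.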

Gerd



Last modified: Thu, 15 Apr 21 08:11:13 -0700
