Author: Anthony Cozzie
Date: 06:32:22 07/21/04
On July 21, 2004 at 05:37:22, Gerd Isenberg wrote:

>On July 21, 2004 at 02:03:46, Gerd Isenberg wrote:
>
>>On July 20, 2004 at 15:57:36, Anthony Cozzie wrote:
>>
>>>Two more tricks:
>>>
>>>First, if I understand the Opteron correctly, it can execute three SSE2
>>>instructions every two cycles _if_ the bundle contains two arithmetic and
>>>one cache instruction. Therefore, I pushed all the loads further up (which
>>>also has the advantage of mitigating cache misses) and reduced the mask to
>>>a single 16-byte constant. This routine will only execute in 64-bit mode
>>>because it requires 10 XMM registers. Also, I used unsigned saturated
>>>addition, which means that in practice the values can exceed 63 (although
>>>you take some risks). There is a big hole at the end of the routine: it
>>>takes 10 cycles to get the result out of the XMM register back into the
>>>integer pipe. Since all the cache accesses are near the front of the
>>>routine, it should be possible for the processor to interleave some
>>>integer code at the end.
>>
>>Ahh, 64 bit mode already:
>>
>  1  movd      xmm0, [bb]
>  1  movd      xmm2, [bb+4]
>  2  mov       rax, weights63
>  3  pxor      xmm5, xmm5
>  3  movdqa    xmm4, [and_constant]
>  4  punpcklbw xmm0, xmm0
>  4  movdqa    xmm6, [rax]        ; prefetch the line
>  5  punpcklbw xmm2, xmm2
>  6  punpcklbw xmm0, xmm0
>  7  punpcklbw xmm2, xmm2
>  8  movdqa    xmm1, xmm0
>  9  movdqa    xmm3, xmm2
> 10  punpcklbw xmm0, xmm0
> 11  punpcklbw xmm2, xmm2
> 12  punpckhbw xmm1, xmm1
> 13  punpckhbw xmm3, xmm3
> 14  pandn     xmm0, xmm4         ; select bit out of each byte
> 15  pandn     xmm1, xmm4
> 16  pandn     xmm2, xmm4
> 17  pandn     xmm3, xmm4
> 18  pcmpeqb   xmm0, xmm5         ; convert to 0 | 0xFF
> 19  pcmpeqb   xmm1, xmm5
> 20  pcmpeqb   xmm2, xmm5
> 21  pcmpeqb   xmm3, xmm5
> 22  pand      xmm0, xmm6         ; and with weights
> 23  pand      xmm1, [rax+16]
> 24  pand      xmm2, [rax+32]
> 25  pand      xmm3, [rax+48]
> 26  paddusb   xmm0, xmm1         ; using paddusb allows us to take risks
> 27  paddusb   xmm0, xmm2
> 28  paddusb   xmm0, xmm3
> 30  psadbw    xmm0, xmm5         ; horizontal add of 2 * 8 bytes (xmm5 is still zero)
> 34  pextrw    edx, xmm0, 4       ; extract both intermediate sums to gp
> 35  pextrw    eax, xmm0, 0
> 40  add       eax, edx           ; final add in gp
>
>>>My guess is that in a tight loop this would execute in 30 cycles/iteration.
>>>Of course, if you have any improvements on my improvements I would love to
>>>hear them :) I am almost certainly going to use this code once I finish
>>>going parallel.
>>
>>Yes.
>>
>>My guess is that the pre-moves don't pay off, using the four additional
>>registers xmm6..9 only one time. OK, if you pass a second weight pointer,
>>one may already paddb them here for some dynamic weights, with saturation
>>too. So
>
>If prefetching at all, I suggest only one early
>
>  movdqa xmm6, [rax+48]   ; or [rax], see above
>
>to prefetch the weight cacheline, and later to use
>
>  pand xmm0, [rax]        ; and with weights
>  pand xmm1, [rax+16]
>  pand xmm2, [rax+32]
>  pand xmm3, xmm6
>
>Three instructions and registers less, and about 12+4 bytes saved, since
>xmm8 and xmm9 require additional prefix bytes. And it is still 32-bit
>compatible.
>Btw.
>using 32-bit registers at the end (still the default for int) may save
>another three bytes of opcode. 32-bit ops implicitly zero-extend to 64-bit.
>
>  pextrw edx, xmm0, 4  ; extract both intermediate sums to gp
>  pextrw eax, xmm0, 0
>  add    eax, edx      ; final add in gp

Makes good sense. We don't save anything by prefetching the first 3 anyway.
When I get home I'll code this up and see how long it takes in a loop.

anthony
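[For anyone who would rather "code this up" with SSE2 intrinsics than raw assembly, here is a sketch of the same byte-lane trick. This is only an illustration, not Anthony's actual routine: the names are mine, it processes the board in four 16-byte groups rather than fully unrolling, and it accumulates the two psadbw sums with plain integer adds instead of paddusb, so there is no saturation risk:]

```c
#include <emmintrin.h>
#include <stdint.h>

/* Illustrative SSE2 version: sum w[i] over every set bit i of bb.
   Each pair of bitboard bytes is broadcast to 16 byte lanes, the per-lane
   bit is selected with andnot + cmpeq (0xFF where set), ANDed with the
   weights, and horizontally summed with psadbw. */
static unsigned weighted_sum_sse2(uint64_t bb, const uint8_t w[64])
{
    const __m128i zero = _mm_setzero_si128();
    /* bit-select constant: bytes 01,02,04,...,80 repeated twice */
    const __m128i mask = _mm_set1_epi64x((long long)0x8040201008040201ULL);
    unsigned total = 0;

    for (int g = 0; g < 4; g++) {
        /* bytes 2g and 2g+1 of the bitboard */
        __m128i v = _mm_cvtsi32_si128((int)((bb >> (16 * g)) & 0xFFFF));
        v = _mm_unpacklo_epi8(v, v);   /* b0 b0 b1 b1 ...          */
        v = _mm_unpacklo_epi8(v, v);   /* b0 x4, b1 x4             */
        v = _mm_unpacklo_epi8(v, v);   /* b0 x8, b1 x8             */
        /* ~v & mask is 0 exactly in lanes whose bit is set */
        v = _mm_cmpeq_epi8(_mm_andnot_si128(v, mask), zero);
        /* keep the weights of the set bits */
        v = _mm_and_si128(v, _mm_loadu_si128((const __m128i *)(w + 16 * g)));
        /* psadbw: two 16-bit partial sums in word lanes 0 and 4 */
        __m128i s = _mm_sad_epu8(v, zero);
        total += (unsigned)_mm_extract_epi16(s, 0)
               + (unsigned)_mm_extract_epi16(s, 4);
    }
    return total;
}
```

A compiler will not schedule this as aggressively as the hand-written listing above, but it is a convenient harness for timing the idea in a loop before committing to assembly.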
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.