Author: Gerd Isenberg
Date: 02:37:22 07/21/04
On July 21, 2004 at 02:03:46, Gerd Isenberg wrote:

>On July 20, 2004 at 15:57:36, Anthony Cozzie wrote:
>
>>Two more tricks:
>>
>>First, if I understand the Opteron correctly it can execute 3 SSE2 instructions
>>every 2 cycles _if_ there are 2 arithmetic and 1 cache instruction in this
>>bundle. Therefore, I pushed all the loads farther up (this also has the
>>advantage of mitigating cache misses), and reduced the mask to a single 16-byte
>>constant. This routine will only execute in 64-bit mode because it requires 10
>>XMM registers. Also, I used unsigned saturated addition, which means that in
>>practice the values can exceed 63 (although you take some risks). There is a
>>big hole at the end of the routine: it takes 10 cycles to get the results out of
>>the XMM register back into the integer pipe. Since all the cache accesses are
>>near the front of the routine, it should be possible for the processor to
>>interleave some integer code at the end.
>>
>
>Ahh, 64 bit mode already:
>

 1  movd      xmm0, [bb]
 1  movd      xmm2, [bb+4]
 2  mov       rax, weights63
 3  pxor      xmm5, xmm5
 3  movdqa    xmm4, [and_constant]
 4  punpcklbw xmm0, xmm0
 4  movdqa    xmm6, [rax]         ; prefetch the line
 5  punpcklbw xmm2, xmm2
 6  punpcklbw xmm0, xmm0
 7  punpcklbw xmm2, xmm2
 8  movdqa    xmm1, xmm0
 9  movdqa    xmm3, xmm2
10  punpcklbw xmm0, xmm0
11  punpcklbw xmm2, xmm2
12  punpckhbw xmm1, xmm1
13  punpckhbw xmm3, xmm3
14  pandn     xmm0, xmm4          ; select bit out of each byte
15  pandn     xmm1, xmm4
16  pandn     xmm2, xmm4
17  pandn     xmm3, xmm4
18  pcmpeqb   xmm0, xmm5          ; convert to 0 | 0xFF
19  pcmpeqb   xmm1, xmm5
20  pcmpeqb   xmm2, xmm5
21  pcmpeqb   xmm3, xmm5
22  pand      xmm0, xmm6          ; and with weights
23  pand      xmm1, [rax+16]
24  pand      xmm2, [rax+32]
25  pand      xmm3, [rax+48]
26  paddusb   xmm0, xmm1          ; using paddusb allows us to take risks
27  paddusb   xmm0, xmm2
28  paddusb   xmm0, xmm3
30  psadbw    xmm0, xmm7          ; horizontal add 2 * 8 byte
34  pextrw    edx, xmm0, 4        ; extract both intermediate sums to gp
35  pextrw    eax, xmm0, 0
40  add       eax, edx            ; final add in gp

>>
>>My guess is that in a tight loop this would execute in 30 cycles/iteration. Of
>>course if you have any improvements on my improvements I would love to hear them
>>:) I am almost certainly going to use this code once I finish going parallel.
>
>Yes.
>
>My guess is that the pre moves don't pay off using four additional registers
>xmm6..9 only one time. Ok, if you pass a second weight pointer one may paddb
>them already here for some dynamic weights with saturation too. So
>

If prefetching at all, I suggest only one early

   movdqa xmm6, [rax+48]   ; or [rax], see above

to prefetch the weight cache line, and later to use

   pand xmm0, [rax]        ; and with weights
   pand xmm1, [rax+16]
   pand xmm2, [rax+32]
   pand xmm3, xmm6

That is three instructions and three registers less, and saves about 12+4 bytes
of opcode, since xmm8 and xmm9 require additional prefix bytes. And it is still
32-bit compatible.

Btw. using 32-bit registers for the final extraction (still the default for int)
may save another three opcode bytes, because 32-bit ops implicitly zero-extend
to 64-bit:

   pextrw edx, xmm0, 4     ; extract both intermediate sums to gp
   pextrw eax, xmm0, 0
   add    eax, edx         ; final add in gp

Gerd

<snip>
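
For anyone who would rather prototype this bit[64] * byte[64] dot product in C
before hand-scheduling assembly, below is a rough sketch of the same idea written
with SSE2 intrinsics: unpack each source byte into eight byte lanes, pandn/pcmpeqb
to turn every bit into a 0x00/0xFF byte mask, pand with the weights, accumulate
with paddusb and finish with psadbw. The function name dot_product_64, the weight
layout (weights[i] belongs to bit i of the bitboard) and the toy values in main
are assumptions made for illustration, not code from Anthony's or Gerd's posts.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

/* Weighted popcount: sum of weights[i] over all set bits i of bb.
   Assumes weights[i] <= 63 so the saturating byte adds cannot clip. */
static unsigned dot_product_64(uint64_t bb, const uint8_t weights[64])
{
    /* bit-select mask: within each 8-byte group, byte k tests bit k */
    const __m128i sel = _mm_set_epi8(
        (char)0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01,
        (char)0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01);
    const __m128i zero = _mm_setzero_si128();

    /* replicate each source byte into 8 consecutive byte lanes
       (the punpcklbw/punpckhbw cascade from the assembly above) */
    __m128i lo = _mm_cvtsi32_si128((int)(uint32_t)bb);          /* bytes 0..3 */
    __m128i hi = _mm_cvtsi32_si128((int)(uint32_t)(bb >> 32));  /* bytes 4..7 */
    lo = _mm_unpacklo_epi8(lo, lo);
    lo = _mm_unpacklo_epi8(lo, lo);
    hi = _mm_unpacklo_epi8(hi, hi);
    hi = _mm_unpacklo_epi8(hi, hi);
    __m128i x0 = _mm_unpacklo_epi8(lo, lo);   /* bits  0..15 */
    __m128i x1 = _mm_unpackhi_epi8(lo, lo);   /* bits 16..31 */
    __m128i x2 = _mm_unpacklo_epi8(hi, hi);   /* bits 32..47 */
    __m128i x3 = _mm_unpackhi_epi8(hi, hi);   /* bits 48..63 */

    /* pandn + pcmpeqb: byte becomes 0xFF where the bit is set, 0x00 otherwise */
    x0 = _mm_cmpeq_epi8(_mm_andnot_si128(x0, sel), zero);
    x1 = _mm_cmpeq_epi8(_mm_andnot_si128(x1, sel), zero);
    x2 = _mm_cmpeq_epi8(_mm_andnot_si128(x2, sel), zero);
    x3 = _mm_cmpeq_epi8(_mm_andnot_si128(x3, sel), zero);

    /* pand with the weights, accumulate with saturating byte adds (paddusb) */
    const __m128i *w = (const __m128i *)weights;
    __m128i sum = _mm_and_si128(x0, _mm_loadu_si128(w + 0));
    sum = _mm_adds_epu8(sum, _mm_and_si128(x1, _mm_loadu_si128(w + 1)));
    sum = _mm_adds_epu8(sum, _mm_and_si128(x2, _mm_loadu_si128(w + 2)));
    sum = _mm_adds_epu8(sum, _mm_and_si128(x3, _mm_loadu_si128(w + 3)));

    /* psadbw: two 8-byte partial sums, then a final scalar add */
    __m128i sad = _mm_sad_epu8(sum, zero);
    return (unsigned)(_mm_extract_epi16(sad, 0) + _mm_extract_epi16(sad, 4));
}

int main(void)
{
    uint8_t w[64];
    for (int i = 0; i < 64; ++i)
        w[i] = (uint8_t)i;                        /* toy weights: w[i] = i */
    uint64_t bb = 0x8100000000000081ULL;          /* bits 0, 7, 56, 63 set */
    printf("%u\n", dot_product_64(bb, w));        /* prints 0+7+56+63 = 126 */
    return 0;
}

A compiler will not necessarily schedule this as tightly as the hand-tuned
assembly above, so treat it as a reference for checking the weight layout and
the expected result rather than as a drop-in replacement.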