Author: Anthony Cozzie
Date: 06:32:22 07/21/04
On July 21, 2004 at 05:37:22, Gerd Isenberg wrote:

>On July 21, 2004 at 02:03:46, Gerd Isenberg wrote:
>
>>On July 20, 2004 at 15:57:36, Anthony Cozzie wrote:
>>
>>>Two more tricks:
>>>
>>>First, if I understand the Opteron correctly, it can execute three SSE2
>>>instructions every two cycles _if_ the bundle contains two arithmetic and
>>>one cache instruction. Therefore, I pushed all the loads further up (which
>>>also has the advantage of mitigating cache misses) and reduced the mask to
>>>a single 16-byte constant. This routine will only execute in 64-bit mode
>>>because it requires 10 XMM registers. Also, I used unsigned saturated
>>>addition, which means that in practice the values can exceed 63 (although
>>>you take some risks). There is a big hole at the end of the routine: it
>>>takes 10 cycles to get the result out of the XMM register back into the
>>>integer pipe. Since all the cache accesses are near the front of the
>>>routine, it should be possible for the processor to interleave some
>>>integer code at the end.
>>
>>Ahh, 64 bit mode already:
>>
>  1  movd      xmm0, [bb]
>  1  movd      xmm2, [bb+4]
>  2  mov       rax, weights63
>  3  pxor      xmm5, xmm5
>  3  movdqa    xmm4, [and_constant]
>  4  punpcklbw xmm0, xmm0
>  4  movdqa    xmm6, [rax]        ; prefetch the line
>  5  punpcklbw xmm2, xmm2
>  6  punpcklbw xmm0, xmm0
>  7  punpcklbw xmm2, xmm2
>  8  movdqa    xmm1, xmm0
>  9  movdqa    xmm3, xmm2
> 10  punpcklbw xmm0, xmm0
> 11  punpcklbw xmm2, xmm2
> 12  punpckhbw xmm1, xmm1
> 13  punpckhbw xmm3, xmm3
> 14  pandn     xmm0, xmm4         ; select bit out of each byte
> 15  pandn     xmm1, xmm4
> 16  pandn     xmm2, xmm4
> 17  pandn     xmm3, xmm4
> 18  pcmpeqb   xmm0, xmm5         ; convert to 0 | 0xFF
> 19  pcmpeqb   xmm1, xmm5
> 20  pcmpeqb   xmm2, xmm5
> 21  pcmpeqb   xmm3, xmm5
> 22  pand      xmm0, xmm6         ; and with weights
> 23  pand      xmm1, [rax+16]
> 24  pand      xmm2, [rax+32]
> 25  pand      xmm3, [rax+48]
> 26  paddusb   xmm0, xmm1         ; using paddusb allows us to take risks
> 27  paddusb   xmm0, xmm2
> 28  paddusb   xmm0, xmm3
> 30  psadbw    xmm0, xmm5         ; horizontal add of 2 * 8 bytes (xmm5 is still zero)
> 34  pextrw    edx, xmm0, 4       ; extract both intermediate sums to gp
> 35  pextrw    eax, xmm0, 0
> 40  add       eax, edx           ; final add in gp
>
>>>My guess is that in a tight loop this would execute in 30 cycles/iteration.
>>>Of course, if you have any improvements on my improvements I would love to
>>>hear them :) I am almost certainly going to use this code once I finish
>>>going parallel.
>>
>>Yes.
>>
>>My guess is that the pre-moves don't pay off, using the four additional
>>registers xmm6..9 only one time. OK, if you pass a second weight pointer,
>>one may already paddb them here for some dynamic weights, with saturation
>>too. So
>
>If prefetching at all, I suggest only one early
>
>  movdqa xmm6, [rax+48]   ; or [rax], see above
>
>to prefetch the weight cacheline, and later to use
>
>  pand xmm0, [rax]        ; and with weights
>  pand xmm1, [rax+16]
>  pand xmm2, [rax+32]
>  pand xmm3, xmm6
>
>Three instructions and registers less, and about 12+4 bytes saved, since
>xmm8 and xmm9 require additional prefix bytes. And it is still 32-bit
>compatible.
>Btw.
>using 32-bit registers at the end (still the default for int) may save
>another three bytes of opcode. 32-bit ops implicitly zero-extend to 64-bit.
>
>  pextrw edx, xmm0, 4  ; extract both intermediate sums to gp
>  pextrw eax, xmm0, 0
>  add    eax, edx      ; final add in gp

Makes good sense. We don't save anything by prefetching the first 3 anyway.
When I get home I'll code this up and see how long it takes in a loop.

anthony
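[For anyone who would rather "code this up" with SSE2 intrinsics than raw assembly, here is a sketch of the same byte-lane trick. This is only an illustration, not Anthony's actual routine: the names are mine, it processes the board in four 16-byte groups rather than fully unrolling, and it accumulates the two psadbw sums with plain integer adds instead of paddusb, so there is no saturation risk:]

```c
#include <emmintrin.h>
#include <stdint.h>

/* Illustrative SSE2 version: sum w[i] over every set bit i of bb.
   Each pair of bitboard bytes is broadcast to 16 byte lanes, the per-lane
   bit is selected with andnot + cmpeq (0xFF where set), ANDed with the
   weights, and horizontally summed with psadbw. */
static unsigned weighted_sum_sse2(uint64_t bb, const uint8_t w[64])
{
    const __m128i zero = _mm_setzero_si128();
    /* bit-select constant: bytes 01,02,04,...,80 repeated twice */
    const __m128i mask = _mm_set1_epi64x((long long)0x8040201008040201ULL);
    unsigned total = 0;

    for (int g = 0; g < 4; g++) {
        /* bytes 2g and 2g+1 of the bitboard */
        __m128i v = _mm_cvtsi32_si128((int)((bb >> (16 * g)) & 0xFFFF));
        v = _mm_unpacklo_epi8(v, v);   /* b0 b0 b1 b1 ...          */
        v = _mm_unpacklo_epi8(v, v);   /* b0 x4, b1 x4             */
        v = _mm_unpacklo_epi8(v, v);   /* b0 x8, b1 x8             */
        /* ~v & mask is 0 exactly in lanes whose bit is set */
        v = _mm_cmpeq_epi8(_mm_andnot_si128(v, mask), zero);
        /* keep the weights of the set bits */
        v = _mm_and_si128(v, _mm_loadu_si128((const __m128i *)(w + 16 * g)));
        /* psadbw: two 16-bit partial sums in word lanes 0 and 4 */
        __m128i s = _mm_sad_epu8(v, zero);
        total += (unsigned)_mm_extract_epi16(s, 0)
               + (unsigned)_mm_extract_epi16(s, 4);
    }
    return total;
}
```

A compiler will not schedule this as aggressively as the hand-written listing above, but it is a convenient harness for timing the idea in a loop before committing to assembly.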
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.