Author: Gerd Isenberg
Date: 23:03:46 07/20/04
On July 20, 2004 at 15:57:36, Anthony Cozzie wrote:

>Two more tricks:
>
>First, if I understand the opteron correctly it can execute 3 SSE2 instructions
>every 2 cycles _if_ there are 2 arithmetic and 1 cache instruction in this
>bundle. Therefore, I pushed all the loads farther up (this also has the
>advantage of mitigating cache misses), and reduced the mask to a single 16 byte
>constant. This routine will only execute in 64-bit mode because it requires 10
>XMM registers. Also, I used unsigned saturated addition, which means that in
>practice the values can exceed 63 (although you take some risks). There is a
>big hole at the end of the routine: it takes 10 cycles to get the results out of
>the XMM register back into the integer pipe. Since all the cache accesses are
>near the front of the routine, it should be possible for the processor to
>interleave some integer code at the end.
>

Ahh, 64 bit mode already:

>   1  movd      xmm0, [bb]
>   1  movd      xmm2, [bb+4]
>   2  movq      rax, weights63
>   3  pxor      xmm5, xmm5
>   3  movdqa    xmm4, [and_constant]
>   4  punpcklbw xmm0, xmm0
>   4  movdqa    xmm6, [rax]
>   5  punpcklbw xmm2, xmm2
>   6  punpcklbw xmm0, xmm0
>   6  movdqa    xmm7, [rax+16]
>   7  punpcklbw xmm2, xmm2
>   8  movdqa    xmm8, [rax+32]
>   8  movdqa    xmm1, xmm0
>   9  movdqa    xmm3, xmm2
>  10  punpcklbw xmm0, xmm0
>  10  movdqa    xmm9, [rax+48]
>  11  punpcklbw xmm2, xmm2
>  12  punpckhbw xmm1, xmm1
>  13  punpckhbw xmm3, xmm3
>  14  pandn     xmm0, xmm4    ; select bit out of each byte
>  15  pandn     xmm1, xmm4
>  16  pandn     xmm2, xmm4
>  17  pandn     xmm3, xmm4
>  18  pcmpeqb   xmm0, xmm5    ; convert to 0 | 0xFF
>  19  pcmpeqb   xmm1, xmm5
>  20  pcmpeqb   xmm2, xmm5
>  21  pcmpeqb   xmm3, xmm5
>  22  pand      xmm0, xmm6    ; and with weights
>  23  pand      xmm1, xmm7
>  24  pand      xmm2, xmm8
>  25  pand      xmm3, xmm9
>  26  paddusb   xmm0, xmm1    ; using paddusb allows us to take risks
>  27  paddusb   xmm0, xmm2
>  28  paddusb   xmm0, xmm3

Nice trick with Packed Add Unsigned with Saturation Bytes!
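To make the saturation point concrete, here is a minimal scalar C sketch of what the quoted steps appear to compute, under some assumptions: the contents of and_constant are not shown in the post, so the sketch assumes bit i of the bitboard selects weights[i], and that xmm0..xmm3 end up covering bits 0..15, 16..31, 32..47 and 48..63. The names sat_add_u8 and dot_product_model are made up for illustration. The final horizontal add is modeled as a plain sum of the 16 lane bytes, which is what psadbw yields against an all-zero operand.

```c
#include <assert.h>
#include <stdint.h>

/* Per-lane behavior of paddusb: unsigned byte add that clamps at 255. */
uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    unsigned s = (unsigned)a + (unsigned)b;
    return (uint8_t)(s > 255u ? 255u : s);
}

/* Scalar model (assumed mapping: bit i of bb selects weights[i]).
   Byte lane k collects bits k, k+16, k+32, k+48 with saturation,
   then the 16 lane sums are added up without saturation. */
unsigned dot_product_model(uint64_t bb, const uint8_t weights[64]) {
    assert(weights != 0);
    unsigned total = 0;
    for (int lane = 0; lane < 16; ++lane) {
        uint8_t acc = 0;
        for (int q = 0; q < 4; ++q) {
            int i = 16 * q + lane;
            /* pandn/pcmpeqb/pand: keep the weight if the bit is set, else 0 */
            uint8_t w = ((bb >> i) & 1u) ? weights[i] : 0;
            acc = sat_add_u8(acc, w);
        }
        total += acc;
    }
    return total;
}
```

With weights up to 63, four of them add to at most 252, so no lane can overflow. With larger weights a lane clamps at 255 whenever its four selected weights add up to more, so the result can come out below the true sum; that seems to be the risk meant above.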
>  30  psadbw    xmm0, xmm7    ; horizontal add 2 * 8 byte
>  34  pextrw    rdx, xmm0, 4  ; extract both intermediate sums to gp
>  35  pextrw    rax, xmm0, 0
>  40  add       rax, rdx      ; final add in gp
>
>My guess is that in a tight loop this would execute in 30 cycles/iteration. Of
>course if you have any improvements on my improvements I would love to hear them
>:) I am almost certainly going to use this code once I finish going parallel.

Yes. My guess is that the pre-moves don't pay off, since the four additional registers xmm6..9 are used only once. OK, if you pass a second weight pointer, one may paddb them already here for some dynamic weights, with saturation too. So

>
>My statement about the integer pipes is pretty simple. The opteron can execute
>1 XMM instruction per clock -> 128 bits of math. However, it can execute 3
>integer instructions per clock -> 192 bits of math. In a future core that can
>handle 4 vector instructions at once, the balance shifts, of course :)
>
>anthony

Yes, but doing SWAgpR, some additional instructions are needed as well.

Gerd
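A small C sketch of the extra work, assuming "SWAgpR" here means SIMD within a general-purpose register (that reading, and the name swar_add_u8x8, are assumptions). It shows the usual way to add eight packed bytes in one 64-bit register without carries leaking from one byte into the next.

```c
#include <stdint.h>

/* Wrapping add of eight packed bytes held in one 64-bit register.
   A single paddb does this for 16 bytes; in a gp register the carries
   between bytes must be suppressed by hand. */
uint64_t swar_add_u8x8(uint64_t x, uint64_t y) {
    const uint64_t H = 0x8080808080808080ULL;   /* top bit of every byte */
    uint64_t low = (x & ~H) + (y & ~H);         /* add low 7 bits, no carry crosses a byte */
    return low ^ ((x ^ y) & H);                 /* fold the top bits back in, carry-free */
}
```

That is two masks, one add, and three more logical operations for 64 bits, where the XMM version needs one instruction for 128 bits, so the 192 versus 128 bits per clock comparison does not carry over directly.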
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.