Author: Anthony Cozzie
Date: 12:57:36 07/20/04
Two more tricks. First, if I understand the Opteron correctly, it can execute 3 SSE2 instructions every 2 cycles _if_ the bundle contains 2 arithmetic instructions and 1 cache instruction. Therefore I pushed all the loads farther up (this also has the advantage of mitigating cache misses) and reduced the mask to a single 16-byte constant. This routine will only execute in 64-bit mode because it requires 10 XMM registers (xmm0 through xmm9).

Second, I used unsigned saturated addition, which means that in practice the values can exceed 63 (although you take some risks). Each byte lane sums four weights, so weights up to 63 can never overflow (4 * 63 = 252); with larger weights a lane that passes 255 is clamped instead of wrapping around.

There is a big hole at the end of the routine: it takes 10 cycles to get the results out of the XMM register and back into the integer pipe. Since all the cache accesses are near the front of the routine, it should be possible for the processor to interleave some integer code at the end.

    cycle  instruction
      1    movd      xmm0, [bb]            ; low 32 bits of the bitboard
      1    movd      xmm2, [bb+4]          ; high 32 bits of the bitboard
      2    movq      rax, weights63        ; address of the weights
      3    pxor      xmm5, xmm5            ; xmm5 = 0
      3    movdqa    xmm4, [and_constant]  ; single 16-byte bit mask
      4    punpcklbw xmm0, xmm0
      4    movdqa    xmm6, [rax]
      5    punpcklbw xmm2, xmm2
      6    punpcklbw xmm0, xmm0
      6    movdqa    xmm7, [rax+16]
      7    punpcklbw xmm2, xmm2
      8    movdqa    xmm8, [rax+32]
      8    movdqa    xmm1, xmm0
      9    movdqa    xmm3, xmm2
     10    punpcklbw xmm0, xmm0
     10    movdqa    xmm9, [rax+48]
     11    punpcklbw xmm2, xmm2
     12    punpckhbw xmm1, xmm1
     13    punpckhbw xmm3, xmm3
     14    pandn     xmm0, xmm4            ; select bit out of each byte
     15    pandn     xmm1, xmm4
     16    pandn     xmm2, xmm4
     17    pandn     xmm3, xmm4
     18    pcmpeqb   xmm0, xmm5            ; convert to 0 | 0xFF
     19    pcmpeqb   xmm1, xmm5
     20    pcmpeqb   xmm2, xmm5
     21    pcmpeqb   xmm3, xmm5
     22    pand      xmm0, xmm6            ; and with weights
     23    pand      xmm1, xmm7
     24    pand      xmm2, xmm8
     25    pand      xmm3, xmm9
     26    paddusb   xmm0, xmm1            ; using paddusb allows us to take risks
     27    paddusb   xmm0, xmm2
     28    paddusb   xmm0, xmm3
     30    psadbw    xmm0, xmm5            ; horizontal add 2 * 8 byte (against zero)
     34    pextrw    rdx, xmm0, 4          ; extract both intermediate sums to gp
     35    pextrw    rax, xmm0, 0
     40    add       rax, rdx              ; final add in gp

My guess is that in a tight loop this would execute in 30 cycles/iteration.
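For anyone who would rather experiment from C than from assembly, here is a rough sketch of the same sequence written with SSE2 intrinsics, next to a plain scalar loop for checking results. It is not the routine from this post, only an approximation of it: the function names dot_product_sse2 and dot_product_scalar are made up for illustration, the weights are assumed to be 64 bytes with weights[i] belonging to bit i of the bitboard, and the loads are unaligned (the listing uses movdqa, which needs 16-byte alignment). It assumes an x86 compiler with SSE2 available, and the compiler will schedule the instructions its own way, so the cycle counts in the listing do not carry over.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: reference result, one bit at a time, no saturation. */
static unsigned dot_product_scalar(uint64_t bb, const uint8_t weights[64])
{
    unsigned sum = 0;
    for (int i = 0; i < 64; ++i)
        if ((bb >> i) & 1)
            sum += weights[i];
    return sum;
}

/* Hypothetical helper: intrinsics sketch of the sequence in the listing. */
static unsigned dot_product_sse2(uint64_t bb, const uint8_t weights[64])
{
    /* Byte k of each 8-byte half holds 1 << k (the role of and_constant). */
    const __m128i mask = _mm_set1_epi64x((long long)0x8040201008040201ULL);
    const __m128i zero = _mm_setzero_si128();

    /* movd: low and high 32 bits of the bitboard. */
    __m128i lo = _mm_cvtsi32_si128((int)(uint32_t)bb);
    __m128i hi = _mm_cvtsi32_si128((int)(uint32_t)(bb >> 32));

    /* Two rounds of punpcklbw: every source byte is now repeated 4 times. */
    lo = _mm_unpacklo_epi8(lo, lo);
    lo = _mm_unpacklo_epi8(lo, lo);
    hi = _mm_unpacklo_epi8(hi, hi);
    hi = _mm_unpacklo_epi8(hi, hi);

    /* Third round: every source byte repeated 8 times, 2 bytes per register. */
    __m128i x0 = _mm_unpacklo_epi8(lo, lo);  /* bits  0..15 */
    __m128i x1 = _mm_unpackhi_epi8(lo, lo);  /* bits 16..31 */
    __m128i x2 = _mm_unpacklo_epi8(hi, hi);  /* bits 32..47 */
    __m128i x3 = _mm_unpackhi_epi8(hi, hi);  /* bits 48..63 */

    /* pandn + pcmpeqb: (~x & mask) == 0 exactly where the bit is set,
       so each byte becomes 0xFF for a set bit and 0 otherwise. */
    x0 = _mm_cmpeq_epi8(_mm_andnot_si128(x0, mask), zero);
    x1 = _mm_cmpeq_epi8(_mm_andnot_si128(x1, mask), zero);
    x2 = _mm_cmpeq_epi8(_mm_andnot_si128(x2, mask), zero);
    x3 = _mm_cmpeq_epi8(_mm_andnot_si128(x3, mask), zero);

    /* pand: keep the weight where the bit is set (unaligned loads here). */
    x0 = _mm_and_si128(x0, _mm_loadu_si128((const __m128i *)(weights)));
    x1 = _mm_and_si128(x1, _mm_loadu_si128((const __m128i *)(weights + 16)));
    x2 = _mm_and_si128(x2, _mm_loadu_si128((const __m128i *)(weights + 32)));
    x3 = _mm_and_si128(x3, _mm_loadu_si128((const __m128i *)(weights + 48)));

    /* paddusb: unsigned saturating byte adds, four weights per lane. */
    x0 = _mm_adds_epu8(x0, x1);
    x0 = _mm_adds_epu8(x0, x2);
    x0 = _mm_adds_epu8(x0, x3);

    /* psadbw against zero: two 8-byte horizontal sums in words 0 and 4. */
    x0 = _mm_sad_epu8(x0, zero);

    /* pextrw + add: combine both partial sums in a general register. */
    return (unsigned)_mm_extract_epi16(x0, 0) +
           (unsigned)_mm_extract_epi16(x0, 4);
}
```

As long as no byte lane exceeds 255 (for example, all weights at most 63) the two functions should agree for every bitboard. Once a lane saturates, the SSE2 version returns less than the scalar loop, which is the risk mentioned above.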
Of course, if you have any improvements on my improvements I would love to hear them :) I am almost certainly going to use this code once I finish going parallel.

My statement about the integer pipes is pretty simple. The Opteron can execute 1 XMM instruction per clock -> 128 bits of math. However, it can execute 3 integer instructions per clock -> 192 bits of math. In a future core that can handle 4 vector instructions at once, the balance shifts, of course :)

anthony
Last modified: Thu, 15 Apr 21 08:11:13 -0700