Author: Gerd Isenberg
Date: 02:37:22 07/21/04
On July 21, 2004 at 02:03:46, Gerd Isenberg wrote:
>On July 20, 2004 at 15:57:36, Anthony Cozzie wrote:
>
>>Two more tricks:
>>
>>First, if I understand the Opteron correctly, it can execute 3 SSE2 instructions
>>every 2 cycles _if_ there are 2 arithmetic and 1 cache instruction in the
>>bundle. Therefore, I pushed all the loads farther up (this also has the
>>advantage of mitigating cache misses) and reduced the mask to a single 16-byte
>>constant. This routine will only execute in 64-bit mode because it requires 10
>>XMM registers. Also, I used unsigned saturated addition, which means that in
>>practice the values can exceed 63 (although you take some risks). There is a
>>big hole at the end of the routine: it takes 10 cycles to get the results out of
>>the XMM register back into the integer pipe. Since all the cache accesses are
>>near the front of the routine, it should be possible for the processor to
>>interleave some integer code at the end.
>>
>
>Ahh, 64 bit mode already:
>
>
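; (the leading numbers appear to be estimated issue cycles, not line numbers)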
1 movd xmm0, [bb]
1 movd xmm2, [bb+4]
2 mov rax, weights63
3 pxor xmm5, xmm5
3 movdqa xmm4, [and_constant]
4 punpcklbw xmm0, xmm0
4 movdqa xmm6, [rax] ; prefetch the line
5 punpcklbw xmm2, xmm2
6 punpcklbw xmm0, xmm0
7 punpcklbw xmm2, xmm2
8 movdqa xmm1, xmm0
9 movdqa xmm3, xmm2
10 punpcklbw xmm0, xmm0
11 punpcklbw xmm2, xmm2
12 punpckhbw xmm1, xmm1
13 punpckhbw xmm3, xmm3
14 pandn xmm0, xmm4 ; select bit out of each byte
15 pandn xmm1, xmm4
16 pandn xmm2, xmm4
17 pandn xmm3, xmm4
18 pcmpeqb xmm0, xmm5 ; convert to 0 | 0xFF
19 pcmpeqb xmm1, xmm5
20 pcmpeqb xmm2, xmm5
21 pcmpeqb xmm3, xmm5
22 pand xmm0, xmm6 ; and with weights
23 pand xmm1, [rax+16]
24 pand xmm2, [rax+32]
25 pand xmm3, [rax+48]
26 paddusb xmm0, xmm1 ; using paddusb allows us to take risks
27 paddusb xmm0, xmm2
28 paddusb xmm0, xmm3
30 psadbw xmm0, xmm7 ; horizontal add of 2 * 8 bytes (xmm7 assumed zeroed)
34 pextrw edx, xmm0, 4 ; extract both intermediate sums to gp
35 pextrw eax, xmm0, 0
40 add eax, edx ; final add in gp
>>
>>My guess is that in a tight loop this would execute in 30 cycles/iteration. Of
>>course if you have any improvements on my improvements I would love to hear them
>>:) I am almost certainly going to use this code once I finish going parallel.
>
>Yes.
>
>My guess is that the pre-moves don't pay off, since the four additional
>registers xmm6..9 are each used only once. OK, if you pass a second weight
>pointer, one may already paddb them here for some dynamic weights, with
>saturation too. So
>
If prefetching at all, I suggest only one early
movdqa xmm6, [rax+48] ; or [rax], see above
to prefetch the weight cache line, and later using
pand xmm0, [rax]    ; and with weights
pand xmm1, [rax+16]
pand xmm2, [rax+32]
pand xmm3, xmm6
That is three fewer instructions and registers, saving about 12+4 opcode bytes,
since xmm8 and xmm9 require additional prefix bytes. And it is still 32-bit
compatible.
Btw, finally using 32-bit registers (still the default for int) may save another
three opcode bytes; 32-bit ops implicitly zero-extend to 64-bit.
pextrw edx, xmm0, 4 ; extract both intermediate sums to gp
pextrw eax, xmm0, 0
add eax, edx ; final add in gp
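For readers who find intrinsics easier to follow than the raw opcodes, here is a
rough C/SSE2 sketch of the same scheme (expand each bit into a 0x00/0xFF byte,
pand with the byte weights, paddusb, psadbw). The function name, the weights[]
layout and the bit ordering are my own assumptions for illustration, not
Anthony's exact routine:

#include <emmintrin.h>
#include <stdint.h>

/* Rough C/SSE2 sketch, assumed ordering: weights[i] belongs to bit i of bb.
   Returns the sum of the weights over all set bits.                        */
static unsigned weighted_popcount_sse2(uint64_t bb, const uint8_t weights[64])
{
    /* byte j of each qword holds the single bit 1 << j (the pandn mask) */
    const __m128i sel = _mm_set_epi8(
        (char)0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01,
        (char)0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01);
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;

    for (int q = 0; q < 4; ++q) {
        /* replicate two consecutive bytes of bb across one qword each,
           like the punpcklbw cascade in the assembly                    */
        uint8_t b0 = (uint8_t)(bb >> (16 * q));
        uint8_t b1 = (uint8_t)(bb >> (16 * q + 8));
        __m128i rep = _mm_set_epi8(b1, b1, b1, b1, b1, b1, b1, b1,
                                   b0, b0, b0, b0, b0, b0, b0, b0);
        /* pandn + pcmpeqb: a byte becomes 0xFF iff its bit is set in bb */
        __m128i mask = _mm_cmpeq_epi8(_mm_andnot_si128(rep, sel), zero);
        /* pand with the 16 weights of this quarter, then paddusb        */
        __m128i w = _mm_loadu_si128((const __m128i *)(weights + 16 * q));
        acc = _mm_adds_epu8(acc, _mm_and_si128(mask, w));
    }
    /* psadbw against zero: one partial byte sum per qword; add them in
       the integer registers, as in the pextrw/add tail above            */
    __m128i sad = _mm_sad_epu8(acc, zero);
    return (unsigned)(_mm_cvtsi128_si32(sad)
                    + _mm_cvtsi128_si32(_mm_srli_si128(sad, 8)));
}

With the weight table aligned to the cache line one could use _mm_load_si128
instead of the unaligned load; the compiler then schedules the loads itself,
which is what the hand-written version above controls explicitly.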
Gerd
<snip>