Author: Gerd Isenberg
Date: 05:38:48 07/22/04
<snip>
>>  1  movd      xmm0, [bb]
>>  1  movd      xmm2, [bb+4]
>>  2  mov       rax, weights63
>>  3  pxor      xmm5, xmm5
>>  3  movdqa    xmm4, [and_constant]
>>  4  punpcklbw xmm0, xmm0
>>  4  movdqa    xmm6, [rax]       ; prefetch the line
>>  5  punpcklbw xmm2, xmm2
>>  6  punpcklbw xmm0, xmm0
>>  7  punpcklbw xmm2, xmm2
>>  8  movdqa    xmm1, xmm0
>>  9  movdqa    xmm3, xmm2
>> 10  punpcklbw xmm0, xmm0
>> 11  punpcklbw xmm2, xmm2
>> 12  punpckhbw xmm1, xmm1
>> 13  punpckhbw xmm3, xmm3
>> 14  pandn     xmm0, xmm4        ; select bit out of each byte
>> 15  pandn     xmm1, xmm4
>> 16  pandn     xmm2, xmm4
>> 17  pandn     xmm3, xmm4
>> 18  pcmpeqb   xmm0, xmm5        ; convert to 0 | 0xFF
>> 19  pcmpeqb   xmm1, xmm5
>> 20  pcmpeqb   xmm2, xmm5
>> 21  pcmpeqb   xmm3, xmm5
>> 22  pand      xmm0, xmm6        ; and with weights
>> 23  pand      xmm1, [eax+16]
>> 24  pand      xmm2, [eax+32]
>> 25  pand      xmm3, [eax+48]
>> 26  paddusb   xmm0, xmm1        ; using paddusb allows us to take risks
>> 27  paddusb   xmm0, xmm2
>> 28  paddusb   xmm0, xmm3
>> 30  psadbw    xmm0, xmm7        ; horizontal add 2 * 8 byte
>> 34  pextrw    edx, xmm0, 4      ; extract both intermediate sums to gp
>> 35  pextrw    eax, xmm0, 0
>> 40  add       eax, edx          ; final add in gp
>>>>
>>>>My guess is that in a tight loop this would execute in 30 cycles/iteration. Of
>>>>course if you have any improvements on my improvements I would love to hear them
>>>>:) I am almost certainly going to use this code once I finish going parallel.
>>>
>>>Yes.
>>>
>>>My guess is that the pre moves don't pay off using four additional registers
>>>xmm6..9 only one time. Ok, if you pass a second weight pointer one may paddb
>>>them already here for some dynamic weights with saturation too. So
>>>
>>
>>If prefetching at all, i suggest only one early
>>
>>  movdqa xmm6, [rax+48]; // or [rax] see above
>>
>>to prefetch the weight cacheline, and later to use
>>
>>  pand xmm0, [rax]     ; and with weights
>>  pand xmm1, [rax+16]
>>  pand xmm2, [rax+32]
>>  pand xmm3, xmm6
>>
>>Three instructions and registers less, about 12+4 bytes due to xmm8, xmm9
>>require additional prefix bytes.
>>And it is still 32-bit compatible.
>>Btw. finally using 32-bit registers (still the default for int) may save another
>>three bytes of opcode. 32-bit ops implicitly zero-extend to 64-bit.
>>
>>  pextrw edx, xmm0, 4  ; extract both intermediate sums to gp
>>  pextrw eax, xmm0, 0
>>  add    eax, edx      ; final add in gp
>>
>>
>
>Makes good sense. We don't save anything by prefetching the first 3 anyway.
>When I get home I'll code this up and see how long it takes in a loop.
>
>anthony

Your original shuffling sequence makes sense if bb on the stack is not 8-byte but only 4-byte aligned.

  movd      xmm0, [bb]
  movd      xmm2, [bb+4]
  punpcklbw xmm0, xmm0
  punpcklbw xmm2, xmm2
  punpcklbw xmm0, xmm0
  punpcklbw xmm2, xmm2
  movdqa    xmm1, xmm0
  movdqa    xmm3, xmm2
  punpcklbw xmm0, xmm0
  punpcklbw xmm2, xmm2
  punpckhbw xmm1, xmm1
  punpckhbw xmm3, xmm3

Here is another, similar shuffling sequence. It is one instruction shorter and therefore slightly faster with appropriate scheduling of independent initialization instructions. It may also be used with the bitboard already passed in xmm0:

  movq      xmm0, [bb]   ; 0x0000000000000000:0xf0e1d2c3b4a59687
  punpcklbw xmm0, xmm0   ; 0xf0f0e1e1d2d2c3c3:0xb4b4a5a596968787
  movdqa    xmm2, xmm0
  punpcklwd xmm0, xmm0   ; 0xb4b4b4b4a5a5a5a5:0x9696969687878787
  punpckhwd xmm2, xmm2   ; 0xf0f0f0f0e1e1e1e1:0xd2d2d2d2c3c3c3c3
  movdqa    xmm1, xmm0
  movdqa    xmm3, xmm2
  punpckldq xmm0, xmm0   ; 0x9696969696969696:0x8787878787878787
  punpckhdq xmm1, xmm1   ; 0xb4b4b4b4b4b4b4b4:0xa5a5a5a5a5a5a5a5
  punpckldq xmm2, xmm2   ; 0xd2d2d2d2d2d2d2d2:0xc3c3c3c3c3c3c3c3
  punpckhdq xmm3, xmm3   ; 0xf0f0f0f0f0f0f0f0:0xe1e1e1e1e1e1e1e1

The Unpack and Interleave instructions have become familiar now ;-)

Anyway, I'm not quite sure whether to take this additional overhead or to stay with the rotated weights.
  movq       xmm0, [bb]   ; 0x0000000000000000:0xf0e1d2c3b4a59687
  punpcklqdq xmm0, xmm0   ; 0xf0e1d2c3b4a59687:0xf0e1d2c3b4a59687
  movdqa     xmm1, xmm0   ; 0xf0e1d2c3b4a59687:0xf0e1d2c3b4a59687
  movdqa     xmm2, xmm0   ; 0xf0e1d2c3b4a59687:0xf0e1d2c3b4a59687
  movdqa     xmm3, xmm0   ; 0xf0e1d2c3b4a59687:0xf0e1d2c3b4a59687

Rotating a square index here and there is not that expensive... I plan to have most weights already precomputed, probably indexed by some (king) squares and side.

Gerd
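
For anyone checking the SSE2 sequences in this thread against known results, a scalar reference model may be handy. The sketch below is plain C, not the SIMD code itself, and it models the byte-unpack variant rather than the rotated-weights one. The function name weighted_popcount is hypothetical. It assumes that square sq is bit sq of bb, that weights[sq] is the byte weight for that square, and that and_constant (not shown in the thread) holds 0x01, 0x02, ..., 0x80 in each group of eight bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not from the thread: scalar model of the
   unpack / pandn / pcmpeqb / pand / paddusb / psadbw sequence.
   Assumptions: square sq is bit sq of bb, weights[sq] is its weight,
   and and_constant is 0x01,0x02,...,0x80 per group of eight bytes. */
unsigned weighted_popcount(uint64_t bb, const uint8_t weights[64])
{
    unsigned lane[16] = {0};   /* the 16 byte lanes accumulated in xmm0 */
    unsigned total = 0;

    assert(weights != 0);
    for (int sq = 0; sq < 64; ++sq) {
        /* punpck*: the bitboard byte holding sq, replicated into this lane */
        uint8_t byte = (uint8_t)(bb >> (sq & ~7));
        /* and_constant: one bit per lane within each group of eight */
        uint8_t bit = (uint8_t)(1u << (sq & 7));
        /* pandn + pcmpeqb with zero: 0xFF if the bit is set, else 0x00 */
        uint8_t mask = ((~byte & bit) == 0) ? 0xFF : 0x00;
        /* pand with the weight, then paddusb: unsigned saturating byte add */
        unsigned sum = lane[sq & 15] + (unsigned)(weights[sq] & mask);
        lane[sq & 15] = sum > 255u ? 255u : sum;
    }
    /* psadbw against zero plus the final gp add: sum all 16 lanes */
    for (int i = 0; i < 16; ++i)
        total += lane[i];
    return total;
}
```

With every weight set to 1 and bb = ~0, this returns 64. With every weight set to 255 each lane saturates at 255, so the result is 16 * 255 = 4080. Weights of at most 63 can never saturate, since four of them add up to at most 252 per lane.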
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.