Author: Gerd Isenberg
Date: 12:04:20 08/09/04
for P3, P4, Athlon32 or AMD64
This SIMD-MMX Dot Product seems slightly faster on my AMD64 box than the
SSE2-one posted recently (in a dumb loop 13.3ns versus 13.7ns, 2.2GHz).
Code size is 190 bytes MMX versus 141 bytes SSE2!
So the "future" for AMD64 is SSE2, with wider or even more SIMD-alus...
Anyway, a small description again:
The SIMD-routine completely unrolls the following loop, where bb is a bit-array and
weight a byte-array, 64 elements each:
unsigned int dotpro = 0;
for (int i=0; i<64; i++)
dotpro += bb[i]*weight[i];
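For checking the SIMD routine, the loop can be written as plain C with the bit-array held in one 64-bit word. This is only a minimal sketch: the name dot_product_ref and the use of uint64_t are assumptions for illustration, and bit i is taken to be (bb >> i) & 1, which matches the byte order the MMX code sees on little-endian x86.

```c
#include <stdint.h>

/* Scalar reference (hypothetical helper): bit i of bb selects weights[i]. */
unsigned dot_product_ref(uint64_t bb, const unsigned char weights[64])
{
    unsigned dotpro = 0;
    for (int i = 0; i < 64; i++)
        dotpro += (unsigned)((bb >> i) & 1) * weights[i];
    return dotpro;
}
```

Unlike the MMX routine, this version never saturates, so it can serve as an oracle for any weights.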
1. A 64-bit word is copied to all eight MMX registers (8*64 = 512 bit),
shuffling the bytes so that each register contains eight copies of one byte of the
bitboard (by the "packed unpack low/high" instructions). The shuffling sequence
with 128-bit XMM-registers was introduced by Anthony Cozzie.
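2. Each of the eight bits of each "expanded byte" is masked with
0x8040201008040201, so that byte j of a register keeps only bit j. The code
uses pandn here, which masks the complement of the expanded bytes.
3. A parallel bytewise compare with zero then leaves each original bit
sign extended to a byte (0 or -1, i.e. unsigned 255) in all eight MMX registers
(512 bit). MMX has no "not equal" compare, so pandn followed by pcmpeqb
against zero does the job of "and, then compare != zero".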
4. The bit[i] * weight[i] multiplication is performed by parallel and.
5. Finally all byte products are added: 2*3 vertical bytewise adds with
unsigned saturation (at 255), two horizontal adds and one final vertical 16-bit add.
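The five steps can be sketched lane by lane in portable C. This is not the MMX routine itself, only an emulation under assumptions: reg[k][j] stands for byte j of register mmK, addus8 is a hypothetical helper modelling one lane of paddusb, and the function name dot_product_steps is made up for the example.

```c
#include <stdint.h>

/* Saturating unsigned byte add, like one lane of paddusb (hypothetical helper). */
static uint8_t addus8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + b;
    return (uint8_t)(s > 255 ? 255 : s);
}

/* Emulates steps 1..5; reg[k][j] models byte j of register mmK. */
unsigned dot_product_steps(uint64_t bb, const unsigned char weights[64])
{
    uint8_t reg[8][8];

    for (int k = 0; k < 8; k++) {
        uint8_t byte_k = (uint8_t)(bb >> (8 * k));
        for (int j = 0; j < 8; j++) {
            uint8_t expanded = byte_k;                        /* step 1: copy of byte k in every lane */
            uint8_t masked = expanded & (uint8_t)(1u << j);   /* step 2: lane j keeps bit j */
            uint8_t ext = masked ? 0xFF : 0x00;               /* step 3: bit -> 0x00 or 0xFF */
            reg[k][j] = ext & weights[8 * k + j];             /* step 4: "and" acts as multiply by 0/1 */
        }
    }

    /* step 5: two chains of three saturating vertical adds,
       then the horizontal adds and the final add of both halves */
    unsigned total = 0;
    for (int j = 0; j < 8; j++) {
        uint8_t lo = addus8(addus8(addus8(reg[0][j], reg[1][j]), reg[2][j]), reg[3][j]);
        uint8_t hi = addus8(addus8(addus8(reg[4][j], reg[5][j]), reg[6][j]), reg[7][j]);
        total += lo;
        total += hi;
    }
    return total;
}
```

With all weights at 63 and all bits set, each lane sums to 252 and the result is exact (4032). With all weights at 100 every lane saturates at 255 and the result is 4080 instead of 6400, which is where the limit on the weights comes from.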
MSC inline assembly:
typedef unsigned __int64 BitBoard;
#define MMXALIGN __declspec(align(8))
/*
dotProduct = bit[64] * byte[64]
weights should on average not exceed 63
within this implementation, since four
weights are added with byte saturation!
*/
int dotProductMMX(BitBoard bb, unsigned char weights[])
{
static const BitBoard MMXALIGN consts[2] = {0x8040201008040201, 0};
__asm
{
movq mm0, [bb]
lea eax, [consts]
movq mm4, mm0
punpcklbw mm0, mm0
punpckhbw mm4, mm4
lea edx, [eax+8] ; &null
movq mm2, mm0
movq mm6, mm4
punpcklwd mm0, mm0
punpckhwd mm2, mm2
punpcklwd mm4, mm4
punpckhwd mm6, mm6
movq mm1, mm0
movq mm3, mm2
movq mm5, mm4
movq mm7, mm6
punpckldq mm0, mm0
punpckhdq mm1, mm1
punpckldq mm2, mm2
punpckhdq mm3, mm3
punpckldq mm4, mm4
punpckhdq mm5, mm5
punpckldq mm6, mm6
punpckhdq mm7, mm7 ; each byte copied eight times in one register
pandn mm0, [eax] ; ~bytes & bitmask: a set bit leaves zero
pandn mm1, [eax]
pandn mm2, [eax]
pandn mm3, [eax]
pandn mm4, [eax]
pandn mm5, [eax]
pandn mm6, [eax]
pandn mm7, [eax]
mov eax, [weights]
pcmpeqb mm0, [edx] ; == zero (bit was set) ==> 0xff
pcmpeqb mm1, [edx]
pcmpeqb mm2, [edx]
pcmpeqb mm3, [edx]
pcmpeqb mm4, [edx]
pcmpeqb mm5, [edx]
pcmpeqb mm6, [edx]
pcmpeqb mm7, [edx]
; now the bit->byte sign extension is done 64->512
pand mm0, [eax+0*8] ; * weight[i]
pand mm1, [eax+1*8]
pand mm2, [eax+2*8]
pand mm3, [eax+3*8]
pand mm4, [eax+4*8]
pand mm5, [eax+5*8]
pand mm6, [eax+6*8]
pand mm7, [eax+7*8]
paddusb mm0, mm1 ; add all bytes
paddusb mm4, mm5
paddusb mm0, mm2
paddusb mm4, mm6
paddusb mm0, mm3
paddusb mm4, mm7
psadbw mm0, [edx] ; horizontal add 8 byte to one word
psadbw mm4, [edx] ; horizontal add 8 byte to one word
paddw mm0, mm4
movd eax, mm0 ; result in eax; no emms here, execute it before any x87 FP code
}
}
------------------------------------------------------------------------------
Out of curiosity ...
Popcounting with (rotated) one-weights takes "only" about twice the time the
mmx-popcount from AMD's Athlon optimization manual takes. That is, of course,
with the pand with "one" saved by subtracting the {0xff|0} bytes from zero instead.
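One reading of this remark, sketched in portable C: since each sign-extended byte is 0 or -1, subtracting it from a zeroed counter adds 0 or 1, so no "and" with one-weights is needed. The function name popcount_by_sub and the acc array (standing for one register of eight byte counters) are assumptions for illustration, not code from the manual.

```c
#include <stdint.h>

/* Popcount by subtracting 0x00/0xFF bytes from zero (hypothetical sketch). */
unsigned popcount_by_sub(uint64_t bb)
{
    uint8_t acc[8] = {0};              /* eight byte counters, starting at zero */

    for (int k = 0; k < 8; k++) {
        for (int j = 0; j < 8; j++) {
            uint8_t ext = ((bb >> (8 * k + j)) & 1) ? 0xFF : 0x00;
            acc[j] = (uint8_t)(acc[j] - ext);   /* x - (-1) == x + 1, like psubb */
        }
    }

    unsigned total = 0;                /* horizontal add of the eight counters */
    for (int j = 0; j < 8; j++)
        total += acc[j];
    return total;
}
```

Each counter receives at most eight increments, so the bytes cannot overflow.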
Another idea with the (rotated) 64->512 bit->byte sign extension is some kind of
parallel bitscan: perform a parallel "and" with a vector containing all
bitindex+1 values (1..64, to distinguish them from zero). A later parallel add of -1
results in proper bitindices as well as a -1 byte for each bit that is not set.
Unfortunately, to traverse the "scanned" bit results one has to look
at all 64 bytes and skip the -1 (0xff) bytes of the formerly associated zero
bits. There seems to be no cheap way to "pack" only the valid indices into one stream.
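The bitscan idea and the traversal problem can be sketched in scalar C. Both function names and the output layout (one signed byte per bit) are assumptions made for the example. The second function is the branchy "skip the -1 bytes" loop that a real SIMD version would like to avoid.

```c
#include <stdint.h>

/* out[i] becomes i if bit i is set, otherwise -1 (hypothetical sketch). */
void bitscan_parallel(uint64_t bb, int8_t out[64])
{
    for (int i = 0; i < 64; i++) {
        uint8_t ext = ((bb >> i) & 1) ? 0xFF : 0x00;   /* bit -> 0x00 or 0xFF */
        uint8_t idx1 = ext & (uint8_t)(i + 1);         /* "and" with bitindex+1 */
        out[i] = (int8_t)(idx1 - 1);                   /* add -1: index, or -1 */
    }
}

/* Traversal: all 64 bytes must be visited to pack the valid indices. */
int collect_indices(const int8_t scanned[64], int indices[64])
{
    int n = 0;
    for (int i = 0; i < 64; i++)
        if (scanned[i] >= 0)
            indices[n++] = scanned[i];
    return n;
}
```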
Anyway, I am interested in a branchless algorithm to sort 8/16 bytes inside one
SIMD-register (using one/two temporary registers), probably using PMAXUB, PMINUB
instructions and/or performing some 0x00ff 0xff00 word-masks ;-)
Cheers,
Gerd