Computer Chess Club Archives



Subject: MMX Dot Product bit[64] * byte[64]

Author: Gerd Isenberg

Date: 12:04:20 08/09/04


for P3, P4, Athlon32 or AMD64

This SIMD-MMX dot product seems slightly faster on my AMD64 box than the
SSE2 one posted recently (in a dumb benchmark loop, 13.3ns versus 13.7ns at
2.2GHz). Code size is 190 bytes MMX versus 141 bytes SSE2!
So the "future" for AMD64 is SSE2, with wider or even more SIMD ALUs...

Anyway, a small description again:

The SIMD routine completely unrolls the following loop, where bb is a
bit-array and weight a byte-array, 64 elements each:

  unsigned int dotpro = 0;
  for (int i = 0; i < 64; i++)
    dotpro += ((bb >> i) & 1) * weight[i]; // bb[i] == bit i of the bitboard


1. The 64-bit word is copied to all eight MMX registers (8*64 = 512 bit),
shuffling the bytes so that each register contains eight copies of one byte of
the bitboard (by the "packed unpack low/high" instructions). The shuffling
sequence with 128-bit XMM registers was introduced by Anthony Cozzie.

2. Each of the eight bits of each "expanded byte" is masked by anding with
0x8040201008040201, so that byte lane i keeps only bit i (the code actually
uses pandn, the complemented and, which pairs with the compare against zero in
step 3).

3. A parallel bytewise compare (!= zero after the masking) leaves each
original bit sign-extended to a byte (0 or -1, unsigned 255) in all eight MMX
registers (512 bit) - a scalar sketch follows after this list.

4. The bit[i] * weight[i] multiplication is performed by a parallel and with
the weights.

5. Finally all byte products are added: 2*3 vertical bytewise adds with
saturation (at 255), two horizontal psadbw adds and one final vertical 16-bit
add.
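
Steps 1-3 can be modelled in scalar C for a single byte of the bitboard (my
sketch, not part of the original routine; the multiplication by
0x0101010101010101 stands in for the unpack sequence):

    #include <stdint.h>

    /* expand the 8 bits of b to eight 0x00/0xff bytes */
    uint64_t expandBitsToBytes(uint8_t b) {
        uint64_t x = b * 0x0101010101010101ULL; /* eight copies of b    */
        x &= 0x8040201008040201ULL;             /* lane i keeps bit i   */
        uint64_t r = 0;
        for (int i = 0; i < 8; i++)             /* bytewise !=0 -> 0xff */
            if ((x >> (8*i)) & 0xff)
                r |= 0xffULL << (8*i);
        return r;
    }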


MSC inline assembly:

typedef unsigned __int64 BitBoard;
#define MMXALIGN __declspec(align(8))

/*
     dotProduct = bit[64] * byte[64]
     weights should on average not exceed 63 within this
     implementation: four byte products share one saturating
     paddusb accumulator lane, so each group of four must
     stay below 256!
*/

int dotProductMMX(BitBoard bb, unsigned char weights[])
{
    static const BitBoard MMXALIGN consts[2] = {0x8040201008040201, 0};
    __asm
    {
        movq	mm0, [bb]
        lea	eax, [consts]
        movq	mm4, mm0
        punpcklbw mm0, mm0
        punpckhbw mm4, mm4
        lea	edx, [eax+8]  ; edx -> zero qword (consts[1])
        movq	mm2, mm0
        movq	mm6, mm4
        punpcklwd mm0, mm0
        punpckhwd mm2, mm2
        punpcklwd mm4, mm4
        punpckhwd mm6, mm6
        movq	mm1, mm0
        movq	mm3, mm2
        movq	mm5, mm4
        movq	mm7, mm6
        punpckldq mm0, mm0
        punpckhdq mm1, mm1
        punpckldq mm2, mm2
        punpckhdq mm3, mm3
        punpckldq mm4, mm4
        punpckhdq mm5, mm5
        punpckldq mm6, mm6
        punpckhdq mm7, mm7      ; each byte copied eight times in one register
        pandn	mm0, [eax]      ; (~bytes) and the bit mask
        pandn	mm1, [eax]
        pandn	mm2, [eax]
        pandn	mm3, [eax]
        pandn	mm4, [eax]
        pandn	mm5, [eax]
        pandn	mm6, [eax]
        pandn	mm7, [eax]
        mov	eax, [weights]
        pcmpeqb	mm0, [edx]      ; == zero (orig. bit set) ==> 0xff
        pcmpeqb	mm1, [edx]
        pcmpeqb	mm2, [edx]
        pcmpeqb	mm3, [edx]
        pcmpeqb	mm4, [edx]
        pcmpeqb	mm5, [edx]
        pcmpeqb	mm6, [edx]
        pcmpeqb	mm7, [edx]
        ; now the bit->byte sign extension is done 64->512
        pand	mm0, [eax+0*8]  ; * weight[i]
        pand	mm1, [eax+1*8]
        pand	mm2, [eax+2*8]
        pand	mm3, [eax+3*8]
        pand	mm4, [eax+4*8]
        pand	mm5, [eax+5*8]
        pand	mm6, [eax+6*8]
        pand	mm7, [eax+7*8]
        paddusb	mm0, mm1	; add all bytes
        paddusb	mm4, mm5
        paddusb	mm0, mm2
        paddusb	mm4, mm6
        paddusb	mm0, mm3
        paddusb	mm4, mm7
        psadbw	mm0, [edx]	; horizontal add of 8 bytes to one word
        psadbw	mm4, [edx]	; horizontal add of 8 bytes to one word
        paddw	mm0, mm4
        movd	eax, mm0        ; dot product left in eax as return value
    }
}
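
Note that the routine leaves the result in eax and contains no emms, so the
caller has to issue emms (or femms on AMD) before any x87 float code. For
reference, here is a compiler-portable transcription of the same steps with
MMX intrinsics (my untested sketch; the function name is mine, weights is
assumed 8-byte aligned, _mm_sad_pu8 needs the SSE extensions to MMX, and the
compiler must support _mm_cvtsi64_m64):

    #include <mmintrin.h>
    #include <xmmintrin.h>   /* _mm_sad_pu8 (psadbw) */

    int dotProductMMXIntrin(BitBoard bb, const unsigned char weights[64])
    {
        const __m64 mask = _mm_cvtsi64_m64(0x8040201008040201ULL);
        const __m64 zero = _mm_setzero_si64();
        const __m64 *w   = (const __m64*)weights;  /* 8-byte aligned */
        __m64 r[8];

        /* 1. replicate byte i of bb into all eight lanes of r[i] */
        __m64 x = _mm_cvtsi64_m64(bb);
        __m64 l = _mm_unpacklo_pi8(x, x), h = _mm_unpackhi_pi8(x, x);
        __m64 q[4] = { _mm_unpacklo_pi16(l, l), _mm_unpackhi_pi16(l, l),
                       _mm_unpacklo_pi16(h, h), _mm_unpackhi_pi16(h, h) };
        for (int i = 0; i < 4; i++) {
            r[2*i]   = _mm_unpacklo_pi32(q[i], q[i]);
            r[2*i+1] = _mm_unpackhi_pi32(q[i], q[i]);
        }
        for (int i = 0; i < 8; i++) {
            /* 2.+3. lane j of r[i] -> 0xff iff bit 8*i+j of bb is set */
            r[i] = _mm_cmpeq_pi8(_mm_andnot_si64(r[i], mask), zero);
            /* 4. the multiplication: bytewise and with the weights */
            r[i] = _mm_and_si64(r[i], w[i]);
        }
        /* 5. saturated byte adds, horizontal psadbw sums, final 16-bit add */
        __m64 s0 = _mm_adds_pu8(_mm_adds_pu8(r[0], r[1]),
                                _mm_adds_pu8(r[2], r[3]));
        __m64 s1 = _mm_adds_pu8(_mm_adds_pu8(r[4], r[5]),
                                _mm_adds_pu8(r[6], r[7]));
        __m64 sum = _mm_add_pi16(_mm_sad_pu8(s0, zero), _mm_sad_pu8(s1, zero));
        int result = _mm_cvtsi64_si32(sum);
        _mm_empty();   /* emms - leave MMX state before any x87 code */
        return result;
    }

Its result should match the scalar loop above for all inputs whose weight
groups stay below the paddusb saturation limit.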

------------------------------------------------------------------------------

Out of curiosity ...

Popcounting with (rotated) one-weights takes "only" about twice the time of
the mmx-popcount from AMD's Athlon optimization manual. Of course one saves
the pand with "one" by instead subtracting the {0xff|0} mask bytes from zero.

Another idea with the (rotated) 64->512 bit->byte sign extension is some kind
of parallel bitscan: perform a parallel "and" with a vector containing all
bitindex+1 values (1..64, to distinguish from zero). A subsequent parallel add
of -1 yields the proper bit indices, and a -1 byte for the bits that are not
set.
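
In scalar form, for one group of eight byte lanes (my illustration; base is
the index of the group's first bit):

    #include <stdint.h>

    uint64_t parallelBitscan(uint64_t masks, int base) {
        uint64_t res = 0;
        for (int i = 0; i < 8; i++) {
            uint64_t idx1 = base + i + 1;                 /* bitindex+1 lane  */
            uint64_t b    = (masks >> (8*i)) & 0xff;      /* 0x00 or 0xff     */
            res |= (((b & idx1) + 0xff) & 0xff) << (8*i); /* and, then add -1 */
        }
        return res;  /* set bits -> their index, clear bits -> 0xff (-1) */
    }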

Unfortunately, if you want to traverse the "scanned" bit results, one has to
look at all 64 bytes and skip the -1 (0xff) bytes associated with the zero
bits. There seems to be no cheap way to "pack" only the valid indices into one
stream.

Anyway, I am interested in a branchless algorithm to sort 8/16 bytes inside
one SIMD register (using one/two temporary registers), probably using the
PMAXUB, PMINUB instructions and/or some 0x00ff/0xff00 word masks ;-) One
possible building block is sketched below.
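
One such building block might be a single compare-exchange stage that sorts
every adjacent byte pair of an __m64 (my sketch, not a full 8-byte sort; an
odd-even network could repeat such stages with different pairings):

    #include <mmintrin.h>
    #include <xmmintrin.h>   /* _mm_min_pu8/_mm_max_pu8 (pminub/pmaxub) */

    static __m64 sortAdjacentBytePairs(__m64 v) {
        /* swap the two bytes inside each 16-bit word */
        __m64 sw = _mm_or_si64(_mm_srli_pi16(v, 8), _mm_slli_pi16(v, 8));
        __m64 mn = _mm_min_pu8(v, sw);   /* pminub */
        __m64 mx = _mm_max_pu8(v, sw);   /* pmaxub */
        /* min into the low, max into the high byte of every word */
        return _mm_or_si64(_mm_and_si64(mn, _mm_set1_pi16(0x00ff)),
                           _mm_and_si64(mx, _mm_set1_pi16((short)0xff00)));
    }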

Cheers,
Gerd



