Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: is this really faster?

Author: Gerd Isenberg

Date: 03:50:53 04/18/03

Go up one level in this thread


On April 18, 2003 at 05:16:04, Tim Foden wrote:

>On April 18, 2003 at 04:51:06, Gerd Isenberg wrote:
>
>>On April 18, 2003 at 03:17:52, Gerd Isenberg wrote:
>>
>>>On April 17, 2003 at 17:03:00, Anthony Cozzie wrote:
>>>
>>>>I ask because
>>>>
>>>>1) mmx instructions have a 2-cycle latency on the athlon,
>>>>2) getting the data over to the MMX pipe and back takes at *least* 6 cycles
>>>>3) this is not really a 64 bit operation
>>>>4) I tried AMDs 'optimized' version and it turned out to be much slower than the
>>>>simple C hack
>>>>
>>>>I'd be very interested in any performance numbers you have.
>>>>
>>>>anthony
>>>
>>>Hi anthony,
>>>
>>>Yes, that may be true for single popcounts. I use this single one very rarely
>>>but a lot of none inlined parallel versions to count the bits of up to four
>>>bitboards simultaniosly (eg. counting center/king area weighted attacks), where
>>>all 8-mmx registers are used and the instructions are scheduled in a proper way,
>>>to break dependency chains. That gains something
>>>(also only one final, dead slow vector path movd eax, mm0).
>>>
>>>For bitboards with low population probability i often use some inlines like
>>>isBitCountGreaterOne or an assembler loop version below.
>>>
>>>Of course, general purpose register instructions are faster on x86-32, but if
>>>you don't have enaugh of these registers ;-)
>>>
>>>Gerd
>>>
>>>
>>
>>I compared the single mmx-routine with the C-routine below - and the inlined mmx
>>one seems to be faster in IsiChess (~1% Athlon XP2.1+), at least in some
>>testpositios i tried. May be due to code size and cache effects or the general
>>lack of registers.
>>
>>Gerd
>>
>>
>>__forceinline
>>int BitCount (BitBoard bb)
>>{
>>#ifdef USE_C_BITCOUNT
>>	unsigned int l = LOWBOARD(bb);
>>	unsigned int h = HIGHBOARD(bb);
>>        l -= ((l >> 1) & 0x55555555);
>>        h -= ((h >> 1) & 0x55555555);
>>        l = (((l >> 2) & 0x33333333) + (l & 0x33333333));
>>        h = (((h >> 2) & 0x33333333) + (h & 0x33333333));
>>        l = (((l >> 4) + l) & 0x0f0f0f0f);
>>        h = (((h >> 4) + h) & 0x0f0f0f0f);
>>        l += (l >> 8);
>>        h += (h >> 8);
>>        l += (l >> 16);
>>        h += (h >> 16);
>>        return(l & 0x0000003f) + (h & 0x0000003f);
>
>Hey Gerd,
>
>Ever tried changing the above to something like this?
>
>>	unsigned int l = LOWBOARD(bb);
>>	unsigned int h = HIGHBOARD(bb);
>>        l -= ((l >> 1) & 0x55555555);
>>        h -= ((h >> 1) & 0x55555555);
>>        l = (((l >> 2) & 0x33333333) + (l & 0x33333333));
>>        h = (((h >> 2) & 0x33333333) + (h & 0x33333333));
>>        l = (((l >> 4) + l) & 0x0f0f0f0f) + (((h >> 4) + h) & 0x0f0f0f0f);
>>        l += (l >> 8);
>>        l += (l >> 16);
>>        return (l & 0x0000007f);
>
>Or even this?
>
>>	unsigned int l = LOWBOARD(bb);
>>	unsigned int h = HIGHBOARD(bb);
>>        l -= ((l >> 1) & 0x55555555);
>>        h -= ((h >> 1) & 0x55555555);
>>        l = (((l >> 2) & 0x33333333) + (l & 0x33333333)) +
>>            (((h >> 2) & 0x33333333) + (h & 0x33333333));
>>        l = (((l >> 4) & 0x0f0f0f0f) + (l & 0x0f0f0f0f));
>>        l += (l >> 8);
>>        l += (l >> 16);
>>        return (l & 0x0000007f);
>
>Take care... I just edited them... I haven't tested the changes at all, so they
>may have bugs in... but I guess you see what I'm getting at.
>
>Cheers, Tim.

Hi Tim,

yes one may save some instructions in some clever way - but one may introduce a
few more dependencies. As i already mentioned, i'm rather happy with the
mmx-stuff, specially with parallized versions to count multiple bitboards. So i
don't like to put too much energy in this popcount stuff, because of lack of
time - and x86-64 is near the horizon. I guess using 64-bit general purpose
registers is unbeatable then ;-)

Sample routine to count attack squares with following weighting factors:

 1 1 1 1 1 1 1 1
 1 2 2 2 2 2 2 1
 1 2 3 3 3 3 2 1
 1 2 3 4 4 3 2 1
 1 2 3 4 4 3 2 1
 1 2 3 3 3 3 2 1
 1 2 2 2 2 2 2 1
 1 1 1 1 1 1 1 1

I count the center / extended center squares in one mmx-register and pass an
additional bitboard eg. with neighboard king squares or passers stop squares.

static
int WeightedBitCount (BitBoard weightbb, BitBoard additional)
{
	__asm
	{
		movd mm0, word ptr weightbb
		punpckldq mm0, word ptr weightbb + 4
		movd mm6, word ptr additional
		punpckldq mm6, word ptr additional + 4

		lea	 eax, [BitCountConsts]
		movq mm2,mm0		; bb
		movq mm4,mm0		; bb
		movq mm7,mm6
		psrlq mm2,16		; shift extended center to low board
		pand mm4, [eax].NOTBOARDER
		punpckldq mm2,mm2	; mirror it to high board
		movq mm1,mm0		; bb
		pand mm2, [eax].CENTER_MASK
                ; mask mirrored center as well as extended center

		movq mm5,mm4
		psrld mm0,1
		movq mm3,mm2
		psrld mm2,1
		psrld mm4,1
		pand mm0, [eax].C55
		psrld mm6,1

		pand mm2, [eax].C55
		pand mm4, [eax].C55
		psubd mm1,mm0
		pand mm6, [eax].C55
		psubd mm3,mm2
		psubd mm5,mm4
		movq mm0,mm1
		psubd mm7,mm6
		movq mm2,mm3
		movq mm4,mm5
		psrld mm1,2
		movq mm6,mm7

		psrld mm3,2
		psrld mm5,2
		pand mm0,[eax].C33
		pand mm1,[eax].C33
		psrld mm7,2
		pand mm2,[eax].C33
		pand mm3,[eax].C33
		pand mm4,[eax].C33
		pand mm5,[eax].C33
		paddd mm0,mm1
		pand mm6,[eax].C33
		pand mm7,[eax].C33
		paddd mm2,mm3
		paddd mm4,mm5
		movq mm1,mm0
		paddd mm6,mm7
		movq mm3,mm2
		movq mm5,mm4
		psrld mm0,4
		movq mm7,mm6

		psrld mm2,4
		psrld mm4,4
		paddd mm0,mm1
		psrld mm6,4
		paddd mm2,mm3
		paddd mm4,mm5
		pxor mm1,mm1
		pand mm0,[eax].C0F
		paddd mm6,mm7
		pand mm2,[eax].C0F
		pand mm4,[eax].C0F
		psadbw (mm0,mm1)
		pand mm6,[eax].C0F

		psadbw (mm2,mm1)
		psadbw (mm4,mm1)
		paddd mm0,mm2
		psadbw (mm6,mm1)
		paddd mm0,mm4
		paddd mm0,mm6

		movd eax,mm0
	}
}

const SBitCountConsts CACHE_ALIGN BitCountConsts =
{
	0x3333333333333333,	// C33
	0x5555555555555555,	// C55
	0x0f0f0f0f0f0f0f0f,	// C0F
	0x007e7e7e7e7e7e00,	// NOTBOARDER
	0x3c3c3c3c00181800,	// CENTER_MASK
// (shifted and high low mirrored extended center and center mask)
};

Cheers,
Gerd




This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.