Computer Chess Club Archives

Subject: Re: is this really faster?

Author: Gerd Isenberg

Date: 01:51:06 04/18/03

On April 18, 2003 at 03:17:52, Gerd Isenberg wrote:

>On April 17, 2003 at 17:03:00, Anthony Cozzie wrote:
>
>>I ask because
>>
>>1) MMX instructions have a 2-cycle latency on the Athlon,
>>2) getting the data over to the MMX pipe and back takes at *least* 6 cycles,
>>3) this is not really a 64-bit operation,
>>4) I tried AMD's 'optimized' version and it turned out to be much slower than the
>>simple C hack.
>>
>>I'd be very interested in any performance numbers you have.
>>
>>anthony
>
>Hi anthony,
>
>Yes, that may be true for single popcounts. I use the single version very rarely,
>but I make heavy use of non-inlined parallel versions that count the bits of up to four
>bitboards simultaneously (e.g. counting center/king-area weighted attacks), where
>all eight MMX registers are used and the instructions are scheduled properly
>to break dependency chains. That gains something
>(and needs only one final, dead-slow vector-path movd eax, mm0).
>
>For bitboards with a low population probability I often use inlines such as
>isBitCountGreaterOne or an assembler loop version below.
>
>Of course, general-purpose register instructions are faster on x86-32, but only if
>you have enough of these registers ;-)
>
>Gerd
>
>
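
For readers who have not seen these tricks: a minimal C sketch of the two ideas mentioned in the quoted post for sparsely populated bitboards. The name isBitCountGreaterOne comes from the post; its body here, the BitBoard typedef and the helper BitCountLoop are assumptions for illustration, and the loop is a plain C stand-in for the assembler loop version, which is not shown in this message.

```c
#include <stdint.h>

typedef uint64_t BitBoard;  /* assumption: a 64-bit unsigned bitboard */

/* Nonzero if more than one bit is set: clearing the lowest set bit
   leaves something behind only when at least two bits were set. */
static int isBitCountGreaterOne(BitBoard bb)
{
    return (bb & (bb - 1)) != 0;
}

/* Loop popcount (hypothetical helper): one iteration per set bit,
   so it is cheap when only a few bits are expected to be set. */
static int BitCountLoop(BitBoard bb)
{
    int count = 0;
    while (bb) {
        bb &= bb - 1;  /* clear the lowest set bit */
        ++count;
    }
    return count;
}
```

The loop cost grows with the number of set bits, so it probably only pays off when the population is low, which is the case described above.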

I compared the single MMX routine with the C routine below, and the inlined MMX
one seems to be faster in IsiChess (about 1% on an Athlon XP2.1+), at least in some
test positions I tried. That may be due to code size and cache effects, or to the
general lack of registers.

Gerd


__forceinline
int BitCount (BitBoard bb)
{
#ifdef USE_C_BITCOUNT
	/* Parallel (SWAR) popcount on both 32-bit halves, interleaved */
	unsigned int l = LOWBOARD(bb);
	unsigned int h = HIGHBOARD(bb);
	l -= ((l >> 1) & 0x55555555);                       /* 2-bit counts */
	h -= ((h >> 1) & 0x55555555);
	l = (((l >> 2) & 0x33333333) + (l & 0x33333333));   /* 4-bit counts */
	h = (((h >> 2) & 0x33333333) + (h & 0x33333333));
	l = (((l >> 4) + l) & 0x0f0f0f0f);                  /* byte counts */
	h = (((h >> 4) + h) & 0x0f0f0f0f);
	l += (l >> 8);                                      /* fold the bytes together */
	h += (h >> 8);
	l += (l >> 16);
	h += (h >> 16);
	return (l & 0x0000003f) + (h & 0x0000003f);         /* each half is at most 32 */

#else
	__asm
	{
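		movd mm0, dword ptr bb          ; low 32 bits of bb
		punpckldq mm0, dword ptr bb + 4 ; high 32 bits into the upper dword
		lea eax, [BitCountConsts]
		movq mm1,mm0
		psrld mm0,1
		pand mm0,[eax].C55
		psubd mm1,mm0                   ; 2-bit counts
		movq mm0,mm1
		psrld mm1,2
		pand mm0,[eax].C33
		pand mm1,[eax].C33
		paddd mm0,mm1                   ; 4-bit counts
		movq mm1,mm0
		psrld mm0,4
		paddd mm0,mm1
		pand mm0,[eax].C0F              ; byte counts
		pxor mm1,mm1
		psadbw (mm0,mm1)                ; sum of the eight byte counts
		movd eax,mm0                    ; result is returned in eax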
	}
#endif
}
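
To try the C path on its own, the definitions it depends on have to be supplied. The sketch below is one way to do that; the BitBoard typedef, the LOWBOARD/HIGHBOARD macros and the names BitCountC and BitCountReference are assumptions for illustration and are not taken from IsiChess. It uses uint32_t where the routine above uses unsigned int, and adds a naive bit-by-bit count as a reference to check against.

```c
#include <stdint.h>

/* Assumed definitions; the real ones are not shown in the post. */
typedef uint64_t BitBoard;
#define LOWBOARD(bb)  ((uint32_t)((bb) & 0xffffffffu))
#define HIGHBOARD(bb) ((uint32_t)((bb) >> 32))

/* Same arithmetic as the USE_C_BITCOUNT branch, as a standalone function. */
static int BitCountC(BitBoard bb)
{
    uint32_t l = LOWBOARD(bb);
    uint32_t h = HIGHBOARD(bb);
    l -= ((l >> 1) & 0x55555555);                      /* 2-bit counts */
    h -= ((h >> 1) & 0x55555555);
    l = (((l >> 2) & 0x33333333) + (l & 0x33333333));  /* 4-bit counts */
    h = (((h >> 2) & 0x33333333) + (h & 0x33333333));
    l = (((l >> 4) + l) & 0x0f0f0f0f);                 /* byte counts */
    h = (((h >> 4) + h) & 0x0f0f0f0f);
    l += (l >> 8);                                     /* fold the bytes */
    h += (h >> 8);
    l += (l >> 16);
    h += (h >> 16);
    return (int)((l & 0x3f) + (h & 0x3f));
}

/* Naive reference (hypothetical helper): tests one bit per iteration. */
static int BitCountReference(BitBoard bb)
{
    int n = 0;
    for (; bb != 0; bb >>= 1)
        n += (int)(bb & 1);
    return n;
}
```

Comparing BitCountC against BitCountReference on a handful of boards (empty, full, single bits in each half) is a quick sanity check before timing anything.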

