Computer Chess Club Archives



Subject: Re: is this really faster?

Author: Gerd Isenberg

Date: 00:17:52 04/18/03


On April 17, 2003 at 17:03:00, Anthony Cozzie wrote:

>I ask because
>
>1) mmx instructions have a 2-cycle latency on the athlon,
>2) getting the data over to the MMX pipe and back takes at *least* 6 cycles
>3) this is not really a 64 bit operation
>4) I tried AMDs 'optimized' version and it turned out to be much slower than the
>simple C hack
>
>I'd be very interested in any performance numbers you have.
>
>anthony

Hi anthony,

Yes, that may be true for single popcounts. I use the single version very rarely,
but I use a lot of non-inlined parallel versions that count the bits of up to four
bitboards simultaneously (e.g. counting center/king-area weighted attacks), where
all eight MMX registers are used and the instructions are scheduled
to break dependency chains. That gains something
(and there is only one final, dead-slow vector-path movd eax, mm0).
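
To illustrate the idea of counting several bitboards at once, here is a rough portable C sketch. It is not the MMX routine described above; it only mimics the structure with plain 64-bit arithmetic. The name bitCount4 and the BitBoard typedef are assumptions made for the example. The four reduction chains are independent of each other, so a compiler or CPU is free to overlap them, and the partial sums are merged early so that only two final horizontal sums are needed.

```c
#include <stdint.h>

typedef uint64_t BitBoard;  /* assumption: a bitboard is one 64-bit word */

/* Hypothetical helper: total number of set bits in four bitboards.
 * Each statement group works on all four boards, so the dependency
 * chains of a, b, c and d stay independent until they are merged. */
static int bitCount4(BitBoard a, BitBoard b, BitBoard c, BitBoard d)
{
    const uint64_t m1  = 0x5555555555555555ULL;
    const uint64_t m2  = 0x3333333333333333ULL;
    const uint64_t m4  = 0x0f0f0f0f0f0f0f0fULL;
    const uint64_t h01 = 0x0101010101010101ULL;

    /* 2-bit fields: each holds the count of its two bits (0..2) */
    a -= (a >> 1) & m1;
    b -= (b >> 1) & m1;
    c -= (c >> 1) & m1;
    d -= (d >> 1) & m1;

    /* 4-bit fields: each holds 0..4 */
    a = (a & m2) + ((a >> 2) & m2);
    b = (b & m2) + ((b >> 2) & m2);
    c = (c & m2) + ((c >> 2) & m2);
    d = (d & m2) + ((d >> 2) & m2);

    /* merge pairs of boards: 4-bit fields now hold 0..8, still no overflow */
    a += b;
    c += d;

    /* byte fields: each holds 0..16 */
    a = (a & m4) + ((a >> 4) & m4);
    c = (c & m4) + ((c >> 4) & m4);

    /* horizontal byte sums; each is at most 128, so it fits in one byte.
     * They are added as ints because the grand total can reach 256. */
    return (int)((a * h01) >> 56) + (int)((c * h01) >> 56);
}
```

A weighted variant would keep the per-board sums separate a little longer and multiply before the final add; how that compares with the MMX version would have to be measured.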

For bitboards that are likely to be sparsely populated, I often use inlines like
isBitCountGreaterOne or the assembler loop version below.
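
The body of isBitCountGreaterOne is not shown in this post. One common way to write such a test, given here only as a sketch under that assumption, is to clear the lowest set bit and check whether anything remains. The BitBoard typedef is also an assumption.

```c
#include <stdint.h>

typedef uint64_t BitBoard;  /* assumption: a bitboard is one 64-bit word */

/* Sketch of a "more than one bit set?" test. bb & (bb - 1) clears the
 * lowest set bit, so the result is nonzero only if a second bit exists.
 * For bb == 0 the expression is 0 & ~0 == 0, so the answer is "no". */
static inline int isBitCountGreaterOne(BitBoard bb)
{
    return (bb & (bb - 1)) != 0;
}
```

This avoids a full popcount when the only question is whether, say, a square is attacked more than once.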

Of course, general-purpose register instructions are faster on x86-32, but only if
you have enough of these registers ;-)

Gerd


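// Counts the set bits of a 64-bit bitboard, 32 bits at a time, by repeatedly
// clearing the lowest set bit. The count is left in eax as the return value.
static __forceinline int _fastcall BitCount8 (BitBoard bb)
{
	__asm
	{
		mov     ecx, dword ptr bb       // low 32 bits
		xor     eax, eax                // count = 0
		test    ecx, ecx
		jz      l1                      // low half empty
	    l0: lea     edx, [ecx-1]
		inc     eax                     // one more bit found
		and     ecx, edx                // clear lowest set bit: x &= x-1
		jnz     l0
	    l1: mov     ecx, dword ptr bb+4     // high 32 bits
		test    ecx, ecx
		jz      l3                      // high half empty
	    l2: lea     edx, [ecx-1]
		inc     eax
		and     ecx, edx
		jnz     l2
	    l3:
	}
}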
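
For readers without MSVC inline assembly, the same loop can be sketched in portable C. This is only an illustration of the technique; the name bitCountSparse and the BitBoard typedef are assumptions, and the compiler decides how the 64-bit value is split on a 32-bit target.

```c
#include <stdint.h>

typedef uint64_t BitBoard;  /* assumption: a bitboard is one 64-bit word */

/* Loop popcount: one iteration per set bit, so it is cheap for sparse
 * bitboards and slow for dense ones. */
static int bitCountSparse(BitBoard bb)
{
    int count = 0;
    while (bb != 0) {
        bb &= bb - 1;   /* clear the lowest set bit */
        ++count;
    }
    return count;
}
```

The run time grows with the number of set bits, which is why this form only pays off when few bits are expected.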






Last modified: Thu, 15 Apr 21 08:11:13 -0700
