Author: Gerd Isenberg
Date: 00:17:52 04/18/03
On April 17, 2003 at 17:03:00, Anthony Cozzie wrote:
>I ask because
>
>1) mmx instructions have a 2-cycle latency on the athlon,
>2) getting the data over to the MMX pipe and back takes at *least* 6 cycles
>3) this is not really a 64 bit operation
>4) I tried AMDs 'optimized' version and it turned out to be much slower than the
>simple C hack
>
>I'd be very interested in any performance numbers you have.
>
>anthony
Hi anthony,
Yes, that may be true for single popcounts. I use the single version very rarely,
but I use a lot of non-inlined parallel versions that count the bits of up to four
bitboards simultaneously (e.g. counting center/king area weighted attacks), where
all eight MMX registers are used and the instructions are scheduled properly
to break dependency chains. That gains something
(there is also only one final, dead slow vector path movd eax, mm0).
For bitboards with low population probability I often use some inlines like
isBitCountGreaterOne or the assembler loop version below.
Of course, general purpose register instructions are faster on x86-32, but not if
you don't have enough of these registers ;-)
Gerd
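static __forceinline int _fastcall BitCount8 (BitBoard bb)
{
    __asm
    {
        mov   ecx, dword ptr bb     ; low 32 bits of the bitboard
        xor   eax, eax              ; eax = running bit count
        test  ecx, ecx
        jz    l1                    ; low half empty, skip its loop
    l0: lea   edx, [ecx-1]          ; edx = ecx - 1
        inc   eax                   ; count one bit
        and   ecx, edx              ; clear the lowest set bit
        jnz   l0                    ; repeat while bits remain
    l1: mov   ecx, dword ptr bb+4   ; high 32 bits of the bitboard
        test  ecx, ecx
        jz    l3                    ; high half empty, done
    l2: lea   edx, [ecx-1]
        inc   eax
        and   ecx, edx
        jnz   l2
    l3:
    }
    // no return statement: the count is left in eax
}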
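
For readers without MSVC inline assembly, the same idea can be sketched in portable C. This is only an illustration, not the code from the post: the names bit_count_sparse and is_bit_count_greater_one are made up here, BitBoard is assumed to be a 64-bit unsigned integer, and the greater-than-one test is a guess at what an inline like isBitCountGreaterOne might do. Unlike the assembler version, the sketch loops over the whole 64-bit value instead of two 32-bit halves.

```c
#include <stdint.h>

/* Assumption: a bitboard is a plain 64-bit unsigned integer. */
typedef uint64_t BitBoard;

/* Hypothetical portable counterpart of the assembler loop:
   clear the lowest set bit until none remain, counting the steps.
   The loop runs once per set bit, so it is cheap for sparse boards. */
static int bit_count_sparse(BitBoard bb)
{
    int count = 0;
    while (bb != 0) {
        bb &= bb - 1;   /* same trick as lea edx,[ecx-1] / and ecx,edx */
        ++count;
    }
    return count;
}

/* Hypothetical test for "more than one bit set": clearing the lowest
   set bit leaves a non-zero value only if a second bit exists. */
static int is_bit_count_greater_one(BitBoard bb)
{
    return (bb & (bb - 1)) != 0;
}
```

The loop cost depends on the number of set bits rather than the word width, which is why this form suits bitboards with low population probability.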
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.