Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: BitScan with reset - not so impressive with 3DNow!

Author: Gerd Isenberg

Date: 07:09:31 12/03/02

Go up one level in this thread


On December 03, 2002 at 06:40:36, Gian-Carlo Pascutto wrote:

>On December 02, 2002 at 19:43:15, Gerd Isenberg wrote:
>
>>Congrats, Walter!!
>>
>>10-bit pattern	bsf	PI2FD	btr	c      LSB_64
>>0x0000000011111133	15.3	18.0	19.1	22.8    17.8
>>0x1010111010101110	19.7	18.5	19.6	23.4    17.8
>>0x1111113300000000	20.6	18.0	19.1	22.8    17.8
>
>Does this mean that this is the fastest currently known
>FindFirstAndRemove *and* that it doesn't need assembly?

Hi Gian-Carlo,

Yes, it seems so. Walter's "magic" routine has only cheap instructions and one
lookup with a cache friendly table of 154 Bytes.

Even if the conditional bsf-approach is obviously faster, if we have one-bit's
in the lower half, i hate this kind of asymmetry.

Let's see how fast Hammer's bsf reg64,reg64 will be.
If it sucks like Athlon, there are a lot of good alternatives.
I guess that with 64-bit arithmetic, to do a fast single bit isolation the trick
with 64int to 32bit float conversion is promising.

What i like on Walter's approach is the kind of folding down the sequence of
consecutive trailing one's  (SingleBit - 1) from one 64-bit word to one 32 bit
word without loosing any information.

eg
0000..011|1111...1111
          1111...1100

Then this wonderful magic xor, some parallel prefix shifts with add and sub, to
get a unique value in this 0..153 range. Great respect to Walter, to invent
these algorthm.

Looking for something similar, with two bitboards (fromBB, toBB of valid moves)
to get unique value in a range of let say 0 to 64*28-1 or 2047!


>
>Impressive for sure!
>
>Any similar tricks for lightning-speed popcounting?

I fear with random bits it's not so easy, but who knows...

>
>What's the fastest sequence you've got for popcounting so
>far? (preferably one that doesn't depend too much on new
>instructions)

I use the Athlon MMX-routine provided by AMD's Optimization Guide based on this
c-routine:

UINT32 BitCount (BitBoard bb)
{
	static const UINT32 m1 = 0x55555555;
	static const UINT32 m2 = 0x33333333;
	static const UINT32 m3 = 0x0f0f0f0f;
	UINT32 l = (UINT32) bb;
	UINT32 h = (UINT32)(bb>>32);
	l = l - ((l>>1) & m1);
	h = h - ((h>>1) & m1);
	l = (l & m2) + ((l>>2) & m2);
	h = (h & m2) + ((h>>2) & m2);
	l = (l & m3) + ((l>>4) & m3);
	h = (h & m3) + ((h>>4) & m3);
	l = (l & 0x0ffff) + (l>>16);
	h = (h & 0x0ffff) + (h>>16);
	return ((l & 0x0ff) + (l>>8) + (h & 0x0ff) + (h>>8));
}

For rarely populated bitboards (< 8), i use a count loop with bb &= bb-1.
In this context i use also boolean inlnes like

BOOL IsBitCountGreaterOne(BitBoard bb);

for statements like this

if ( BitCount(bb) > 1 )

I prefere methods without table-lookups, even if they are a bit faster in this
kind of dumb loop measurement. But on the other hand these excessive
computations, keeping most pipes busy, are not so hyperthreading friendly.

Cheers,
Gerd


>--
>GCP



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.