Computer Chess Club Archives


Search

Terms

Messages

Subject: Fast 3DNow! BitScan

Author: Gerd Isenberg

Date: 12:56:58 12/01/02


Hi all,

The fastest 64bit - BitScan with Athlons so far!

The AMD-3DNow! Instruction PI2FD does a packed 32-bit (signed) integer to
floating-point (IEEE 32-bit) conversion. The PI2FD (direct path, 4 cycles
latency) instruction performs the following operations:

mmreg1 [31: 0] == float(mmreg2/mem64 [31: 0])
mmreg1 [63:32] == float(mmreg2/mem64 [63:32])

Bit 31 of the float is the sign bit, the eight bits 30:23 are the biased
exponent and the remaining 23 bits are the significand.

If you pass single-bit bitboards through this instruction, the bitscan idea
becomes obvious clear due to the biased base two exponent.

bit 0-63 PI2FD
00000001 3F800000
00000002 40000000
00000004 40800000
00000008 41000000
00000010 41800000
..
40000000 4E800000
80000000 CF000000 ! sign

Some additional shift, substraction and masks, and there is faster bitscan
routine than with two 32-bit bsf-instructions!

// Athlon only due to pswapd and PI2FD
// precondition: BBs has exactly one bit set
// may be used as bitscan with (b&-b)
//===========================================
int getBitIndex(BitBoard singleBit)
{
	__asm
	{
		pxor	mm2, mm2	; 0
		movd	  mm0, [singleBit]
		punpckldq mm0, [singleBit+4]
		pcmpeqd	mm4, mm4	; -1
		pcmpeqd	mm3, mm3	; -1
		pcmpeqd	mm2, mm0	; -1-mask for the zero dword
		PI2FD	mm1, mm0	; 3f8..,400..
		psrlq	mm4, 63		; 00:01
		pxor	mm2, mm3	; -1-mask for the none zero dword
		psrld	mm3, 25		; 7f:7f
		psrld	mm1, 23		; 3f8.. to 7f
		psllq	mm4, 32+5	; 20:00 to add 32 to high dword
		psubd	mm1, mm3	; - 7f:7f
		pand	mm1, mm3	; & 7f:7f
		por	mm1, mm4	; + 32 to high dword
		pand	mm1, mm2	; mask apropriate low- or high dword
		pswapd	mm2, mm1	; swap dwords
		por	mm1, mm2	; combine low and high dword
		movd	eax, mm1
	}
}

Even if there is some effort to calculate the consts 7f:7f and 20:0, this is
shlightly faster than the bsf pair, even with (b&-b) outside the functions!

A parallized verions with two bitboards (forBB, toBB) does very well, only 12.5
seconds for 10^9 runs, faster than a single bsf pair.

Cheers,
Gerd



This page took 0.05 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.