Computer Chess Club Archives


Search

Terms

Messages

Subject: Modulo verus BitScan and MMX-PopCount

Author: Gerd Isenberg

Date: 16:19:02 11/29/02


Hi all Bitboarders,

Recently there was an interesting posting from Joel Venessan. He introduced a
bitscan idea via 64-bit mod 67 to get an unique value for each single bit
exclusively set in a bitboard, most likely determined by a previous (b & -b)
statement. With two small arrays one may map board- and/or 0x88-coordinates to
mod67-coordinates and vice versa. But the div/mod instruction is too slow, so i
was looking to Paul Hsieh's "Integer division and modulus by constants" site:

http://www.azillionmonkeys.com/qed/adiv.html

This formula for k*(2^t) with odd k (2^t is power of two here, not xor :-)

    a % (k*2^t) == a - (((((int)(((2^32)+k-1)/k)) * (a) / (2^32))) & (-2^t))*k

is only valid for 32 bit integers. I played a bit around and this is the result
for mod 67, don't know whether there is something smarter on 32 bit machines
(k is odd, t is zero, & (-1) not necessary):

-------------------------------------------------------------------------------
unsigned int mod67 (unsigned __int64 a)
{
	unsigned __int64 h = 0x03D226357E16ECE6 * (a >> 32); // (2^64+67-1)/67
	unsigned __int64 l = 0x03D22636 * (a & 0xffffffff);  // (2^32+67-1)/67
	int modulus = (int)(a - (__int64)((h+l)>>32) * 67);
	return (unsigned int) modulus + (modulus < 0 ) * 67;
}
-------------------------------------------------------------------------------

Really surprised by the msc compiler, rather compact code (the mul 67 are some
lea-instructions). On my Athlon it's really fast and may be an alternative to
the slow "bitscan" instruction pair which are vector path instructions and
require inline assembly. The routine is about three times faster than using the
pure 64-bit modulo operator %.

Whether the function is really exact, due to some rounding errors of the
constants, exceeds my math skills.

I tried a lot of values in some loops, comparing with %-operator, and found no
error so far. At least the function is "safe" enaugh for the desired purpose,
doing a kind of bitscan.

---

The mod5811 is nice for getting unique values of combined (ored) "from"- and
"to"-bitboards, but the range is rather huge and requires arrays of 2*5800 size.
It's shlightly faster due to smaller consts, i guess.

-------------------------------------------------------------------------------
unsigned int mod5811 (unsigned __int64 a)
{
	unsigned __int64 h = 0xB4725D7BBFAE4 * (a >> 32); // (2^64+5811-1)/5811
	unsigned __int64 l = 0xB4726 * (a & 0xffffffff);  // (2^32+5811-1)/5811
	int modulus = (int)(a - (__int64)((h+l)>>32) * 5811);
	if ( modulus < 0 )
		modulus += 5811;
	return (unsigned int) modulus;
}
-------------------------------------------------------------------------------

Another well known bitscan idea for singular bitboards is to use population
count algorithms after subtracting one:

bitidx    76543210
bit 5 set 00100000
-1        00011111 and count the "ones".

Based on the mmx-popcount from AMD's Athlon Optimization Guide i tried a version
to count the popularity of two bitboards simultaniously. On my current K7-2.1G i
found that subtracting a one-quadword with mmx (which only supports dwords) is
faster than in C (the substracted qwords have additional store/load). The
routine is faster than the bsf-pair (50% high/low- word probability).

Times in seconds for 10^9 runs in a loop K7-2.1G:
mmx-parallel popcount     19.7
two asm bsf pairs v1      24.3
mod5811                   26.4
two asm bsf pairs v2      26.6
64bit % 5811              72.5

Cheers,
Gerd



-------------------------------------------------------------------------------

// Athlon only due to pswapd and psadbw
// precondition: BBs have exactly one bit set
//===========================================
int dualBitSearchMMX(BitBoard bb1, BitBoard bb2)
{
	static const struct
	{
		BitBoard C33;
		BitBoard highone;
		BitBoard C55;
		BitBoard C0F;
	} BitBoardConsts =
	{
		0x3333333333333333,	// C33
		0x0000000100000000,	// highone
		0x5555555555555555,	// C55
		0x0f0f0f0f0f0f0f0f,	// C0F
	};
	__asm
	{
		pxor		mm3, mm3
		pxor		mm4, mm4
		movd		mm0, [bb1]
		punpckldq	mm0, [bb1+4]
		movd		mm2, [bb2]
		punpckldq	mm2, [bb2+4]
		lea		eax, [BitBoardConsts]
		// qword sub one, hammer or p4 have psubq,
                // but psubd must consider overflow
		pcmpeqd mm3, mm0 ; sq1 < 32 ? 0xf..0.. : 0x0..f..
		pcmpeqd mm4, mm2 ; sq2 < 32 ? 0xf..0.. : 0x0..f..
		pswapd	mm1, [eax].highone	; low 1
		pandn	mm3, [eax].highone	; as borrow if sq1 >= 32
		pandn	mm4, [eax].highone	; as borrow if sq2 >= 32
		psubd	mm0, mm1	; bb1 - 1
		psubd	mm2, mm1	; bb2 - 1
		psubd	mm0, mm3	; bb1 - borrow*(1<<32)
		psubd	mm2, mm4	; bb2 - borrow*(1<<32)
		// two parallel popCounts with mm0, mm2
		movq	mm1, mm0
		movq	mm3, mm2
		psrld	mm0, 1
		psrld	mm2, 1
		pand	mm0, [eax].C55
		pand	mm2, [eax].C55
		psubd	mm1, mm0
		psubd	mm3, mm2
		movq	mm0, mm1
		movq	mm2, mm3
		psrld	mm1, 2
		psrld	mm3, 2
		pand	mm0, [eax].C33
		pand	mm2, [eax].C33
		pand	mm1, [eax].C33
		pand	mm3, [eax].C33
		paddd	mm0, mm1
		paddd	mm2, mm3
		movq	mm1, mm0
		movq	mm3, mm2
		psrld	mm0, 4
		psrld	mm2, 4
		paddd	mm0, mm1
		paddd	mm2, mm3
		pxor	mm1, mm1
		pand	mm0, [eax].C0F
		pand	mm2, [eax].C0F
		psadbw	mm0, mm1	; add all bytes
		psadbw	mm2, mm1	; add all bytes
		psllq	mm0, 6		; * 64
		por	mm0, mm2
	        // 64*popCount(bb1-1) + popCount(bb2-1)
		movd	eax, mm0
	}
}

the rountines i compared with:

int BitSearch_v1 (BitBoard bb)
{
	__asm
	{
		bsf	eax,[bb]
		jnz	found
		bsf	eax,[bb+4]
		xor	eax, 32
	found:
	}
}

int BitSearch_v2(BitBoard bb)
{
	__asm
	{
		bsf	eax,[bb+4]
		xor	eax, 32
		bsf	eax,[bb]
	}
}









This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.