Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Fast 3DNow! BitScan, one more faster

Author: Sune Fischer

Date: 08:53:37 12/02/02

Go up one level in this thread


On December 02, 2002 at 08:59:38, Gerd Isenberg wrote:

>On December 02, 2002 at 07:19:06, Sune Fischer wrote:
>
>>On December 01, 2002 at 17:05:06, Gerd Isenberg wrote:
>>
>>>oups, something shorter and faster:
>>>
>>>int getBitIndex(BitBoard singleBit)
>>>{
>>>	__asm
>>>	{
>>>		pxor	mm2, mm2	; 0
>>>		movd		mm0, [singleBit]
>>>		punpckldq	mm0, [singleBit+4]
>>>		pcmpeqd	mm6, mm6	; -1
>>>		pxor	mm7, mm7	; 0
>>>		pcmpeqd	mm2, mm0	; ~mask of the none zero dword
>>>		PI2FD	mm1, mm0	; 3f8..,400..
>>>		pxor	mm2, mm6	; mask of the none zero dword
>>>		psrlq	mm6, 63		; 01
>>>		psrld	mm1, 23		; 3f8 to 7f
>>>		psrld	mm2, 25		; 7f mask
>>>		psllq	mm6, 32+5	; 20:00
>>>		psubd	mm1, mm2	; - 7f mask
>>>		por	mm1, mm6	; + 32 in high dword
>>>		pand	mm1, mm2	; & 7f mask
>>>		psadbw	mm1, mm7	; add all bytes
>>>		movd	eax, mm1
>>>	}
>>>}
>>
>>This is great, I will try it.
>>
>>What I really need is GetFirstBitAndReset() functions.
><snip>
>
>>Is it possible to make it xor out the bit it found too?
>>Perhaps it is too complicated, in my case I think b&(-b) needs to be in
>>assembler, so that the precondition is removed entirely.
>
>
>Hi Sune,
>
>hmm, lets try it on the fly (i'm at work, so not tested and optimized so far):
>
>int bitSearchAndReset(BitBoard &bb)
>{
>    BitBoard lsb = bb & -((__int64)bb);
>    bb ^= lsb;
>    return getBitIndex(lsb); // should be inlined
>}
>
>With mmx there is some trouble with the 64-bit twos-complement, because there is
>no paddq:
>
>int bitSearchAndReset(BitBoard &bb)
>{
>	__asm
>	{
>		pxor	mm2, mm2	; 0
>		pxor	mm3, mm3	; 0
>		pcmpeqd	mm6, mm6	; -1
>		pcmpeqd	mm1, mm1	; -1
>
>                mov     eax, [bb]
>		movq	mm0, [eax]      ; assume properly aligned bitboard
>		psrlq	mm6, 63		; 00:01
>                pxor    mm1, mm0	; ~bb, ones complement
>                paddd   mm1, mm6        ; +1 but no overflow to high dword
>		psllq	mm6, 32         ; 01:00
>		pcmpeqd	mm3, mm1        ; look whether low dword is zero due to overflow
>                psllq   mm3, 1          ; shift carry to the right place
>                pand    mm3, mm6        ;  ... and mask 1
>                paddd   mm1, mm3        ; add possible overflow, no we have -bb
>                pand    mm0, mm1	; lsb = bb & -bb
>		pxor	mm7, mm7	; 0
>		pxor	[eax], mm0      ; reset lsb in bb
>
>		pcmpeqd	mm2, mm0	; ~mask of the none zero dword
>		PI2FD	mm1, mm0	; 3f8..,400..
>		pxor	mm2, mm6	; mask of the none zero dword
>		psrld	mm1, 23		; 3f8 to 7f
>		psrld	mm2, 25		; 7f mask
>		psllq	mm6, 5		; 20:00
>		psubd	mm1, mm2	; - 7f mask
>		por	mm1, mm6	; + 32 in high dword
>		pand	mm1, mm2	; & 7f mask
>		psadbw	mm1, mm7	; add all bytes
>		movd	eax, mm1
>	}
>}
>
>>Is it possible to do a similar optimization on 32 bit?
>
>may be...
>
>>
>>I have this:  oups, there is a serious error!
>>uint32  FirstBit32(uint32 bitmap)
>>{
>>	__asm
>>	{
>>		bsf	eax, [bitmap]
>>		jnz	done
>>		mov	eax, 0 // That is even true if bit 0 is set !!!
>>	done:
>>	}
>>}
>
>should be:
>
>uint32  FirstBit32(uint32 bitmap)
>{
>	__asm
>	{
>		bsf	eax, [bitmap]
>		jnz	done
>		mov	eax, 0xffffffff
>	done:
>	}
>}

It would crash if I tried continuing with that value, so it has to be a value
in the range 0-31.
I know it is no good for error detection then, but I don't use that anyway,
hence my question for the precoditioning :)

>or
>
>int FirstBit32(uint32 bitmap)
>{
>	__asm
>	{
>		bsf	eax, [bitmap]
>		jnz	done
>		mov	eax, -1
>	done:
>	}
>}

signed/unsigned mix is just slowing things down. I always use unsigned when
possible, it gives a measurable boost! :)

>>
>>I would like functions that precondition the bitboard is not empty, ie. that at
>>least 1 bit is set. The little function above isn't optimized for that, how do I change it?

so would this work?
uint32 FirstBit32(uint32 bitmap)
{
	__asm
	{
		bsf	eax, [bitmap]
	}
}

?
I do not want redundant assembler lines, I already know the bitmaps (32 or 64
bits) aren't 0 because they are running inside while loops, so I hope there is
no testing of that?

>>Thanks :)
>>-S.
>
>or this one for Athlon with PI2FD with reset of the found bit:
>
>// should return < 0 (0x80000000) if bitmap is zero
>// not tested !!
>
>int FirstBit32WithReset(unsigned int &bitmap)
>{
>	__asm
>	{
>		mov	ebx, [bitmap]
>		xor	edx, edx        ; 0
>		mov	eax, [ebx]      ; b
>		sub	edx, eax        ; -b
>		and	eax, edx        ; b & -b
>                xor     [ebx], eax      ; reset bit, if any
>		movd    mm0, eax        ; hmm... vector path
>		PI2FD	mm0, mm0	; 0-0; 1-3f8.., 2-400.., 4-408
>		movd	eax, mm0
>		shr     eax, 23         ; 0, 7f, 80, 81...
>		sub     eax, 0x7f
>		and     eax, 0x8000001f
>	}
>}
>
>But there are a lot of register dependencies...

Hmm, may be, I just thought it would be "nice" to not have that x&=x-1
everywhere. It's more compact an a little bit less errorprone, sometimes I
forget and have to debug an infinite loop stall.

I will try them, thanks :)

-S.

>Gerd



This page took 0.02 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.