Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Bitscan Updated

Author: Matt Taylor

Date: 17:12:50 02/16/03

Go up one level in this thread


On February 16, 2003 at 07:37:31, Gerd Isenberg wrote:

>On February 16, 2003 at 05:14:41, Matt Taylor wrote:
>
>>On February 14, 2003 at 23:52:09, Matt Taylor wrote:
>>
>>>On February 14, 2003 at 14:31:42, Gerd Isenberg wrote:
>>>
>>>>I tried two of your routines so far, the cmov/single-bsf one and the
>>>>LSB_32_table lookup one. It's a mess, both without any branches, but IsiChess is
>>>>stil 2-4% slower with these. I guess some chaotic code size dependent side
>>>>effects (Athlon Xp 2000+, 512MB SDRAM).
>>>>
>>>>Regards,
>>>>Gerd
>>
>>What -did- you use before? After thinking about it, I don't see how these
>>routines could make such an impact. It would be very difficult to shrink them
>>further.
>>
>>Have you tried not inlining them? That would make the resulting code smaller.
>>
>>-Matt
>
>Hi Matt,
>
>this one (posted already x times) is the routine i compared with:
>
>__forceinline
>UINT BitSearchAndReset(BitBoard &bb)
>{
>	__asm
>	{
>		mov		ebx, [bb]
>		xor		eax, eax
>		mov		edx, 1
>		bsf		ecx, [ebx]
>		jnz		found
>		bsf		ecx, [ebx + 4]
>		lea		ebx, [ebx + 4]
>		xor		eax, 32
>	found:
>		shl		edx, cl
>		xor		eax, ecx
>		xor		[ebx], edx
>	}
>
>}

At least 30 bytes. (Crude estimation; I did not take into account that bb
resolves to something like esp+x or ebp+x.)

>To my surprise this conditional routine using four registers is faster than this
>short unconditional one but using three vector path instructions:
>
>__forceinline
>UINT BitSearchAndReset(BitBoard &bb)
>{
>	__asm
>	{
>		mov		edx, [bb]
>		bsf		eax, [edx+4]
>		xor		eax, 32
>		bsf		eax, [edx]
>		btr		[edx],eax
>	}
>}

Yes, each bsf instruction costs 8 cycles, and btr costs another 9.

>With none inlining, your routines and even walters magic lsb64 lookup routine
>are all about the same in average, but the nps and solution time performance is
>significantly worse (4-10%).
>
>Regards,
>Gerd

Yes, the 10+ cycles latency from the function call would tend to minimize the
benefit of 3 saved cycles. :-)

I can't figure out why performance would be better looping the bsf instruction
rather than using cmov and executing it once. The code size isn't much
different; perhaps there is some subtle side-effect at work here? Perhaps my
choice of esi to store bb in was a bad choice. I usually prefer esi to ebx
because esi can be used in string instructions and VC likes to use ebx when
aligning variables on the stack.

-Matt



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.