Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Bitscan Updated

Author: Dezhi Zhao

Date: 12:55:37 02/16/03

Go up one level in this thread


On February 16, 2003 at 07:37:31, Gerd Isenberg wrote:

>On February 16, 2003 at 05:14:41, Matt Taylor wrote:
>
>>On February 14, 2003 at 23:52:09, Matt Taylor wrote:
>>
>>>On February 14, 2003 at 14:31:42, Gerd Isenberg wrote:
>>>
>>>>I tried two of your routines so far, the cmov/single-bsf one and the
>>>>LSB_32_table lookup one. It's a mess, both without any branches, but IsiChess is
>>>>stil 2-4% slower with these. I guess some chaotic code size dependent side
>>>>effects (Athlon Xp 2000+, 512MB SDRAM).
>>>>
>>>>Regards,
>>>>Gerd
>>
>>What -did- you use before? After thinking about it, I don't see how these
>>routines could make such an impact. It would be very difficult to shrink them
>>further.
>>
>>Have you tried not inlining them? That would make the resulting code smaller.
>>
>>-Matt
>
>Hi Matt,
>
>this one (posted already x times) is the routine i compared with:
>
>__forceinline
>UINT BitSearchAndReset(BitBoard &bb)
>{
>	__asm
>	{
>		mov		ebx, [bb]
>		xor		eax, eax
>		mov		edx, 1
>		bsf		ecx, [ebx]
>		jnz		found
>		bsf		ecx, [ebx + 4]
>		lea		ebx, [ebx + 4]
>		xor		eax, 32
>	found:
>		shl		edx, cl
>		xor		eax, ecx
>		xor		[ebx], edx
>	}
>
>}
>
>To my surprise this conditional routine using four registers is faster than this
>short unconditional one but using three vector path instructions:
>
>__forceinline
>UINT BitSearchAndReset(BitBoard &bb)
>{
>	__asm
>	{
>		mov		edx, [bb]
>		bsf		eax, [edx+4]
>		xor		eax, 32
>		bsf		eax, [edx]
>		btr		[edx],eax
>	}
>}
>
>With none inlining, your routines and even walters magic lsb64 lookup routine
>are all about the same in average, but the nps and solution time performance is
>significantly worse (4-10%).
>
>Regards,
>Gerd

This sounds quite nomral to me. If you choose to inline them, you shorter
version will be faster.  Matt's functions are optimal in context of functions.
If you choose to inline them, you have to factor in register usages and out of
order execuation. Your shoter version uses fewer registers and the processor can
often find OOE oppotunities to hide the 2nd bsf instruction.

Of course, if you go hand inlining Mat's code into your code (rewrite all the
callers in asm), you will see the speedup. And you find even more pairing
oppotunites:)

Have fun

dzhao



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.