Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Bitscan Updated

Author: Matt Taylor

Date: 17:08:00 02/16/03

Go up one level in this thread


On February 16, 2003 at 15:55:37, Dezhi Zhao wrote:

>On February 16, 2003 at 07:37:31, Gerd Isenberg wrote:
>
>>On February 16, 2003 at 05:14:41, Matt Taylor wrote:
>>
>>>On February 14, 2003 at 23:52:09, Matt Taylor wrote:
>>>
>>>>On February 14, 2003 at 14:31:42, Gerd Isenberg wrote:
>>>>
>>>>>I tried two of your routines so far, the cmov/single-bsf one and the
>>>>>LSB_32_table lookup one. It's a mess, both without any branches, but IsiChess is
>>>>>stil 2-4% slower with these. I guess some chaotic code size dependent side
>>>>>effects (Athlon Xp 2000+, 512MB SDRAM).
>>>>>
>>>>>Regards,
>>>>>Gerd
>>>
>>>What -did- you use before? After thinking about it, I don't see how these
>>>routines could make such an impact. It would be very difficult to shrink them
>>>further.
>>>
>>>Have you tried not inlining them? That would make the resulting code smaller.
>>>
>>>-Matt
>>
>>Hi Matt,
>>
>>this one (posted already x times) is the routine i compared with:
>>
>>__forceinline
>>UINT BitSearchAndReset(BitBoard &bb)
>>{
>>	__asm
>>	{
>>		mov		ebx, [bb]
>>		xor		eax, eax
>>		mov		edx, 1
>>		bsf		ecx, [ebx]
>>		jnz		found
>>		bsf		ecx, [ebx + 4]
>>		lea		ebx, [ebx + 4]
>>		xor		eax, 32
>>	found:
>>		shl		edx, cl
>>		xor		eax, ecx
>>		xor		[ebx], edx
>>	}
>>
>>}
>>
>>To my surprise this conditional routine using four registers is faster than this
>>short unconditional one but using three vector path instructions:
>>
>>__forceinline
>>UINT BitSearchAndReset(BitBoard &bb)
>>{
>>	__asm
>>	{
>>		mov		edx, [bb]
>>		bsf		eax, [edx+4]
>>		xor		eax, 32
>>		bsf		eax, [edx]
>>		btr		[edx],eax
>>	}
>>}
>>
>>With none inlining, your routines and even walters magic lsb64 lookup routine
>>are all about the same in average, but the nps and solution time performance is
>>significantly worse (4-10%).
>>
>>Regards,
>>Gerd
>
>This sounds quite nomral to me. If you choose to inline them, you shorter
>version will be faster.  Matt's functions are optimal in context of functions.
>If you choose to inline them, you have to factor in register usages and out of
>order execuation. Your shoter version uses fewer registers and the processor can
>often find OOE oppotunities to hide the 2nd bsf instruction.

First let me say that we are specifically talking about Athlon.

On Athlon, VectorPath instructions (like bsf) decode into a series of macro-ops.
OOOE cannot hide the 8 cycles of latency this produces because the microcoded
bsf instruction consumes most (all?) of the execution resources. At times I have
noted the bsf instruction taking only 7 cycles. Perhaps this is an effect of
OOOE.

Register usage is a non-issue because VC assumes that all general registers are
trashed over inline assembly. I can use 1 register of 5 registers; it makes no
difference. Besides -- he and I both use 4 registers.

>Of course, if you go hand inlining Mat's code into your code (rewrite all the
>callers in asm), you will see the speedup. And you find even more pairing
>oppotunites:)

Isn't that what the compiler is already doing?

>Have fun
>
>dzhao

-Matt



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.