Computer Chess Club Archives



Subject: Re: Bitscan Updated

Author: Matt Taylor

Date: 00:58:13 02/17/03



On February 16, 2003 at 22:05:38, Dezhi Zhao wrote:

>On February 16, 2003 at 20:08:00, Matt Taylor wrote:
>
>>On February 16, 2003 at 15:55:37, Dezhi Zhao wrote:
>>
>>>On February 16, 2003 at 07:37:31, Gerd Isenberg wrote:
>>>
>>>>On February 16, 2003 at 05:14:41, Matt Taylor wrote:
>>>>
>>>>>On February 14, 2003 at 23:52:09, Matt Taylor wrote:
>>>>>
>>>>>>On February 14, 2003 at 14:31:42, Gerd Isenberg wrote:
>>>>>>
>>>>>>>I tried two of your routines so far, the cmov/single-bsf one and the
>>>>>>>LSB_32_table lookup one. It's a mess: both are without any branches, but
>>>>>>>IsiChess is still 2-4% slower with these. I guess there are some chaotic,
>>>>>>>code-size-dependent side effects (Athlon XP 2000+, 512MB SDRAM).
>>>>>>>
>>>>>>>Regards,
>>>>>>>Gerd
>>>>>
>>>>>What -did- you use before? After thinking about it, I don't see how these
>>>>>routines could make such an impact. It would be very difficult to shrink them
>>>>>further.
>>>>>
>>>>>Have you tried not inlining them? That would make the resulting code smaller.
>>>>>
>>>>>-Matt
>>>>
>>>>Hi Matt,
>>>>
>>>>This one (posted several times already) is the routine I compared with:
>>>>
>>>>__forceinline
>>>>UINT BitSearchAndReset(BitBoard &bb)
>>>>{
>>>>	__asm
>>>>	{
>>>>		mov		ebx, [bb]
>>>>		xor		eax, eax
>>>>		mov		edx, 1
>>>>		bsf		ecx, [ebx]
>>>>		jnz		found
>>>>		bsf		ecx, [ebx + 4]
>>>>		lea		ebx, [ebx + 4]
>>>>		xor		eax, 32
>>>>	found:
>>>>		shl		edx, cl
>>>>		xor		eax, ecx
>>>>		xor		[ebx], edx
>>>>	}
>>>>
>>>>}
>>>>
>>>>To my surprise, this conditional routine using four registers is faster than
>>>>the following short unconditional one, which uses three VectorPath
>>>>instructions:
>>>>
>>>>__forceinline
>>>>UINT BitSearchAndReset(BitBoard &bb)
>>>>{
>>>>	__asm
>>>>	{
>>>>		mov		edx, [bb]
>>>>		bsf		eax, [edx+4]
>>>>		xor		eax, 32
>>>>		bsf		eax, [edx]
>>>>		btr		[edx],eax
>>>>	}
>>>>}
>>>>
>>>>Without inlining, your routines and even Walter's magic lsb64 lookup routine
>>>>are all about the same on average, but the nps and solution-time performance
>>>>is significantly worse (4-10%).
>>>>
>>>>Regards,
>>>>Gerd
>>>
>>>This sounds quite normal to me. If you choose to inline them, your shorter
>>>version will be faster. Matt's functions are optimal in the context of
>>>functions. If you choose to inline them, you have to factor in register usage
>>>and out-of-order execution. Your shorter version uses fewer registers, and the
>>>processor can often find OOE opportunities to hide the 2nd bsf instruction.
>>
>>First let me say that we are specifically talking about Athlon.
>>
>>On Athlon, VectorPath instructions (like bsf) decode into a series of macro-ops.
>>OOOE cannot hide the 8 cycles of latency this produces because the microcoded
>>bsf instruction consumes most (all?) of the execution resources. At times I have
>>noted the bsf instruction taking only 7 cycles. Perhaps this is an effect of
>>OOOE.
>>
>I do not have an Athlon. However, I assume it implements bsf, bsr, and btr the
>same as the P4. Yes, it is difficult to hide an instruction with 8 cycles of
>latency. You can manage to hide part of it sometimes, if you have another slow
>instruction in the stream, e.g. a shift operation.

Athlon uses a barrel shifter. Latency of 1 clock. I may be mistaken, but I
believe it has 1 attached to each of its 3 IEUs. Throughput of 3 shifts per
clock.

I don't know how the P4 implements bsf, bsr, and btr, but I know how the Athlon
implements the first two. For bsf/bsr, I am told it uses a binary search
algorithm. I do not know the exact implementation, but they may do something
like x & -x to trap the LSB (for bsf) and then use conditional masks. The bsr
instruction would be similar, but with a different method for trapping the MSB.
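To illustrate the idea, here is a portable C sketch of such a binary-search bit scan; this is my guess at the shape of the algorithm, not AMD's actual implementation:

```c
#include <stdint.h>

/* Hypothetical sketch of a binary-search bit scan: isolate the lowest
   set bit with x & -x, then test it against five constant masks to
   build the index one bit at a time. Result is undefined for x == 0,
   just as with bsf. Illustration only, not AMD's actual hardware. */
static unsigned bsf_sketch(uint32_t x)
{
    uint32_t lsb = x & (0u - x);       /* only the lowest set bit survives */
    unsigned idx = 0;
    if (lsb & 0xFFFF0000u) idx += 16;  /* bit in the high half? */
    if (lsb & 0xFF00FF00u) idx += 8;
    if (lsb & 0xF0F0F0F0u) idx += 4;
    if (lsb & 0xCCCCCCCCu) idx += 2;
    if (lsb & 0xAAAAAAAAu) idx += 1;
    return idx;
}
```

Because the masked tests are independent of each other, a hardware version can evaluate all five levels in parallel logic rather than sequentially.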

I actually timed the algorithm Gerd posted. I don't recall exact numbers, but
there was very little overlap between the two bsf instructions. The total
execution time was at least 20 cycles. Both bsf instructions are listed as 8
cycles, and the btr is listed as 9 cycles. Another sequence of code was 1 cycle
faster and used 2 bsf instructions with a shift to reset the bit. The DirectPath
instructions used to emulate btr took approximately 5 cycles to execute.
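At the C level, the shift-based reset corresponds to building a one-bit mask and xor-ing it into the board; a compiler or hand-coder can emit this as cheap DirectPath mov/shl/xor instead of a microcoded memory-form btr. A minimal sketch of the idea (my wording, with the usual caveat that xor only clears the bit if it was set, which is guaranteed after a successful bit scan):

```c
#include <stdint.h>

/* Clear bit `idx` of a 64-bit board stored as two 32-bit halves, using
   shift+xor -- plain ALU ops -- rather than a memory-form btr. The xor
   assumes the bit is currently set, exactly as in the routines above. */
static void reset_bit(uint32_t bb[2], unsigned idx)
{
    bb[idx >> 5] ^= 1u << (idx & 31);
}
```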

>>Register usage is a non-issue because VC assumes that all general registers are
>>trashed over inline assembly. I can use 1 register or 5 registers; it makes no
>>difference. Besides -- he and I both use 4 registers.
>>
>
>If you watch the VC-generated code with inlined asm, you will find register
>spill-over. The more registers the inline asm uses, the more spill-over. VC
>does not have a way to bind a variable to a register.

No, it doesn't, and that's a shame. However, at 3 registers you already have
spill-over. It really doesn't make a big difference...

I don't remember what I ended up putting in lsb.h, but I know at one point the
routine used 6 registers but preserved 3.

>>>Of course, if you go hand-inlining Matt's code into your code (rewriting all
>>>the callers in asm), you will see the speedup. And you will find even more
>>>pairing opportunities :)
>>
>>Isn't that what the compiler is already doing?
>
>No. I mean the compiler does a bad job when it comes to inline asm. You can
>try it; I bet you won't like the code it generates. It generates some
>unnecessary memory operations to free registers.
>
>>
>>>Have fun
>>>
>>>dzhao

Yes, but both cases were inline assembly...

Throwing stuff out to memory isn't that bad. It can execute out-of-order, and it
should usually hit the L1 when referencing the stack. Reading it back in is the
slow part.

I personally like GCC's form, which allows hints to be given to the compiler as
well as variable registers. I wish it were a little less idiosyncratic, though.
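For the record, GCC's extended asm does exactly that: constraints such as "=r" and "rm" let the compiler pick the registers, so the asm cooperates with its allocator instead of having everything assumed trashed. A minimal x86-only sketch:

```c
#include <stdint.h>

/* GCC extended asm: "=r" lets the compiler choose the output register,
   "rm" lets it pass the input in a register or in memory, and nothing
   else is treated as clobbered -- unlike VC, which assumes all general
   registers are trashed. x86 only; x must be nonzero, since bsf leaves
   the output undefined for a zero input. */
static unsigned bsf_gcc(uint32_t x)
{
    unsigned idx;
    __asm__ ("bsfl %1, %0" : "=r" (idx) : "rm" (x));
    return idx;
}
```

Because the compiler knows the inputs and outputs precisely, it can keep surrounding values live in other registers instead of spilling them around the asm block.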

-Matt




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.