Computer Chess Club Archives


Subject: Re: Bitscan Updated

Author: Dezhi Zhao

Date: 10:34:11 02/17/03


On February 17, 2003 at 03:58:13, Matt Taylor wrote:

>On February 16, 2003 at 22:05:38, Dezhi Zhao wrote:
>
>>On February 16, 2003 at 20:08:00, Matt Taylor wrote:
>>
>>>On February 16, 2003 at 15:55:37, Dezhi Zhao wrote:
>>>
>>>>On February 16, 2003 at 07:37:31, Gerd Isenberg wrote:
>>>>
>>>>>On February 16, 2003 at 05:14:41, Matt Taylor wrote:
>>>>>
>>>>>>On February 14, 2003 at 23:52:09, Matt Taylor wrote:
>>>>>>
>>>>>>>On February 14, 2003 at 14:31:42, Gerd Isenberg wrote:
>>>>>>>
>>>>>>>>I tried two of your routines so far, the cmov/single-bsf one and the
>>>>>>>>LSB_32_table lookup one. It's a mess: both are without any branches, but
>>>>>>>>IsiChess is still 2-4% slower with these. I guess there are some chaotic,
>>>>>>>>code-size-dependent side effects (Athlon XP 2000+, 512MB SDRAM).
>>>>>>>>
>>>>>>>>Regards,
>>>>>>>>Gerd
>>>>>>
>>>>>>What -did- you use before? After thinking about it, I don't see how these
>>>>>>routines could make such an impact. It would be very difficult to shrink them
>>>>>>further.
>>>>>>
>>>>>>Have you tried not inlining them? That would make the resulting code smaller.
>>>>>>
>>>>>>-Matt
>>>>>
>>>>>Hi Matt,
>>>>>
>>>>>this one (posted already x times) is the routine I compared with:
>>>>>
>>>>>__forceinline
>>>>>UINT BitSearchAndReset(BitBoard &bb)
>>>>>{
>>>>>	__asm
>>>>>	{
>>>>>		mov		ebx, [bb]
>>>>>		xor		eax, eax
>>>>>		mov		edx, 1
>>>>>		bsf		ecx, [ebx]
>>>>>		jnz		found
>>>>>		bsf		ecx, [ebx + 4]
>>>>>		lea		ebx, [ebx + 4]
>>>>>		xor		eax, 32
>>>>>	found:
>>>>>		shl		edx, cl
>>>>>		xor		eax, ecx
>>>>>		xor		[ebx], edx
>>>>>	}
>>>>>
>>>>>}
>>>>>
>>>>>To my surprise, this conditional routine using four registers is faster than
>>>>>the short unconditional one below, which uses three VectorPath instructions:
>>>>>
>>>>>__forceinline
>>>>>UINT BitSearchAndReset(BitBoard &bb)
>>>>>{
>>>>>	__asm
>>>>>	{
>>>>>		mov		edx, [bb]
>>>>>		bsf		eax, [edx+4]
>>>>>		xor		eax, 32
>>>>>		bsf		eax, [edx]
>>>>>		btr		[edx],eax
>>>>>	}
>>>>>}
>>>>>
>>>>>Without inlining, your routines and even Walter's magic lsb64 lookup routine
>>>>>are all about the same on average, but the nps and solution-time performance
>>>>>is significantly worse (4-10%).
>>>>>
>>>>>Regards,
>>>>>Gerd
>>>>
>>>>This sounds quite normal to me. If you choose to inline them, your shorter
>>>>version will be faster. Matt's functions are optimal in the context of
>>>>standalone functions. If you choose to inline them, you have to factor in
>>>>register usage and out-of-order execution. Your shorter version uses fewer
>>>>registers, and the processor can often find OOE opportunities to hide the 2nd
>>>>bsf instruction.
>>>
>>>First let me say that we are specifically talking about Athlon.
>>>
>>>On Athlon, VectorPath instructions (like bsf) decode into a series of macro-ops.
>>>OOOE cannot hide the 8 cycles of latency this produces because the microcoded
>>>bsf instruction consumes most (all?) of the execution resources. At times I have
>>>noted the bsf instruction taking only 7 cycles. Perhaps this is an effect of
>>>OOOE.
>>>
>>I do not have an Athlon. However, I assume it implements bsf, bsr, and btr the
>>same way as the P4. Yes, it is difficult to hide an instruction with 8 cycles of
>>latency. You can manage to hide part of it sometimes, if you have another slow
>>instruction in the stream, e.g. a shift operation.
>
>Athlon uses a barrel shifter. Latency of 1 clock. I may be mistaken, but I
>believe it has 1 attached to each of its 3 IEUs. Throughput of 3 shifts per
>clock.
>
>I don't know how the P4 implements bsf, bsr, and btr, but I know how the Athlon
>implements the first two. For bsf/bsr, I am told it uses a binary search
>algorithm. I do not know the exact implementation, but they may do something
>like x & -x to trap the LSB (for bsf) and then use conditional masks. The bsr
>instruction would be similar, but a different method for trapping the MSB.
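
For what it's worth, the x & -x idea is easy to write down in plain C. This is
only a guess at the flavour of such an algorithm, not AMD's actual microcode,
and the function name is made up:

// Illustrative only: isolate the lowest set bit with x & -x, then build
// the bit index with conditional masks (a branchless binary search).
// The result is undefined for x == 0, just like bsf.
unsigned lsb_index(unsigned x)
{
    unsigned b = x & (0u - x);              /* keep only the lowest set bit */
    unsigned r = 0;
    r |= ((b & 0xFFFF0000u) != 0) << 4;
    r |= ((b & 0xFF00FF00u) != 0) << 3;
    r |= ((b & 0xF0F0F0F0u) != 0) << 2;
    r |= ((b & 0xCCCCCCCCu) != 0) << 1;
    r |=  (b & 0xAAAAAAAAu) != 0;
    return r;
}
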
>
>I actually timed the algorithm Gerd posted. I don't recall exact numbers, but
>there was very little overlap between the two bsf instructions. The total
>execution time was at least 20 cycles. Both bsf instructions are listed as 8
>cycles, and the btr is listed as 9 cycles. Another sequence of code was 1 cycle
>faster and used 2 bsf instructions with a shift to reset the bit. The DirectPath
>instructions used to emulate btr took approximately 5 cycles to execute.
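
For anyone who wants to repeat that kind of measurement, a harness along these
lines is usually enough. This is only a sketch of my own, not Matt's actual
test: the routine here is a plain C stand-in for the asm under test, and the
loop overhead is not subtracted, so treat the numbers as relative.

#include <stdio.h>

typedef unsigned __int64 u64;

// Read the time-stamp counter (MSVC inline asm, 32-bit build).
static u64 read_tsc(void)
{
    u64 t;
    __asm {
        rdtsc
        mov dword ptr [t], eax
        mov dword ptr [t + 4], edx
    }
    return t;
}

// Plain C stand-in; drop the real asm routine in here when measuring.
__declspec(noinline)
static unsigned BitSearchAndResetC(u64 *bb)
{
    unsigned i;
    for (i = 0; i < 64; ++i) {
        u64 m = (u64)1 << i;
        if (*bb & m) { *bb ^= m; return i; }
    }
    return 64;                              /* board was empty */
}

int main(void)
{
    const int N = 1000000;
    unsigned sink = 0;
    int i;

    u64 t0 = read_tsc();
    for (i = 0; i < N; ++i) {
        u64 b = 0x0000010000000400ui64;     /* bits 10 and 40 set */
        sink += BitSearchAndResetC(&b);
    }
    u64 t1 = read_tsc();

    printf("sink=%u, ~%.1f cycles/call (loop overhead included)\n",
           sink, (double)(t1 - t0) / N);
    return 0;
}
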
>
>>>Register usage is a non-issue because VC assumes that all general registers are
>>>trashed over inline assembly. I can use 1 register or 5 registers; it makes no
>>>difference. Besides -- he and I both use 4 registers.
>>>
>>
>>If you look at the VC-generated code around inlined asm, you will find register
>>spill-over. The more registers the inline asm uses, the more spill-over. And VC
>>has no way to tell the compiler which register a variable should live in.
>
>No it doesn't, and that's a shame. However, at 3 registers you already have
>spill-over. It really doesn't make a big difference...
>
>I don't remember what I ended up putting in lsb.h, but I know at one point the
>routine used 6 registers but preserved 3.
>
>>>>Of course, if you go hand-inlining Matt's code into your code (rewriting all
>>>>the callers in asm), you will see the speedup. And you will find even more
>>>>pairing opportunities :)
>>>
>>>Isn't that what the compiler is already doing?
>>
>>No. I mean the compiler does a bad job when it comes across inline asm. You can
>>try it; I bet you won't like the code it generates. It generates some
>>unnecessary memory operations to free up registers.
>>
>>>
>>>>Have fun
>>>>
>>>>dzhao
>
>Yes, but both cases were inline assembly...
>

Yes. You can only tell the difference by reading the asm output.
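With VC, compiling with /FAs produces an .asm listing with the source lines
interleaved, which makes the extra loads and stores around an __asm block easy
to spot.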

>Throwing stuff out to memory isn't that bad. It can execute out-of-order, and it
>should usually hit the L1 when referencing the stack. Reading it back in is the
>slow part.
>

It is not very bad, but it is worse than you would expect. Besides saving and
restoring registers to/from the stack around the inlined asm code, the compiler
usually uses more registers in the caller function than necessary, and thus
generates more push/pop operations. More register usage also reduces the chances
of register renaming, as the P4 manual suggests. I have noticed several instances
where VC produces much less optimized code at the function level with inlined
asm, compared to a call to a separate asm function. If the inlined asm sits
inside a loop, things get even worse. Sometimes a function call can even beat
the inlined version.
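
A quick way to see what I mean: take the same routine body (here Gerd's shorter
bsf/btr version quoted above), build it once force-inlined and once as a real
call, drive both with the same loop, and compare the asm listings and the node
rate. The names and the driver loop below are mine, just a sketch:

typedef unsigned __int64 BitBoard;

// Same routine body, built once force-inlined and once as a real call,
// so the code VC generates around it can be compared. As in the original
// routines, the result is left in eax.
__forceinline unsigned ScanInline(BitBoard &bb)
{
    __asm {
        mov   edx, [bb]         ; edx = address of the 64-bit board
        bsf   eax, [edx + 4]    ; scan the high dword first
        xor   eax, 32           ; bias its index by 32
        bsf   eax, [edx]        ; the low dword wins if it is non-zero
        btr   [edx], eax        ; clear the bit that was found
    }
}

__declspec(noinline) unsigned ScanOutOfLine(BitBoard &bb)
{
    __asm {
        mov   edx, [bb]
        bsf   eax, [edx + 4]
        xor   eax, 32
        bsf   eax, [edx]
        btr   [edx], eax
    }
}

// Identical driver loop; build it twice, once per routine, then compare.
int CountAndClear(BitBoard bb)
{
    int n = 0;
    while (bb) {
        ScanInline(bb);         // swap in ScanOutOfLine for the 2nd build
        ++n;
    }
    return n;
}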

>I personally like GCC's form which allows hints to be given to the compiler as
>well as variable registers. I wish it were a little less idiosyncratic, though.
>
>-Matt

I agree. I just wonder why the VC compiler does not incorporate such hints,
since it already asks for more hints from programmers, e.g. __assume, the
branch-taken hint, and the spin-loop hint.
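
For reference, two of those hints look like this in VC; the function names are
made up, the __assume(0)-in-an-unreachable-default idiom is the one from the VC
docs, and _mm_pause from emmintrin.h is the spin-loop ("pause") hint. The
branch-taken hint is a P4 instruction prefix and, as far as I know, has no
C-level intrinsic.

#include <emmintrin.h>              /* _mm_pause: the P4 spin-wait hint */

// __assume(0) in an unreachable default tells the optimizer the switch
// covers every case, so no range check is emitted for it.
int piece_value(int piece)          /* caller guarantees piece is 0..2 */
{
    switch (piece) {
    case 0:  return 100;
    case 1:  return 300;
    case 2:  return 500;
    default: __assume(0);           /* never reached */
    }
}

// _mm_pause keeps a spinning P4 from filling the pipeline with
// speculative loads while it waits for another thread to set the flag.
void spin_until_set(volatile long *flag)
{
    while (*flag == 0)
        _mm_pause();
}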


