Author: Matt Taylor
Date: 00:58:13 02/17/03
On February 16, 2003 at 22:05:38, Dezhi Zhao wrote:
>On February 16, 2003 at 20:08:00, Matt Taylor wrote:
>
>>On February 16, 2003 at 15:55:37, Dezhi Zhao wrote:
>>
>>>On February 16, 2003 at 07:37:31, Gerd Isenberg wrote:
>>>
>>>>On February 16, 2003 at 05:14:41, Matt Taylor wrote:
>>>>
>>>>>On February 14, 2003 at 23:52:09, Matt Taylor wrote:
>>>>>
>>>>>>On February 14, 2003 at 14:31:42, Gerd Isenberg wrote:
>>>>>>
>>>>>>>I tried two of your routines so far, the cmov/single-bsf one and the
>>>>>>>LSB_32_table lookup one. It's a mess, both without any branches, but IsiChess is
>>>>>>>still 2-4% slower with these. I guess some chaotic code-size-dependent side
>>>>>>>effects (Athlon XP 2000+, 512MB SDRAM).
>>>>>>>
>>>>>>>Regards,
>>>>>>>Gerd
>>>>>
>>>>>What -did- you use before? After thinking about it, I don't see how these
>>>>>routines could make such an impact. It would be very difficult to shrink them
>>>>>further.
>>>>>
>>>>>Have you tried not inlining them? That would make the resulting code smaller.
>>>>>
>>>>>-Matt
>>>>
>>>>Hi Matt,
>>>>
>>>>This one (posted x times already) is the routine I compared with:
>>>>
>>>>__forceinline
>>>>UINT BitSearchAndReset(BitBoard &bb)
>>>>{
>>>> __asm
>>>> {
>>>>        mov  ebx, [bb]          ; ebx = address of the 64-bit board
>>>>        xor  eax, eax
>>>>        mov  edx, 1
>>>>        bsf  ecx, [ebx]         ; scan low dword; ZF set if it is zero
>>>>        jnz  found              ; bit found in the low dword
>>>>        bsf  ecx, [ebx + 4]     ; otherwise scan the high dword
>>>>        lea  ebx, [ebx + 4]
>>>>        xor  eax, 32            ; index offset for the high dword
>>>>    found:
>>>>        shl  edx, cl            ; edx = isolated bit
>>>>        xor  eax, ecx           ; eax = final bit index (return value)
>>>>        xor  [ebx], edx         ; reset the bit in the board
>>>> }
>>>>
>>>>}
>>>>
>>>>To my surprise, this conditional routine using four registers is faster than the
>>>>following short, branchless one, which uses three VectorPath instructions:
>>>>
>>>>__forceinline
>>>>UINT BitSearchAndReset(BitBoard &bb)
>>>>{
>>>> __asm
>>>> {
>>>>        mov edx, [bb]           ; edx = address of the 64-bit board
>>>>        bsf eax, [edx+4]        ; tentative index in the high dword
>>>>        xor eax, 32             ; map it to 32..63
>>>>        bsf eax, [edx]          ; overwrites eax only if the low dword != 0
>>>>        btr [edx], eax          ; reset bit 0..63 of the 64-bit board
>>>> }
>>>>}
>>>>
>>>>Without inlining, your routines and even Walter's magic lsb64 lookup routine
>>>>are all about the same on average, but the nps and solution-time performance is
>>>>significantly worse (4-10%).
>>>>
>>>>Regards,
>>>>Gerd
>>>
>>>This sounds quite normal to me. If you choose to inline them, your shorter
>>>version will be faster. Matt's functions are optimal in the context of standalone
>>>functions. If you choose to inline them, you have to factor in register usage and
>>>out-of-order execution. Your shorter version uses fewer registers, and the
>>>processor can often find OOE opportunities to hide the 2nd bsf instruction.
>>
>>First let me say that we are specifically talking about Athlon.
>>
>>On Athlon, VectorPath instructions (like bsf) decode into a series of macro-ops.
>>OOOE cannot hide the 8 cycles of latency this produces because the microcoded
>>bsf instruction consumes most (all?) of the execution resources. At times I have
>>noted the bsf instruction taking only 7 cycles. Perhaps this is an effect of
>>OOOE.
>>
>I do not have an Athlon. However, I assume it implements bsf, bsr, and btr the
>same as the P4. Yes, it is difficult to hide an instruction with 8 cycles of
>latency. You can manage to hide part of it sometimes, if you have another slow
>instruction in the stream, e.g. a shift operation.
The Athlon uses a barrel shifter with a latency of 1 clock. I may be mistaken,
but I believe one is attached to each of its 3 IEUs, for a throughput of 3
shifts per clock.
I don't know how the P4 implements bsf, bsr, and btr, but I know how the Athlon
implements the first two. For bsf/bsr, I am told it uses a binary search
algorithm. I do not know the exact implementation, but they may do something
like x & -x to trap the LSB (for bsf) and then use conditional masks. The bsr
instruction would be similar, but a different method for trapping the MSB.
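The binary-search idea can be sketched in C roughly like this. To be clear, this is only my guess at the flavor of algorithm; the function name and the masks are mine, not AMD's actual microcode:

```c
#include <stdint.h>

/* Hypothetical sketch of "x & -x plus conditional masks": isolate the
 * lowest set bit, then build its index with five mask tests (a binary
 * search over the 32 bit positions). Not AMD's real implementation. */
static unsigned lsb_index(uint32_t x)
{
    uint32_t b = x & (0u - x);          /* isolate the lowest set bit */
    unsigned i = 0;
    if (b & 0xFFFF0000u) i |= 16;       /* is the bit in the high half? */
    if (b & 0xFF00FF00u) i |= 8;
    if (b & 0xF0F0F0F0u) i |= 4;
    if (b & 0xCCCCCCCCu) i |= 2;
    if (b & 0xAAAAAAAAu) i |= 1;
    return i;                           /* meaningless for x == 0, like bsf */
}
```

Each test contributes one bit of the index, so five dependent compare-and-or steps would account for most of the instruction's multi-cycle latency.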
I actually timed the algorithm Gerd posted. I don't recall exact numbers, but
there was very little overlap between the two bsf instructions. The total
execution time was at least 20 cycles. Both bsf instructions are listed as 8
cycles, and the btr is listed as 9 cycles. Another sequence of code was 1 cycle
faster and used 2 bsf instructions with a shift to reset the bit. The DirectPath
instructions used to emulate btr took approximately 5 cycles to execute.
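The "shift to reset the bit" part of that faster sequence looks roughly like this in C (a generic sketch of the technique, not the exact code I timed):

```c
#include <stdint.h>

/* Once bsf has produced the index, xor'ing in 1 << idx clears that bit
 * with two plain DirectPath ALU ops (shift + xor) instead of the
 * microcoded btr. Assumes bit idx is currently set in word. */
static uint32_t reset_bit(uint32_t word, unsigned idx)
{
    return word ^ (1u << idx);
}
```

An alternative with no index dependence at all is `word & (word - 1)`, which clears the lowest set bit directly.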
>>Register usage is a non-issue because VC assumes that all general registers are
>>trashed across inline assembly. I can use 1 register or 5 registers; it makes no
>>difference. Besides -- he and I both use 4 registers.
>>
>
>If you watch the VC-generated code around inlined asm, you will find register
>spill-over. The more registers the inline asm uses, the more spill-over. VC does
>not have a way to bind a variable to a register.
No it doesn't, and that's a shame. However, at 3 registers you already have
spill-over. It really doesn't make a big difference...
I don't remember what I ended up putting in lsb.h, but I know at one point the
routine used 6 registers but preserved 3.
>>>Of course, if you go hand-inlining Matt's code into your code (rewriting all the
>>>callers in asm), you will see the speedup. And you'll find even more pairing
>>>opportunities :)
>>
>>Isn't that what the compiler is already doing?
>
>No. I mean the compiler does a bad job when it comes to inline asm. You can
>try it; I bet you won't like the code it generates. It generates some unnecessary
>memory operations to free registers.
>
>>
>>>Have fun
>>>
>>>dzhao
Yes, but both cases were inline assembly...
Throwing stuff out to memory isn't that bad. It can execute out-of-order, and it
should usually hit the L1 cache when referencing the stack. Reading it back in
is the slow part.
I personally like GCC's form, which allows hints to be given to the compiler as
well as variable register bindings. I wish it were a little less idiosyncratic,
though.
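For comparison, here is a minimal sketch of what that form looks like for bsf (x86 only; the function name is mine):

```c
#include <stdint.h>

/* GCC extended asm: "=r" lets the compiler pick any output register,
 * "rm" allows a register or memory source, and "cc" declares the flags
 * clobbered. No fixed registers, so the compiler is not forced to spill. */
static unsigned bsf32(uint32_t x)
{
    unsigned idx;
    __asm__("bsf %1, %0" : "=r"(idx) : "rm"(x) : "cc");
    return idx;     /* undefined when x == 0, just like the instruction */
}
```

Because the constraints describe the operands instead of naming registers, the compiler can allocate around the asm and avoid the spill-over VC generates.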
-Matt
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.