Author: Dezhi Zhao
Date: 19:05:38 02/16/03
On February 16, 2003 at 20:08:00, Matt Taylor wrote:
>On February 16, 2003 at 15:55:37, Dezhi Zhao wrote:
>
>>On February 16, 2003 at 07:37:31, Gerd Isenberg wrote:
>>
>>>On February 16, 2003 at 05:14:41, Matt Taylor wrote:
>>>
>>>>On February 14, 2003 at 23:52:09, Matt Taylor wrote:
>>>>
>>>>>On February 14, 2003 at 14:31:42, Gerd Isenberg wrote:
>>>>>
>>>>>>I tried two of your routines so far, the cmov/single-bsf one and the
>>>>>>LSB_32_table lookup one. It's a mess, both without any branches, but IsiChess is
>>>>>>still 2-4% slower with these. I guess some chaotic code-size-dependent side
>>>>>>effects (Athlon XP 2000+, 512MB SDRAM).
>>>>>>
>>>>>>Regards,
>>>>>>Gerd
>>>>
>>>>What -did- you use before? After thinking about it, I don't see how these
>>>>routines could make such an impact. It would be very difficult to shrink them
>>>>further.
>>>>
>>>>Have you tried not inlining them? That would make the resulting code smaller.
>>>>
>>>>-Matt
>>>
>>>Hi Matt,
>>>
>>>This one (posted several times already) is the routine I compared with:
>>>
>>>__forceinline
>>>UINT BitSearchAndReset(BitBoard &bb)
>>>{
>>>  __asm
>>>  {
>>>    mov ebx, [bb]      ; ebx = address of the 64-bit board
>>>    xor eax, eax       ; base index 0 (low dword)
>>>    mov edx, 1
>>>    bsf ecx, [ebx]     ; scan low dword, ZF=1 if it is all zero
>>>    jnz found
>>>    bsf ecx, [ebx + 4] ; low dword empty: scan high dword
>>>    lea ebx, [ebx + 4]
>>>    xor eax, 32        ; base index 32 (high dword)
>>>  found:
>>>    shl edx, cl        ; edx = 1 << bit index
>>>    xor eax, ecx       ; return value = base + bit index
>>>    xor [ebx], edx     ; clear the found bit
>>>  }
>>>}
>>>
>>>To my surprise, this conditional routine using four registers is faster than the
>>>following short branchless one, which uses three VectorPath instructions:
>>>
>>>__forceinline
>>>UINT BitSearchAndReset(BitBoard &bb)
>>>{
>>>  __asm
>>>  {
>>>    mov edx, [bb]    ; edx = address of the 64-bit board
>>>    bsf eax, [edx+4] ; scan high dword first
>>>    xor eax, 32      ; bias the result into the 32..63 range
>>>    bsf eax, [edx]   ; a nonzero low dword overwrites eax, else eax is kept
>>>    btr [edx], eax   ; bit-string addressing also clears bits >= 32
>>>  }
>>>}
>>>
>>>Without inlining, your routines and even Walter's magic lsb64 lookup routine
>>>are all about the same on average, but the nps and solution-time performance is
>>>significantly worse (4-10%).
>>>
>>>Regards,
>>>Gerd
>>
>>This sounds quite normal to me. If you choose to inline them, your shorter
>>version will be faster. Matt's functions are optimal in the context of
>>standalone functions. If you choose to inline them, you have to factor in
>>register usage and out-of-order execution. Your shorter version uses fewer
>>registers, and the processor can often find OOE opportunities to hide the 2nd
>>bsf instruction.
>
>First let me say that we are specifically talking about Athlon.
>
>On Athlon, VectorPath instructions (like bsf) decode into a series of macro-ops.
>OOOE cannot hide the 8 cycles of latency this produces because the microcoded
>bsf instruction consumes most (all?) of the execution resources. At times I have
>noted the bsf instruction taking only 7 cycles. Perhaps this is an effect of
>OOOE.
>
I do not have an Athlon. However, I assume it implements bsf, bsr and btr the
same way as the P4. Yes, it is difficult to hide an instruction with an 8-cycle
latency. You can sometimes manage to hide part of it if there is another slow
instruction in the stream, e.g. a shift operation.
>Register usage is a non-issue because VC assumes that all general registers are
>trashed over inline assembly. I can use 1 register of 5 registers; it makes no
>difference. Besides -- he and I both use 4 registers.
>
If you watch the VC-generated code around inlined asm, you will find register
spills. The more registers the inline asm uses, the more spills. VC has no way
to bind a variable to a particular register.
>>Of course, if you go hand-inlining Matt's code into your code (rewrite all the
>>callers in asm), you will see the speedup. And you will find even more pairing
>>opportunities :)
>
>Isn't that what the compiler is already doing?
No. I mean the compiler does a bad job when it encounters inline asm. You can
try it; I bet you won't like the code it generates. It emits some unnecessary
memory operations to free up registers.
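To illustrate the point, the same operation can be written in plain C, so the compiler stays free to inline it and do its own register allocation with no inline-asm barrier. This is a sketch assuming the bitboard is stored as two little-endian 32-bit halves; the function names are mine, not from the code above:

```c
#include <stdint.h>

/* Naive placeholder bit scan; stands in for any of the lsb routines
   discussed above. x must be nonzero. */
static int lsb32(uint32_t x)
{
    int i = 0;
    while ((x & 1) == 0) { x >>= 1; ++i; }
    return i;
}

/* Portable sketch of find-lowest-set-bit-and-clear over a 64-bit board
   kept as two little-endian 32-bit halves. Plain C, so the compiler can
   inline it and allocate registers freely. */
int bit_search_and_reset_c(uint32_t bb[2])
{
    int half = (bb[0] == 0);        /* 0 = low dword, 1 = high dword */
    int bit  = lsb32(bb[half]);
    bb[half] &= bb[half] - 1;       /* clear the found bit */
    return bit + 32 * half;
}
```

The naive loop in lsb32 would of course be replaced by one of the bit-scan techniques discussed in this thread; the point here is only that the surrounding code stays fully visible to the optimizer.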
>
>>Have fun
>>
>>dzhao
>
>-Matt
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.