Author: Matt Taylor
Date: 17:08:00 02/16/03
On February 16, 2003 at 15:55:37, Dezhi Zhao wrote:
>On February 16, 2003 at 07:37:31, Gerd Isenberg wrote:
>
>>On February 16, 2003 at 05:14:41, Matt Taylor wrote:
>>
>>>On February 14, 2003 at 23:52:09, Matt Taylor wrote:
>>>
>>>>On February 14, 2003 at 14:31:42, Gerd Isenberg wrote:
>>>>
>>>>>I tried two of your routines so far, the cmov/single-bsf one and the
>>>>>LSB_32_table lookup one. It's a mess, both without any branches, but IsiChess is
>>>>>still 2-4% slower with these. I guess some chaotic, code-size-dependent side
>>>>>effects (Athlon XP 2000+, 512MB SDRAM).
>>>>>
>>>>>Regards,
>>>>>Gerd
>>>
>>>What -did- you use before? After thinking about it, I don't see how these
>>>routines could make such an impact. It would be very difficult to shrink them
>>>further.
>>>
>>>Have you tried not inlining them? That would make the resulting code smaller.
>>>
>>>-Matt
>>
>>Hi Matt,
>>
>>This one (posted already x times) is the routine I compared with:
>>
>>__forceinline
>>UINT BitSearchAndReset(BitBoard &bb)
>>{
>> __asm
>> {
>> mov ebx, [bb]
>> xor eax, eax
>> mov edx, 1
>> bsf ecx, [ebx]
>> jnz found
>> bsf ecx, [ebx + 4]
>> lea ebx, [ebx + 4]
>> xor eax, 32
>> found:
>> shl edx, cl
>> xor eax, ecx
>> xor [ebx], edx
>> }
>>
>>}
>>
>>To my surprise, this conditional routine using four registers is faster than the
>>short unconditional one below, which uses three VectorPath instructions:
>>
>>__forceinline
>>UINT BitSearchAndReset(BitBoard &bb)
>>{
>> __asm
>> {
>> mov edx, [bb]
>> bsf eax, [edx+4]
>> xor eax, 32
>> bsf eax, [edx]
>> btr [edx],eax
>> }
>>}
>>
>>Without inlining, your routines and even Walter's magic lsb64 lookup routine
>>are all about the same on average, but the nps and solution-time performance is
>>significantly worse (4-10%).
>>
>>Regards,
>>Gerd
>
>This sounds quite normal to me. If you choose to inline them, your shorter
>version will be faster. Matt's functions are optimal in the context of functions.
>If you choose to inline them, you have to factor in register usage and
>out-of-order execution. Your shorter version uses fewer registers and the
>processor can often find OOE opportunities to hide the 2nd bsf instruction.
First let me say that we are specifically talking about Athlon.
On Athlon, VectorPath instructions (like bsf) decode into a series of macro-ops.
OOOE cannot hide the 8 cycles of latency this produces because the microcoded
bsf instruction consumes most (all?) of the execution resources. At times I have
noted the bsf instruction taking only 7 cycles. Perhaps this is an effect of
OOOE.
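For anyone who wants to check the cycle count on their own box, a rough way is to time a long dependent chain of bsf instructions with rdtsc. The sketch below is only illustrative (function name and iteration count are made up, and it is not the harness used for the numbers above); the or adds about one cycle per iteration and rdtsc has its own overhead, so treat the result as an estimate.

#include <cstdio>

// Rough sketch: estimate bsf latency by timing a serial bsf/or chain with rdtsc
// and dividing by the iteration count (32-bit MSVC-style inline asm assumed).
unsigned int EstimateBsfCycles()
{
    unsigned int start = 0, stop = 0;

    __asm
    {
        mov  ecx, 10000        ; iteration count
        mov  ebx, 0x80000000   ; operand always has bit 31 set, so bsf never sees zero
        rdtsc                  ; read time-stamp counter
        mov  start, eax
    chain:
        bsf  ebx, ebx          ; result feeds the next iteration (serial dependency)
        or   ebx, 0x80000000   ; rebuild a non-zero operand from the bsf result
        dec  ecx
        jnz  chain
        rdtsc
        mov  stop, eax
    }

    return (stop - start) / 10000;   // roughly bsf latency + 1 cycle for the or
}

int main()
{
    printf("~%u cycles per bsf/or iteration\n", EstimateBsfCycles());
    return 0;
}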
Register usage is a non-issue because VC assumes that all general registers are
trashed across inline assembly. I can use 1 register or 5 registers; it makes no
difference. Besides -- he and I both use 4 registers.
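For reference, here is a sketch of the kind of table-lookup routine Gerd mentioned trying (the LSB_32_table one). It is not his code -- just the standard 32-bit de Bruijn multiply-and-lookup applied to whichever half of the board is non-zero, written in plain C++ so no bsf is emitted at all. The typedef and names are assumptions, and whether the compiler turns the two selects into cmov or a branch is up to it, so it still has to be measured.

// Sketch only: de Bruijn multiply-and-lookup LSB extraction, clear-and-return.
// Assumes BitBoard is an unsigned 64-bit integer and bb is non-zero on entry.
typedef unsigned __int64 BitBoard;

static const unsigned int lsbTable[32] =
{
     0,  1, 28,  2, 29, 14, 24,  3,
    30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7,
    26, 12, 18,  6, 11,  5, 10,  9
};

inline unsigned int BitSearchAndReset_C(BitBoard &bb)
{
    unsigned int lo   = (unsigned int)bb;          // low 32 bits
    unsigned int hi   = (unsigned int)(bb >> 32);  // high 32 bits
    unsigned int word = lo ? lo : hi;               // half that holds the lowest set bit
    unsigned int base = lo ? 0u : 32u;
    unsigned int isolated = word & (0u - word);     // isolate the lowest set bit
    unsigned int index = base + lsbTable[(isolated * 0x077CB531u) >> 27];
    bb &= bb - 1;                                   // clear the lowest set bit of the board
    return index;
}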
>Of course, if you go hand-inlining Matt's code into your code (rewrite all the
>callers in asm), you will see the speedup. And you will find even more pairing
>opportunities :)
Isn't that what the compiler is already doing?
>Have fun
>
>dzhao
-Matt