Author: Matt Taylor
Date: 17:12:50 02/16/03
Go up one level in this thread
On February 16, 2003 at 07:37:31, Gerd Isenberg wrote:
>On February 16, 2003 at 05:14:41, Matt Taylor wrote:
>
>>On February 14, 2003 at 23:52:09, Matt Taylor wrote:
>>
>>>On February 14, 2003 at 14:31:42, Gerd Isenberg wrote:
>>>
>>>>I tried two of your routines so far, the cmov/single-bsf one and the
>>>>LSB_32_table lookup one. It's a mess, both without any branches, but IsiChess is
>>>>stil 2-4% slower with these. I guess some chaotic code size dependent side
>>>>effects (Athlon Xp 2000+, 512MB SDRAM).
>>>>
>>>>Regards,
>>>>Gerd
>>
>>What -did- you use before? After thinking about it, I don't see how these
>>routines could make such an impact. It would be very difficult to shrink them
>>further.
>>
>>Have you tried not inlining them? That would make the resulting code smaller.
>>
>>-Matt
>
>Hi Matt,
>
>this one (posted already x times) is the routine i compared with:
>
>__forceinline
>UINT BitSearchAndReset(BitBoard &bb)
>{
> __asm
> {
> mov ebx, [bb]
> xor eax, eax
> mov edx, 1
> bsf ecx, [ebx]
> jnz found
> bsf ecx, [ebx + 4]
> lea ebx, [ebx + 4]
> xor eax, 32
> found:
> shl edx, cl
> xor eax, ecx
> xor [ebx], edx
> }
>
>}
At least 30 bytes. (Crude estimation; I did not take into account that bb
resolves to something like esp+x or ebp+x.)
>To my surprise this conditional routine using four registers is faster than this
>short unconditional one but using three vector path instructions:
>
>__forceinline
>UINT BitSearchAndReset(BitBoard &bb)
>{
> __asm
> {
> mov edx, [bb]
> bsf eax, [edx+4]
> xor eax, 32
> bsf eax, [edx]
> btr [edx],eax
> }
>}
Yes, each bsf instruction costs 8 cycles, and btr costs another 9.
>With none inlining, your routines and even walters magic lsb64 lookup routine
>are all about the same in average, but the nps and solution time performance is
>significantly worse (4-10%).
>
>Regards,
>Gerd
Yes, the 10+ cycles latency from the function call would tend to minimize the
benefit of 3 saved cycles. :-)
I can't figure out why performance would be better looping the bsf instruction
rather than using cmov and executing it once. The code size isn't much
different; perhaps there is some subtle side-effect at work here? Perhaps my
choice of esi to store bb in was a bad choice. I usually prefer esi to ebx
because esi can be used in string instructions and VC likes to use ebx when
aligning variables on the stack.
-Matt
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.