Computer Chess Club Archives



Subject: Re: My Bit Scan Conclusion

Author: Matt Taylor

Date: 02:06:27 12/17/02

On December 17, 2002 at 01:47:27, Walter Faxon wrote:

>On December 16, 2002 at 15:05:56, Gerd Isenberg wrote:
>
>>Hi all,
>>
>>My conclusion about the Athlon's bit scan (and reset) routines so far:
>>
>>Inlined bsf is the fastest for IsiChess, because it has the shortest code!
>>
>>Walter Faxon's magic LSB_64 and bitSearchAndReset_MMX (both faster when not
>>inlined) each produce an IsiChess that is about 3-5% slower. LSB_64 and
>>bitSearchAndReset_MMX are equal in IsiChess. In a dumb-loop test,
>>bitSearchAndReset_MMX is clearly the fastest (14.9 seconds) for 10^8 * 10 bit
>>resets.
>>
><snip>
>
>
>It's understandable that even if bsf itself is relatively slow, the fact that
>it doesn't need so many registers to do its work may more than make up the
>difference.
>
>Still, I'm confused.  If the other two routines are not inlined, there must be
>at least a little procedure call overhead, right?  The registers they need will
>have to be saved/restored either way.  How can _not_ inlining produce faster
>code?  Matt, help!
>
>-- Walter

Well, they were within a matter of cycles of each other in execution time for
one pass. I'm going to go ahead and assume Athlon here because I know the
timings and because I know Gerd has one ;)

The x86 only gives you 8 GPRs. One is the stack pointer, which the compiler
rightfully won't touch, and another is the frame pointer, which the compiler
often avoids as well. Only three GPRs can be used "for free": the two major
compilers I see used (Microsoft VC, GCC) assume eax, ecx, and edx change across
a function call, and they require you to preserve ebx, esi, edi, and ebp (the
frame pointer).
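
The difference shows up directly with GCC-style inline asm (a minimal sketch,
32-bit non-PIC build assumed; the function names are just for illustration):
clobbering only the scratch registers costs nothing, while clobbering the
preserved ones forces the compiler to push and pop them.

/* Clobbers only the call-clobbered registers: no save/restore is emitted. */
void uses_scratch_regs(void)
{
    __asm__ volatile ("" ::: "eax", "ecx", "edx");
}

/* Clobbers callee-saved registers: the compiler must preserve them around
   this function, which is exactly the overhead being discussed. */
void uses_preserved_regs(void)
{
    __asm__ volatile ("" ::: "ebx", "esi", "edi");
}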

If the code were inlined, the compiler couldn't cache as many variables in
registers across a large function. Each reload costs ~3 cycles of L1 cache
latency, and the routines otherwise differ by only 1-6 cycles or so (depending
on the positions of the bits in the bitboard). In a smaller function, inlining
would instead incur the cost of saving and restoring extra registers to go
above the three "free" GPRs.
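
For what it's worth, the inlined bsf probably looks something like this (a
minimal sketch with GCC-style inline asm; the bsf32/bsf64 names and the
two-32-bit-halves layout are my guesses, not necessarily Gerd's exact code):

/* Scan the low 32 bits with bsf; only one scratch register beyond the input
   is needed, so the calling function keeps its cached variables in registers.
   The result is undefined if x == 0, just like the bare instruction. */
static inline int bsf32(unsigned int x)
{
    int idx;
    __asm__ ("bsf %1, %0" : "=r" (idx) : "r" (x));
    return idx;
}

/* 64-bit bitboard stored as two 32-bit halves: try the low half first. */
static inline int bsf64(unsigned int lo, unsigned int hi)
{
    return lo ? bsf32(lo) : 32 + bsf32(hi);
}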

The function call overhead itself is minimal on the Athlon because it has a
return-address stack (effectively a function call cache) that lets it execute
call/return pairs efficiently. It nests up to 12 levels deep.

I was trying to work on an assembly version of your code using SSE2
instructions. They would give native 64-bit arithmetic in a single operation
and would avoid the register contention, but they are only available on the
Pentium 4 at this time. Emulating the 64-bit arithmetic instead, I was getting
poor times, and I wasn't even sure my code was producing the correct result.
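
To make concrete what native 64-bit SSE2 arithmetic buys here, the classic
x & -x isolation fits in one XMM operation (a sketch with SSE2 intrinsics, not
the actual assembly I was trying; getting the bit *index* back out still has
to go through a GPR):

#include <emmintrin.h>

/* Isolate the lowest set bit of each 64-bit lane via x & -x.  SSE2's psubq
   gives real 64-bit subtraction, so no carry emulation across 32-bit halves
   is needed. */
static __m128i isolate_lsb64(__m128i x)
{
    __m128i neg = _mm_sub_epi64(_mm_setzero_si128(), x);  /* 0 - x = -x */
    return _mm_and_si128(x, neg);                         /* x & -x     */
}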

You made an interesting suggestion in a previous thread that I hadn't thought
through very thoroughly until now: if all that is required is a single set
bit, a wider range of implementations becomes possible, with room for better
optimization.

The goal of MMX/3DNow!/SSE is to allow parallel execution of similar tasks. It
strikes me that one could implement an algorithm similar to the one you gave
and run the search on two 32-bit operands in parallel, picking either result
at the end. Memory access is problematic, as I cannot think of any way to
index memory with an MMX/SSE register, but I am not fluent in MMX/SSE, and
such a way may exist.
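
In plain C, the shape of that idea is roughly this (a sketch using a standard
de Bruijn multiply-and-lookup scan, not necessarily your exact LSB_64; a SIMD
version could do the multiply and shift on both halves at once, but the table
lookup would have to come back to a GPR for addressing):

/* Standard de Bruijn lookup table for the 0x077CB531 multiplier. */
static const int debruijn_lsb[32] = {
     0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

/* Index of the lowest set bit of a nonzero 32-bit word. */
static int lsb32(unsigned int x)
{
    return debruijn_lsb[((x & (0u - x)) * 0x077CB531u) >> 27];
}

/* 64-bit bitboard as two 32-bit halves, (lo | hi) != 0 assumed.  Both halves
   could be scanned in parallel; at the end only one result is picked, which
   is what the SIMD version would also have to do. */
static int lsb64(unsigned int lo, unsigned int hi)
{
    return lo ? lsb32(lo) : 32 + lsb32(hi);
}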

-Matt


