Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: My Bit Scan Conclusion

Author: Matt Taylor

Date: 07:29:44 12/18/02

Go up one level in this thread


On December 18, 2002 at 06:30:11, Gerd Isenberg wrote:

>On December 17, 2002 at 19:51:34, Matt Taylor wrote:
>
>>On December 17, 2002 at 05:06:27, Matt Taylor wrote:
>>
>>>On December 17, 2002 at 01:47:27, Walter Faxon wrote:
>>>
>>>>On December 16, 2002 at 15:05:56, Gerd Isenberg wrote:
>>>>
>>>>>Hi all,
>>>>>
>>>>>My conclusion with Athlons bit scan (and reset) routines so far:
>>>>>
>>>>>Inlined bsf is the fastest for IsiChess, because of shortest code!
>>>>>
>>>>>Walter Faxon's magic LSB_64 and bitSearchAndReset_MMX (both faster not inlined)
>>>>>both produce a about 3-5% slower IsiChess. LSB_64 and bitSearchAndReset_MMX are
>>>>>equal in IsiChess. In a dumb-loop test bitSearchAndReset_MMX is clearly the
>>>>>fastest (14.9 seconds) for 10^8 * 10 bits resets.
>>>>>
>>>><snip>
>>>>
>>>>
>>>>It's understandable that even if bsf itself is relatively slow, its not needing
>>>>so many registers to do its work may more than make up the difference.
>>>>
>>>>Still, I'm confused.  If the other two routines are not inlined, there must be
>>>>at least a little procedure call overhead, right?  The registers they need will
>>>>have to be saved/restored either way.  How can _not_ inlining produce faster
>>>>code?  Matt, help!
>>>>
>>>>-- Walter
>>>
>>>Well, they were within a matter of cycles of each other in execution time for
>>>one pass. I'm going to go ahead and assume Athlon here because I know the
>>>timings and because I know Gerd has one ;)
>>>
>>>The x86 only gives you 8 GPRs. One is a stack pointer, and the compiler will
>>>rightfully not touch it. Another is a frame pointer that the compiler often
>>>avoids, too. Three GPRs can be accessed "for free." The two major compilers I
>>>see used (Microsoft VC, GCC) assume eax, ecx, and edx change across a function.
>>>They require you to preserve ebx, esi, edi, and ebp (frame pointer).
>>>
>>>If your code were inline, it would mean that the compiler can't cache as many
>>>variables across a large function. The reloads can cost ~3 cycles L1 cache
>>>latency each, and the timing otherwise differs only by 1-6 cycles or so
>>>(depending on positions in the bitboard). In a smaller function, it would incur
>>>the latency of preserving additional registers to go above the 3 GPR limit.
>>>
>>>The function call overhead itself is minimal on Athlon because they have a
>>>function call cache which allows them to efficiently execute call/return
>>>instructions. It nests up to 12 levels.
>>>
>>>I was trying to work on an assembly version of your code using SSE 2
>>>instructions. It would enable 64-bit native, atomic arithmetic, and it would
>>>avoid the register contention. They are only available on Pentium 4 at this
>>>time, however. Emulating the 64-bit arithmetic, I was getting poor times, and I
>>>wasn't even sure that my code was getting the correct result.
>>>
>>>You had an interesting suggestion in a previous thread, and I had not thought it
>>>through terribly thoroughly until now; you had said that, if all that was
>>>required was a single set bit, it would enable a wider range of implementations
>>>with the possibility for better optimization.
>>>
>>>The goal of MMX/3DNow/SSE is to allow parallel execution of similar tasks. It
>>>strikes me that one could implement an algorithm similar to the one you gave and
>>>execute a parallel search on 2 32-bit operands. Pick either one at the end and
>>>return the result. Memory access is problematic as I cannot think of any way to
>>>index memory using an MMX/SSE register, but I am not fluent in MMX/SSE, and one
>>>may exist.
>>>
>>>-Matt
>>
>>I ran the code on my roommate's Pentium 4 system. My conclusion is that my
>>timing code is buggy. There is no way that my crudely hand-optimized routine
>>beat out VC and GCC by that much.
>>
>>Sigh...
>>
>>It works well on Athlon. It's awful on Pentium 4. I just don't get it.
>
>Hi Matt,
>
>Same experience with Kogge-Stone or dumb7fill sliding attack mmx-getters.
>About two times slower on P4 2.0GHz.

Yes, but I meant the timing code itself does not work on the P4. I can't
possibly see how the results could be right.

> Anyway,
>>the fastest overall routines I got on Athlon were the GCC and VC routines with
>>mine fluctuating near the same speed. The pi2fd comes in dead last, and bsf has
>>the best times, but it does not have quite as good average-case time.
>
>Interesting. In my dumb loop test the pi2fd version i posted with bitboard
>ref-parameter is _clearly_ the fastest on my Athlon XP 2.1+. Are your bitboards
>properly aligned?

Not explicitly. I'll try explicitly aligning it. I've already removed cache
penalties, but if it's unaligned, it will still cause problems.

I'll try that. Thanks for the suggestion.

Part of the discrepancy here is also that you did a real-world test and I'm
comparing performance with variables stripped out (e.g. cache, function call
overheads, etc.) Other factors are obviously going to affect this, particularly
when the code gets inlined.

>>Walter's routine could be made even faster, I think. The end does a number of
>>numeric operations, but they're interdependant. It seems to me that this part of
>>the algorithm could be further improved. I've been too busy to look further into
>>it, but I think some of the reduction can be done in parallel. If so, my little
>>MMX version of his code could become the most efficient one yet. It's also nice
>>because it doesn't tie up precious GPRs.
>
>Yes, I will also try an mmx-implementation of Walter's LSB_64.
>
>Any documentation available, how fast bsf will be on hammer? In the german
>computer chess magazin c't 26/02 (http://www.heise.de) there was a test with an
>1GHz Athlon 64 prototype (Clawhammer), suggesting an equal
>mmx/3DNow!-performance compared to an 1GHz Athlon 32. Promising results with
>current 32-bit software due to better memory latency and bandwidth. But no
>results so far with "real" 64-bit code.
>
>Cheers,
>Gerd
>
>>
>>-Matt



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.