Author: Gerd Isenberg
Date: 03:30:11 12/18/02
On December 17, 2002 at 19:51:34, Matt Taylor wrote:

>On December 17, 2002 at 05:06:27, Matt Taylor wrote:
>
>>On December 17, 2002 at 01:47:27, Walter Faxon wrote:
>>
>>>On December 16, 2002 at 15:05:56, Gerd Isenberg wrote:
>>>
>>>>Hi all,
>>>>
>>>>My conclusion with the Athlon's bit scan (and reset) routines so far:
>>>>
>>>>Inlined bsf is the fastest for IsiChess, because of the shortest code!
>>>>
>>>>Walter Faxon's magic LSB_64 and bitSearchAndReset_MMX (both faster when not inlined) both produce an IsiChess that is about 3-5% slower. LSB_64 and bitSearchAndReset_MMX are equal in IsiChess. In a dumb-loop test bitSearchAndReset_MMX is clearly the fastest (14.9 seconds) for 10^8 * 10 bit resets.
>>>
>>><snip>
>>>
>>>It's understandable that even if bsf itself is relatively slow, its not needing so many registers to do its work may more than make up the difference.
>>>
>>>Still, I'm confused. If the other two routines are not inlined, there must be at least a little procedure call overhead, right? The registers they need will have to be saved/restored either way. How can _not_ inlining produce faster code? Matt, help!
>>>
>>>-- Walter
>>
>>Well, they were within a matter of cycles of each other in execution time for one pass. I'm going to go ahead and assume Athlon here because I know the timings and because I know Gerd has one ;)
>>
>>The x86 only gives you 8 GPRs. One is a stack pointer, and the compiler will rightfully not touch it. Another is a frame pointer that the compiler often avoids, too. Three GPRs can be accessed "for free." The two major compilers I see used (Microsoft VC, GCC) assume eax, ecx, and edx change across a function. They require you to preserve ebx, esi, edi, and ebp (the frame pointer).
>>
>>If your code were inlined, it would mean that the compiler can't cache as many variables across a large function. The reloads can cost ~3 cycles of L1 cache latency each, and the timing otherwise differs only by 1-6 cycles or so (depending on positions in the bitboard). In a smaller function, it would incur the latency of preserving additional registers to go above the 3-GPR limit.
>>
>>The function call overhead itself is minimal on Athlon because it has a function call cache which allows it to efficiently execute call/return instructions. It nests up to 12 levels.
>>
>>I was trying to work on an assembly version of your code using SSE2 instructions. It would enable native, atomic 64-bit arithmetic, and it would avoid the register contention. They are only available on the Pentium 4 at this time, however. Emulating the 64-bit arithmetic, I was getting poor times, and I wasn't even sure that my code was getting the correct result.
>>
>>You had an interesting suggestion in a previous thread, and I had not thought it through terribly thoroughly until now; you had said that, if all that was required was a single set bit, it would enable a wider range of implementations with the possibility for better optimization.
>>
>>The goal of MMX/3DNow/SSE is to allow parallel execution of similar tasks. It strikes me that one could implement an algorithm similar to the one you gave and execute a parallel search on two 32-bit operands. Pick either one at the end and return the result. Memory access is problematic as I cannot think of any way to index memory using an MMX/SSE register, but I am not fluent in MMX/SSE, and one may exist.
>>
>>-Matt
>
>I ran the code on my roommate's Pentium 4 system. My conclusion is that my timing code is buggy. There is no way that my crudely hand-optimized routine beat out VC and GCC by that much.
>
>Sigh...
>
>It works well on Athlon. It's awful on Pentium 4. I just don't get it.

Hi Matt,

Same experience here with Kogge-Stone or dumb7fill sliding-attack MMX getters: about two times slower on a 2.0 GHz P4.

Anyway,

>the fastest overall routines I got on Athlon were the GCC and VC routines with mine fluctuating near the same speed. The pi2fd comes in dead last, and bsf has the best times, but it does not have quite as good average-case time.

Interesting. In my dumb-loop test the pi2fd version I posted, with a bitboard reference parameter, is _clearly_ the fastest on my Athlon XP 2.1+. Are your bitboards properly aligned?
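A minimal sketch of the float-exponent idea behind such a bitscan, in portable C with a double conversion standing in for pi2fd; this is only an illustration assuming IEEE-754 doubles and a nonzero bitboard, not the exact routine posted earlier:

#include <string.h>

typedef unsigned long long u64;

static int lsb64_float(u64 b)            /* undefined for b == 0 */
{
    u64 bits;
    double d = (double)(b & (0ULL - b)); /* isolate lowest set bit: 2^i */
    memcpy(&bits, &d, sizeof bits);      /* reinterpret the double bits */
    return (int)(bits >> 52) - 1023;     /* biased exponent -> index i  */
}

The point of the pi2fd/conversion trick is the same: once the lowest bit is isolated, the FPU's normalization does the "scan" and the index falls out of the exponent field.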
>Walter's routine could be made even faster, I think. The end does a number of numeric operations, but they're interdependent. It seems to me that this part of the algorithm could be further improved. I've been too busy to look further into it, but I think some of the reduction can be done in parallel. If so, my little MMX version of his code could become the most efficient one yet. It's also nice because it doesn't tie up precious GPRs.

Yes, I will also try an MMX implementation of Walter's LSB_64.

Is any documentation available on how fast bsf will be on the Hammer? In the German computer magazine c't 26/02 (http://www.heise.de) there was a test with a 1 GHz Athlon 64 prototype (Clawhammer), suggesting MMX/3DNow! performance equal to that of a 1 GHz 32-bit Athlon. Promising results with current 32-bit software, due to better memory latency and bandwidth, but no results so far with "real" 64-bit code.

Cheers,
Gerd

>-Matt
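For reference, a minimal sketch of the split-bsf approach in GCC inline asm for 32-bit x86 (scan the low dword, fall back to the high one); an illustration under those assumptions only, not Matt's timed routine and not Walter's LSB_64:

typedef unsigned long long u64;

static inline int lsb64_bsf(u64 b)       /* assumes b != 0 */
{
    unsigned int lo = (unsigned int)b;
    unsigned int hi = (unsigned int)(b >> 32);
    unsigned int idx;

    if (lo) {
        __asm__ ("bsfl %1, %0" : "=r" (idx) : "rm" (lo));
        return (int)idx;
    }
    __asm__ ("bsfl %1, %0" : "=r" (idx) : "rm" (hi));
    return (int)idx + 32;
}

Whether this beats a folded 64-bit routine still seems to come down to register pressure in the surrounding code, as discussed above.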