Author: Gerd Isenberg
Date: 03:30:11 12/18/02
On December 17, 2002 at 19:51:34, Matt Taylor wrote:

>On December 17, 2002 at 05:06:27, Matt Taylor wrote:
>
>>On December 17, 2002 at 01:47:27, Walter Faxon wrote:
>>
>>>On December 16, 2002 at 15:05:56, Gerd Isenberg wrote:
>>>
>>>>Hi all,
>>>>
>>>>My conclusion with the Athlon's bit scan (and reset) routines so far:
>>>>
>>>>Inlined bsf is the fastest for IsiChess, because of the shortest code!
>>>>
>>>>Walter Faxon's magic LSB_64 and bitSearchAndReset_MMX (both faster when not inlined) both produce an IsiChess that is about 3-5% slower. LSB_64 and bitSearchAndReset_MMX are equal in IsiChess. In a dumb-loop test bitSearchAndReset_MMX is clearly the fastest (14.9 seconds) for 10^8 * 10 bit resets.
>>>
>>><snip>
>>>
>>>It's understandable that even if bsf itself is relatively slow, its not needing so many registers to do its work may more than make up the difference.
>>>
>>>Still, I'm confused. If the other two routines are not inlined, there must be at least a little procedure call overhead, right? The registers they need will have to be saved/restored either way. How can _not_ inlining produce faster code? Matt, help!
>>>
>>>-- Walter
>>
>>Well, they were within a matter of cycles of each other in execution time for one pass. I'm going to go ahead and assume Athlon here because I know the timings and because I know Gerd has one ;)
>>
>>The x86 only gives you 8 GPRs. One is a stack pointer, and the compiler will rightfully not touch it. Another is a frame pointer that the compiler often avoids, too. Three GPRs can be accessed "for free." The two major compilers I see used (Microsoft VC, GCC) assume eax, ecx, and edx change across a function. They require you to preserve ebx, esi, edi, and ebp (the frame pointer).
>>
>>If your code were inlined, it would mean that the compiler can't cache as many variables across a large function. The reloads can cost ~3 cycles of L1 cache latency each, and the timing otherwise differs only by 1-6 cycles or so (depending on positions in the bitboard). In a smaller function, it would incur the latency of preserving additional registers to go above the 3-GPR limit.
>>
>>The function call overhead itself is minimal on Athlon because it has a function call cache which allows it to efficiently execute call/return instructions. It nests up to 12 levels.
>>
>>I was trying to work on an assembly version of your code using SSE2 instructions. It would enable native, atomic 64-bit arithmetic, and it would avoid the register contention. They are only available on the Pentium 4 at this time, however. Emulating the 64-bit arithmetic, I was getting poor times, and I wasn't even sure that my code was getting the correct result.
>>
>>You had an interesting suggestion in a previous thread, and I had not thought it through terribly thoroughly until now; you had said that, if all that was required was a single set bit, it would enable a wider range of implementations with the possibility for better optimization.
>>
>>The goal of MMX/3DNow/SSE is to allow parallel execution of similar tasks. It strikes me that one could implement an algorithm similar to the one you gave and execute a parallel search on two 32-bit operands. Pick either one at the end and return the result. Memory access is problematic as I cannot think of any way to index memory using an MMX/SSE register, but I am not fluent in MMX/SSE, and one may exist.
>>
>>-Matt
>
>I ran the code on my roommate's Pentium 4 system. My conclusion is that my timing code is buggy. There is no way that my crudely hand-optimized routine beat out VC and GCC by that much.
>
>Sigh...
>
>It works well on Athlon. It's awful on Pentium 4. I just don't get it.

Hi Matt,

Same experience here with Kogge-Stone or dumb7fill sliding-attack MMX getters: about two times slower on a 2.0 GHz P4.

Anyway,

>the fastest overall routines I got on Athlon were the GCC and VC routines with mine fluctuating near the same speed. The pi2fd comes in dead last, and bsf has the best times, but it does not have quite as good average-case time.

Interesting. In my dumb-loop test the pi2fd version I posted, with a bitboard reference parameter, is _clearly_ the fastest on my Athlon XP 2.1+. Are your bitboards properly aligned?
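A minimal sketch of the float-exponent idea behind such a bitscan, in portable C with a double conversion standing in for pi2fd; this is only an illustration assuming IEEE-754 doubles and a nonzero bitboard, not the exact routine posted earlier:

#include <string.h>

typedef unsigned long long u64;

static int lsb64_float(u64 b)            /* undefined for b == 0 */
{
    u64 bits;
    double d = (double)(b & (0ULL - b)); /* isolate lowest set bit: 2^i */
    memcpy(&bits, &d, sizeof bits);      /* reinterpret the double bits */
    return (int)(bits >> 52) - 1023;     /* biased exponent -> index i  */
}

The point of the pi2fd/conversion trick is the same: once the lowest bit is isolated, the FPU's normalization does the "scan" and the index falls out of the exponent field.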
>Walter's routine could be made even faster, I think. The end does a number of numeric operations, but they're interdependent. It seems to me that this part of the algorithm could be further improved. I've been too busy to look further into it, but I think some of the reduction can be done in parallel. If so, my little MMX version of his code could become the most efficient one yet. It's also nice because it doesn't tie up precious GPRs.

Yes, I will also try an MMX implementation of Walter's LSB_64.

Is any documentation available on how fast bsf will be on the Hammer? In the German computer magazine c't 26/02 (http://www.heise.de) there was a test with a 1 GHz Athlon 64 prototype (Clawhammer), suggesting MMX/3DNow! performance equal to that of a 1 GHz 32-bit Athlon. Promising results with current 32-bit software, due to better memory latency and bandwidth, but no results so far with "real" 64-bit code.

Cheers,
Gerd

>-Matt
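For reference, a minimal sketch of the split-bsf approach in GCC inline asm for 32-bit x86 (scan the low dword, fall back to the high one); an illustration under those assumptions only, not Matt's timed routine and not Walter's LSB_64:

typedef unsigned long long u64;

static inline int lsb64_bsf(u64 b)       /* assumes b != 0 */
{
    unsigned int lo = (unsigned int)b;
    unsigned int hi = (unsigned int)(b >> 32);
    unsigned int idx;

    if (lo) {
        __asm__ ("bsfl %1, %0" : "=r" (idx) : "rm" (lo));
        return (int)idx;
    }
    __asm__ ("bsfl %1, %0" : "=r" (idx) : "rm" (hi));
    return (int)idx + 32;
}

Whether this beats a folded 64-bit routine still seems to come down to register pressure in the surrounding code, as discussed above.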