Author: Matt Taylor
Date: 16:51:34 12/17/02
On December 17, 2002 at 05:06:27, Matt Taylor wrote:

>On December 17, 2002 at 01:47:27, Walter Faxon wrote:
>
>>On December 16, 2002 at 15:05:56, Gerd Isenberg wrote:
>>
>>>Hi all,
>>>
>>>My conclusion with the Athlon's bit scan (and reset) routines so far:
>>>
>>>Inlined bsf is the fastest for IsiChess, because of shortest code!
>>>
>>>Walter Faxon's magic LSB_64 and bitSearchAndReset_MMX (both faster when not
>>>inlined) both produce an IsiChess that is about 3-5% slower. LSB_64 and
>>>bitSearchAndReset_MMX are equal in IsiChess. In a dumb-loop test,
>>>bitSearchAndReset_MMX is clearly the fastest (14.9 seconds) for 10^8 * 10
>>>bit resets.
>>>
>><snip>
>>
>>It's understandable that even if bsf itself is relatively slow, its not
>>needing so many registers to do its work may more than make up the
>>difference.
>>
>>Still, I'm confused. If the other two routines are not inlined, there must
>>be at least a little procedure-call overhead, right? The registers they
>>need will have to be saved/restored either way. How can _not_ inlining
>>produce faster code? Matt, help!
>>
>>-- Walter
>
>Well, they were within a matter of cycles of each other in execution time
>for one pass. I'm going to go ahead and assume Athlon here because I know
>the timings and because I know Gerd has one ;)
>
>The x86 only gives you 8 GPRs. One is the stack pointer, and the compiler
>will rightfully not touch it. Another is the frame pointer, which the
>compiler often avoids as well. Only three GPRs can be accessed "for free":
>the two major compilers I see used (Microsoft VC, GCC) assume eax, ecx, and
>edx change across a function call, and require you to preserve ebx, esi,
>edi, and ebp (the frame pointer).
>
>If your code were inlined, the compiler couldn't cache as many variables in
>registers across a large function. The reloads can cost ~3 cycles of L1
>cache latency each, and the timing otherwise differs only by 1-6 cycles or
>so (depending on the positions of the bits in the bitboard).
>In a smaller function, inlining would incur the latency of preserving
>additional registers to go above the 3-GPR limit.
>
>The function-call overhead itself is minimal on the Athlon because it has a
>function call cache which allows it to execute call/return instructions
>efficiently. It nests up to 12 levels.
>
>I was trying to work on an assembly version of your code using SSE2
>instructions. It would enable native, atomic 64-bit arithmetic, and it
>would avoid the register contention. They are only available on the
>Pentium 4 at this time, however. Emulating the 64-bit arithmetic, I was
>getting poor times, and I wasn't even sure that my code was getting the
>correct result.
>
>You had an interesting suggestion in a previous thread, and I had not
>thought it through terribly thoroughly until now: you had said that, if all
>that was required was a single set bit, it would enable a wider range of
>implementations with the possibility for better optimization.
>
>The goal of MMX/3DNow!/SSE is to allow parallel execution of similar tasks.
>It strikes me that one could implement an algorithm similar to the one you
>gave and execute a parallel search on two 32-bit operands, pick either one
>at the end, and return the result. Memory access is problematic, as I
>cannot think of any way to index memory using an MMX/SSE register, but I am
>not fluent in MMX/SSE, and one may exist.
>
>-Matt

I ran the code on my roommate's Pentium 4 system. My conclusion is that my
timing code is buggy. There is no way that my crudely hand-optimized routine
beat VC and GCC by that much. Sigh...

It works well on the Athlon; it's awful on the Pentium 4. I just don't get
it. Anyway, the fastest overall routines I got on the Athlon were the GCC
and VC routines, with mine fluctuating near the same speed. The pi2fd
version comes in dead last, and bsf has the best times, though not quite as
good an average-case time.

Walter's routine could be made even faster, I think.
The end of it does a number of numeric operations, but they're
interdependent. It seems to me that this part of the algorithm could be
further improved. I've been too busy to look further into it, but I think
some of the reduction can be done in parallel. If so, my little MMX version
of his code could become the most efficient one yet. It's also nice because
it doesn't tie up precious GPRs.

-Matt