Author: Matt Taylor
Date: 02:06:27 12/17/02
On December 17, 2002 at 01:47:27, Walter Faxon wrote:

>On December 16, 2002 at 15:05:56, Gerd Isenberg wrote:
>
>>Hi all,
>>
>>My conclusion with Athlon's bit scan (and reset) routines so far:
>>
>>Inlined bsf is the fastest for IsiChess, because of shortest code!
>>
>>Walter Faxon's magic LSB_64 and bitSearchAndReset_MMX (both faster when not
>>inlined) both make IsiChess about 3-5% slower.  LSB_64 and
>>bitSearchAndReset_MMX are equal in IsiChess.  In a dumb-loop test,
>>bitSearchAndReset_MMX is clearly the fastest (14.9 seconds) for 10^8 * 10 bit
>>resets.
>>
><snip>
>
>It's understandable that even if bsf itself is relatively slow, its not needing
>so many registers to do its work may more than make up the difference.
>
>Still, I'm confused.  If the other two routines are not inlined, there must be
>at least a little procedure call overhead, right?  The registers they need will
>have to be saved/restored either way.  How can _not_ inlining produce faster
>code?  Matt, help!
>
>-- Walter

Well, they were within a matter of cycles of each other in execution time for
one pass.  I'm going to assume Athlon here because I know the timings and
because I know Gerd has one ;)

The x86 only gives you 8 GPRs.  One is the stack pointer, which the compiler
rightfully will not touch.  Another is the frame pointer, which the compiler
often avoids as well.  Only three GPRs can be used "for free": the two major
compilers I see used (Microsoft VC, GCC) assume eax, ecx, and edx may change
across a function call, and they require ebx, esi, edi, and ebp (the frame
pointer) to be preserved.

If the routine were inlined, the compiler could not cache as many variables in
registers across a large calling function.  Each reload costs ~3 cycles of L1
cache latency, while the routines themselves differ by only 1-6 cycles or so
(depending on the positions of the bits in the bitboard).  In a smaller calling
function, inlining would instead incur the cost of saving and restoring extra
registers to get above the three "free" GPRs.

The function call overhead itself is minimal on the Athlon because it has a
function call cache (a return-address stack) that lets it execute call/return
pairs efficiently.  It nests up to 12 levels.

I was trying to work on an assembly version of your code using SSE2
instructions.  That would give native 64-bit arithmetic in a single register
and avoid the register contention, but SSE2 is only available on the Pentium 4
at this time.  Emulating the 64-bit arithmetic, I was getting poor times, and I
wasn't even sure my code was producing the correct result.

You had an interesting suggestion in a previous thread that I had not thought
through very thoroughly until now: if all that is required is the index of a
single set bit, a wider range of implementations becomes possible, with room
for better optimization.

The goal of MMX/3DNow!/SSE is to allow parallel execution of similar tasks.
It strikes me that one could implement an algorithm similar to the one you
gave and run the search on two 32-bit operands in parallel, then pick one of
the two results at the end.  Memory access is the problem: I cannot think of
any way to index memory using an MMX/SSE register.  But I am not fluent in
MMX/SSE, and a way may exist.

-Matt
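
P.S.  For reference, the "inlined bsf" Gerd mentions would look roughly like
this with MSVC-style inline assembly.  This is a rough sketch, not Gerd's
actual code, and the result is undefined when the argument is zero:

static __inline int FirstOne32(unsigned int b)
{
    /* bsf scans upward from bit 0 and leaves the index of the
       lowest set bit in eax; the result is undefined if b == 0,
       so the caller must test for an empty word first. */
    __asm {
        bsf eax, b
    }
    /* the value is returned in eax (MSVC warns about the missing
       return statement, but this is the usual idiom); a 64-bit
       wrapper scans the low dword and falls back to the high
       dword when the low dword is zero */
}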
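
P.P.S.  Here is a scalar sketch of the "search both 32-bit halves, then pick
one" idea from the last paragraph.  The de Bruijn multiply and table are
stand-ins for Walter's LSB_64 math, not his actual constants; an MMX version
would hash the two halves in one packed register instead of one after the
other:

#include <stdio.h>

/* lookup table for the 32-bit de Bruijn constant 0x077CB531 */
static const int lsb32_table[32] = {
     0,  1, 28,  2, 29, 14, 24,  3,
    30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7,
    26, 12, 18,  6, 11,  5, 10,  9
};

/* index of the lowest set bit of a nonzero 32-bit word */
static int lsb32(unsigned int b)
{
    /* isolate the lowest set bit, then hash it to 0..31 */
    return lsb32_table[((b & (0u - b)) * 0x077CB531u) >> 27];
}

/* 64-bit scan built from two independent 32-bit scans; an MMX
   version could compute both hashes at once and pick afterwards.
   The result is meaningless if both halves are zero. */
static int lsb64(unsigned int lo, unsigned int hi)
{
    int lo_index = lsb32(lo);
    int hi_index = lsb32(hi) + 32;
    return lo ? lo_index : hi_index;
}

int main(void)
{
    printf("%d\n", lsb64(0x00000000u, 0x00010000u));  /* prints 48 */
    printf("%d\n", lsb64(0x00000400u, 0xFFFFFFFFu));  /* prints 10 */
    return 0;
}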