Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: My Bit Scan Conclusion

Author: Gerd Isenberg

Date: 03:30:50 12/17/02

Go up one level in this thread


On December 17, 2002 at 05:08:25, Matt Taylor wrote:

>On December 17, 2002 at 03:58:32, Gerd Isenberg wrote:
>
>>On December 17, 2002 at 01:47:27, Walter Faxon wrote:
>>
>>>On December 16, 2002 at 15:05:56, Gerd Isenberg wrote:
>>>
>>>>Hi all,
>>>>
>>>>My conclusion with Athlons bit scan (and reset) routines so far:
>>>>
>>>>Inlined bsf is the fastest for IsiChess, because of shortest code!
>>>>
>>>>Walter Faxon's magic LSB_64 and bitSearchAndReset_MMX (both faster not inlined)
>>>>both produce a about 3-5% slower IsiChess. LSB_64 and bitSearchAndReset_MMX are
>>>>equal in IsiChess. In a dumb-loop test bitSearchAndReset_MMX is clearly the
>>>>fastest (14.9 seconds) for 10^8 * 10 bits resets.
>>>>
>>><snip>
>>>
>>>
>>>It's understandable that even if bsf itself is relatively slow, its not needing
>>>so many registers to do its work may more than make up the difference.
>>>
>>>Still, I'm confused.  If the other two routines are not inlined, there must be
>>>at least a little procedure call overhead, right?  The registers they need will
>>>have to be saved/restored either way.  How can _not_ inlining produce faster
>>>code?  Matt, help!
>>>
>>>-- Walter
>>
>>Hi Walter,
>>
>>it depends on the program of course. In IsiChess inlined LSB_64 is definitely
>>slower than _not_ inlined. Even code must be read into L1-cache, and more code
>>implies more memory reads - there are so many "chaoic" interferences, at least
>>in IsiChess, but i guess also in other programs.
>>
>>Cheers,
>>Gerd
>
>Shouldn't be a big issue, though. They aggressively speculatively fetch code
>into the L1 cache, and the cache is huge. AMD advises in all the optimization
>guides to "aggressively unroll loops [because of large cache size]."
>
>The register contention seems more likely IMO...
>
>-Matt

Hi Matt,

Yes, even amd's optimization guide says:
"Always Inline Functions with Fewer than 25 Machine Instructions"
which is most often true, but it seems not always, at leat in my program ;-)
The bitSearchAndReset_MMX (MM0) uses only eax as general purpose register, and
has only 23 (20) mmx-instructions, but IsiChess is a few percent slower with
inlined rather than not inlined versions of this routines. The opposite is true
with the bsf routine, here the inlined version is clearly favourably.

As i said before, there are so many "chaoic" interferences...
In general i get better results with space rather than speed optimization.

Gerd



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.