Computer Chess Club Archives

Subject: Re: 64Bit optimize coding - My experience (AMD64)

Author: Andreas Guettinger

Date: 04:14:27 12/22/05

On December 22, 2005 at 02:08:46, Frank E. Oldham wrote:

>On December 21, 2005 at 17:09:02, Andreas Guettinger wrote:
>
>>On December 21, 2005 at 16:18:19, Tord Romstad wrote:
>>
>>>On December 21, 2005 at 15:30:07, Andreas Guettinger wrote:
>>>
>>>>Well, the problem is that the -DINLINE_PPC flag does nothing on your sources. :)
>>>>I realized that a big part of the speedup seems to come from the asm
>>>>bit-manipulation routines.
>>>
>>>I see.  Thanks for the code and explanations.
>>>
>>>Is there any easy way to achieve similar speeds with plain, portable C
>>>code?  If resorting to inline assembly language is really necessary to
>>>produce a fast 64 bit binary, I would be very disappointed.  I detest
>>>assembly language.
>>>
>>>Tord
>>
>>Well, it's actually only two (2) lines of assembly code (one instruction) if you
>>look at it.
>>I'm quite surprised that the speedup from the assembly instruction cntlzd (count
>>leading zeros) is that high. It works with 64bit, though. The equivalent for
>>32bit would be cntlzw.
>>
>>from ppc_intrinsics.h:
>>
>>/*
>> * __cntlzw - Count Leading Zeros Word
>> * __cntlzd - Count Leading Zeros Double Word
>> */
>>
>>#define __cntlzw(a)     __builtin_clz(a)
>>#define __cntlzd(a)     __builtin_clzll(a)
>>
>>
>>I agree that asm code is bad for portability; I don't like it either. Crafty
>>falls back to plain C code for FirstOne() and LastOne() if neither inline_asm
>>nor inline_ppc is used. On the G4 I found the speedup from using cntlzw instead
>>to be negligible (maybe 5%). Surprisingly, this seems to be different for 64bit.
>>Crafty seems to rely heavily on these bit-counting functions, but that may be
>>different for other engines, which may not profit much from the asm code.
>>Additionally, the C lookup-table approach that crafty uses in place of the asm
>>functions might not be the fastest for ppc. There was a thread some time ago
>>where different algorithms were compared (magic bitscan, table, gerd, eugene,
>>etc.). I still have the source for them lying around somewhere.
>>
>>regards
>>Andy
>
>It's more the fact that gcc3.3 generates such bad code calling the non-inline
>versions.
>BTW, gcc4.0.1 and OSX 10.4.3 give a slight speedup in crafty 19.9, perhaps
>because -fast becomes usable.  If you can't use it for your sources, then at
>least put in some of the alignment options -- they can make a big difference.
>Also, you should become familiar with Shark, Apple's free profiling tool,
>which can show you problems in your code.  Shark is part of the CHUD package and
>there's a pretty good tutorial for it on the main developer pages.
>
>Frank

I used gcc version 4.0.1 (Apple Computer, Inc. build 5247) for the speedup
comparisons.
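
On the question of getting this from plain C: since the header quoted above just
maps __cntlzd onto __builtin_clzll, a rough sketch of a wrapper with a portable
fallback could look like the following. The names leading_zeros64,
leading_zeros64_portable and leading_zeros64_selfcheck are made up for
illustration and are not taken from crafty or from the header. It assumes GCC
(or a compiler that defines __GNUC__ and provides the builtin) and a 64-bit
unsigned long long.

```c
#include <assert.h>
#include <stdint.h>

/* Portable fallback (hypothetical name): shift left until the top bit is set. */
static int leading_zeros64_portable(uint64_t x)
{
    int n = 0;
    if (x == 0)
        return 64;
    while ((x & 0x8000000000000000ULL) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Count leading zeros of a 64-bit word (hypothetical name).
 * With GCC the builtin is used, which the compiler can lower to a single
 * instruction such as cntlzd when targeting 64-bit PowerPC.
 * The builtin is undefined for 0, so that case is handled first. */
static int leading_zeros64(uint64_t x)
{
    if (x == 0)
        return 64;
#if defined(__GNUC__)
    return __builtin_clzll(x);
#else
    return leading_zeros64_portable(x);
#endif
}

/* Sanity check: both versions must agree on every single-bit input. */
static void leading_zeros64_selfcheck(void)
{
    int i;
    for (i = 0; i < 64; i++) {
        uint64_t bit = (uint64_t)1 << i;
        assert(leading_zeros64(bit) == 63 - i);
        assert(leading_zeros64(bit) == leading_zeros64_portable(bit));
    }
    assert(leading_zeros64(0) == 64);
}
```

Calling leading_zeros64_selfcheck() once at startup should catch a mismatch
between the two paths. Whether the builtin version gets anywhere near the inline
asm speed depends on the compiler and flags, so it would need to be measured.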

regards
Andy
