Computer Chess Club Archives


Subject: Re: is this really faster?

Author: Tim Foden
Date: 02:16:04 04/18/03
On April 18, 2003 at 04:51:06, Gerd Isenberg wrote:

>On April 18, 2003 at 03:17:52, Gerd Isenberg wrote:
>
>>On April 17, 2003 at 17:03:00, Anthony Cozzie wrote:
>>
>>>I ask because
>>>
>>>1) mmx instructions have a 2-cycle latency on the athlon,
>>>2) getting the data over to the MMX pipe and back takes at *least* 6 cycles
>>>3) this is not really a 64 bit operation
>>>4) I tried AMD's 'optimized' version and it turned out to be much slower than the
>>>simple C hack
>>>
>>>I'd be very interested in any performance numbers you have.
>>>
>>>anthony
>>
>>Hi anthony,
>>
>>Yes, that may be true for single popcounts. I use the single one very rarely,
>>but I use a lot of non-inlined parallel versions to count the bits of up to four
>>bitboards simultaneously (e.g. counting center/king area weighted attacks), where
>>all eight MMX registers are used and the instructions are scheduled properly
>>to break dependency chains. That gains something
>>(and needs only one final, dead slow vector path movd eax, mm0).
>>
>>For bitboards with low population probability I often use some inlines like
>>isBitCountGreaterOne or an assembler loop version below.
>>
>>Of course, general purpose register instructions are faster on x86-32, but if
>>you don't have enough of these registers ;-)
>>
>>Gerd
>>
>>
>
>I compared the single mmx-routine with the C-routine below - and the inlined mmx
>one seems to be faster in IsiChess (~1% Athlon XP2.1+), at least in some
>test positions I tried. Maybe due to code size and cache effects or the general
>lack of registers.
>
>Gerd
>
>
>__forceinline
>int BitCount (BitBoard bb)
>{
>#ifdef USE_C_BITCOUNT
>        unsigned int l = LOWBOARD(bb);
>        unsigned int h = HIGHBOARD(bb);
>        l -= ((l >> 1) & 0x55555555);
>        h -= ((h >> 1) & 0x55555555);
>        l = (((l >> 2) & 0x33333333) + (l & 0x33333333));
>        h = (((h >> 2) & 0x33333333) + (h & 0x33333333));
>        l = (((l >> 4) + l) & 0x0f0f0f0f);
>        h = (((h >> 4) + h) & 0x0f0f0f0f);
>        l += (l >> 8);
>        h += (h >> 8);
>        l += (l >> 16);
>        h += (h >> 16);
>        return(l & 0x0000003f) + (h & 0x0000003f);

Hey Gerd,

Ever tried changing the above to something like this?

>        unsigned int l = LOWBOARD(bb);
>        unsigned int h = HIGHBOARD(bb);
>        l -= ((l >> 1) & 0x55555555);
>        h -= ((h >> 1) & 0x55555555);
>        l = (((l >> 2) & 0x33333333) + (l & 0x33333333));
>        h = (((h >> 2) & 0x33333333) + (h & 0x33333333));
>        l = (((l >> 4) + l) & 0x0f0f0f0f) + (((h >> 4) + h) & 0x0f0f0f0f);
>        l += (l >> 8);
>        l += (l >> 16);
>        return (l & 0x0000007f);
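
The wider mask matters here: once the two halves are added together the total can reach 64, which needs seven bits, so 0x3f would no longer be enough. A minimal sketch of that edge case, assuming BitBoard is a 64-bit unsigned integer and LOWBOARD/HIGHBOARD simply select the two 32-bit halves (those definitions are not shown in the thread, so they are guesses); bitcount_merged and its mask parameter are illustrative names only:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t BitBoard;                      /* assumed definition */
#define LOWBOARD(bb)  ((uint32_t)(bb))          /* assumed: low 32 bits */
#define HIGHBOARD(bb) ((uint32_t)((bb) >> 32))  /* assumed: high 32 bits */

/* The merged variant above, with the final mask passed in so that
   both mask widths can be compared on the same input. */
static int bitcount_merged(BitBoard bb, uint32_t mask)
{
    uint32_t l = LOWBOARD(bb);
    uint32_t h = HIGHBOARD(bb);
    l -= (l >> 1) & 0x55555555;
    h -= (h >> 1) & 0x55555555;
    l = ((l >> 2) & 0x33333333) + (l & 0x33333333);
    h = ((h >> 2) & 0x33333333) + (h & 0x33333333);
    /* each byte holds 0..8 per half, so 0..16 after the merge */
    l = (((l >> 4) + l) & 0x0f0f0f0f) + (((h >> 4) + h) & 0x0f0f0f0f);
    l += l >> 8;
    l += l >> 16;                               /* low byte now holds 0..64 */
    return (int)(l & mask);
}
```

With all 64 bits set the low byte ends up as 0x40, so masking with 0x7f yields 64 while masking with 0x3f would yield 0.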

Or even this?

>        unsigned int l = LOWBOARD(bb);
>        unsigned int h = HIGHBOARD(bb);
>        l -= ((l >> 1) & 0x55555555);
>        h -= ((h >> 1) & 0x55555555);
>        l = (((l >> 2) & 0x33333333) + (l & 0x33333333)) +
>            (((h >> 2) & 0x33333333) + (h & 0x33333333));
>        l = (((l >> 4) & 0x0f0f0f0f) + (l & 0x0f0f0f0f));
>        l += (l >> 8);
>        l += (l >> 16);
>        return (l & 0x0000007f);

Take care... I just edited them... I haven't tested the changes at all, so they
may have bugs in... but I guess you see what I'm getting at.
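
Since the edits are untested, a small harness that compares all three variants against a plain loop is a cheap sanity check. This is only a sketch under assumptions: BitBoard is taken to be a 64-bit unsigned integer, LOWBOARD/HIGHBOARD are guessed to select the 32-bit halves, and every function name below (count_ref, count_split, count_merge8, count_merge4, count_mismatches) is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t BitBoard;                      /* assumed definition */
#define LOWBOARD(bb)  ((uint32_t)(bb))          /* assumed: low 32 bits */
#define HIGHBOARD(bb) ((uint32_t)((bb) >> 32))  /* assumed: high 32 bits */

/* Reference: clear the lowest set bit until nothing is left. */
static int count_ref(BitBoard bb)
{
    int n = 0;
    while (bb) { bb &= bb - 1; n++; }
    return n;
}

/* Original routine: both halves reduced separately. */
static int count_split(BitBoard bb)
{
    uint32_t l = LOWBOARD(bb), h = HIGHBOARD(bb);
    l -= (l >> 1) & 0x55555555;
    h -= (h >> 1) & 0x55555555;
    l = ((l >> 2) & 0x33333333) + (l & 0x33333333);
    h = ((h >> 2) & 0x33333333) + (h & 0x33333333);
    l = ((l >> 4) + l) & 0x0f0f0f0f;
    h = ((h >> 4) + h) & 0x0f0f0f0f;
    l += l >> 8;  h += h >> 8;
    l += l >> 16; h += h >> 16;
    return (int)((l & 0x3f) + (h & 0x3f));
}

/* First suggestion: merge the halves at the byte stage. */
static int count_merge8(BitBoard bb)
{
    uint32_t l = LOWBOARD(bb), h = HIGHBOARD(bb);
    l -= (l >> 1) & 0x55555555;
    h -= (h >> 1) & 0x55555555;
    l = ((l >> 2) & 0x33333333) + (l & 0x33333333);
    h = ((h >> 2) & 0x33333333) + (h & 0x33333333);
    l = (((l >> 4) + l) & 0x0f0f0f0f) + (((h >> 4) + h) & 0x0f0f0f0f);
    l += l >> 8;
    l += l >> 16;
    return (int)(l & 0x7f);
}

/* Second suggestion: merge one step earlier, at the nibble stage. */
static int count_merge4(BitBoard bb)
{
    uint32_t l = LOWBOARD(bb), h = HIGHBOARD(bb);
    l -= (l >> 1) & 0x55555555;
    h -= (h >> 1) & 0x55555555;
    l = (((l >> 2) & 0x33333333) + (l & 0x33333333)) +
        (((h >> 2) & 0x33333333) + (h & 0x33333333)); /* nibbles hold 0..8 */
    l = ((l >> 4) & 0x0f0f0f0f) + (l & 0x0f0f0f0f);   /* bytes hold 0..16 */
    l += l >> 8;
    l += l >> 16;
    return (int)(l & 0x7f);
}

/* 1 if any variant disagrees with the reference on this board. */
static unsigned differs(BitBoard bb)
{
    int want = count_ref(bb);
    return count_split(bb) != want || count_merge8(bb) != want ||
           count_merge4(bb) != want;
}

/* Number of disagreements over a few edge cases plus `rounds`
   pseudo-random boards (xorshift64), dense and sparse. */
static unsigned count_mismatches(unsigned rounds)
{
    static const BitBoard edge[] = {
        0, 1, ~(BitBoard)0, 0x8000000000000000ULL,
        0x00000000FFFFFFFFULL, 0xFFFFFFFF00000000ULL,
        0x5555555555555555ULL, 0xAAAAAAAAAAAAAAAAULL
    };
    BitBoard x = 0x9E3779B97F4A7C15ULL;  /* arbitrary nonzero seed */
    unsigned bad = 0, i;
    for (i = 0; i < sizeof edge / sizeof edge[0]; i++)
        bad += differs(edge[i]);
    for (i = 0; i < rounds; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        bad += differs(x);                          /* dense board */
        bad += differs(x & (x >> 1) & (x >> 2));    /* sparser board */
    }
    return bad;
}
```

Calling count_mismatches(100000) from a test main and checking that it returns 0 exercises both merged variants. It says nothing about speed, which is the actual question in this thread.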

Cheers, Tim.
Last modified: Thu, 15 Apr 21 08:11:13 -0700