Author: Gerd Isenberg
Date: 03:50:53 04/18/03
On April 18, 2003 at 05:16:04, Tim Foden wrote:
>On April 18, 2003 at 04:51:06, Gerd Isenberg wrote:
>
>>On April 18, 2003 at 03:17:52, Gerd Isenberg wrote:
>>
>>>On April 17, 2003 at 17:03:00, Anthony Cozzie wrote:
>>>
>>>>I ask because
>>>>
>>>>1) mmx instructions have a 2-cycle latency on the athlon,
>>>>2) getting the data over to the MMX pipe and back takes at *least* 6 cycles
>>>>3) this is not really a 64 bit operation
>>>>4) I tried AMDs 'optimized' version and it turned out to be much slower than the
>>>>simple C hack
>>>>
>>>>I'd be very interested in any performance numbers you have.
>>>>
>>>>anthony
>>>
>>>Hi anthony,
>>>
>>>Yes, that may be true for single popcounts. I use this single one very rarely,
>>>but a lot of non-inlined parallel versions to count the bits of up to four
>>>bitboards simultaneously (e.g. counting center/king area weighted attacks), where
>>>all eight mmx registers are used and the instructions are scheduled properly
>>>to break dependency chains. That gains something
>>>(and there is only one final, dead slow vector-path movd eax, mm0).
>>>
>>>For bitboards with a low population probability I often use some inlines like
>>>isBitCountGreaterOne or an assembler loop version below.
>>>
>>>Of course, general purpose register instructions are faster on x86-32, but if
>>>you don't have enough of these registers ;-)
>>>
>>>Gerd
>>>
>>>
>>
>>I compared the single mmx routine with the C routine below - and the inlined mmx
>>one seems to be faster in IsiChess (~1% on an Athlon XP2.1+), at least in some
>>test positions I tried. Maybe due to code size and cache effects, or the general
>>lack of registers.
>>
>>Gerd
>>
>>
>>__forceinline
>>int BitCount (BitBoard bb)
>>{
>>#ifdef USE_C_BITCOUNT
>> unsigned int l = LOWBOARD(bb);
>> unsigned int h = HIGHBOARD(bb);
>> l -= ((l >> 1) & 0x55555555);
>> h -= ((h >> 1) & 0x55555555);
>> l = (((l >> 2) & 0x33333333) + (l & 0x33333333));
>> h = (((h >> 2) & 0x33333333) + (h & 0x33333333));
>> l = (((l >> 4) + l) & 0x0f0f0f0f);
>> h = (((h >> 4) + h) & 0x0f0f0f0f);
>> l += (l >> 8);
>> h += (h >> 8);
>> l += (l >> 16);
>> h += (h >> 16);
>> return (l & 0x0000003f) + (h & 0x0000003f);
>
>Hey Gerd,
>
>Ever tried changing the above to something like this?
>
>> unsigned int l = LOWBOARD(bb);
>> unsigned int h = HIGHBOARD(bb);
>> l -= ((l >> 1) & 0x55555555);
>> h -= ((h >> 1) & 0x55555555);
>> l = (((l >> 2) & 0x33333333) + (l & 0x33333333));
>> h = (((h >> 2) & 0x33333333) + (h & 0x33333333));
>> l = (((l >> 4) + l) & 0x0f0f0f0f) + (((h >> 4) + h) & 0x0f0f0f0f);
>> l += (l >> 8);
>> l += (l >> 16);
>> return (l & 0x0000007f);
>
>Or even this?
>
>> unsigned int l = LOWBOARD(bb);
>> unsigned int h = HIGHBOARD(bb);
>> l -= ((l >> 1) & 0x55555555);
>> h -= ((h >> 1) & 0x55555555);
>> l = (((l >> 2) & 0x33333333) + (l & 0x33333333)) +
>> (((h >> 2) & 0x33333333) + (h & 0x33333333));
>> l = (((l >> 4) & 0x0f0f0f0f) + (l & 0x0f0f0f0f));
>> l += (l >> 8);
>> l += (l >> 16);
>> return (l & 0x0000007f);
>
>Take care... I just edited them... I haven't tested the changes at all, so they
>may have bugs in them... but I guess you see what I'm getting at.
>
>Cheers, Tim.
Hi Tim,
yes, one may save some instructions in a clever way - but one may also introduce a
few more dependencies. As I already mentioned, I'm rather happy with the
mmx stuff, especially with the parallelized versions that count multiple bitboards.
So I don't want to put too much energy into this popcount stuff, due to lack of
time - and x86-64 is near the horizon. I guess using 64-bit general purpose
registers will be unbeatable then ;-)
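For illustration, here is a rough sketch of both ideas - the sparse-board inline
mentioned above and a 64-bit SWAR popcount for the x86-64 general purpose
registers. It is untested and assumes BitBoard is an unsigned 64-bit integer type;
BitCount64 is just a placeholder name.

__forceinline
int isBitCountGreaterOne (BitBoard bb)
{
   // more than one bit set <=> clearing the lowest set bit leaves a nonzero rest
   return (bb & (bb - 1)) != 0;
}

__forceinline
int BitCount64 (BitBoard bb)
{
   // same SWAR folding as the 32-bit C routine quoted above, on one 64-bit register
   bb -=  (bb >> 1) & 0x5555555555555555;
   bb  = ((bb >> 2) & 0x3333333333333333) + (bb & 0x3333333333333333);
   bb  = ((bb >> 4) + bb) & 0x0f0f0f0f0f0f0f0f;
   bb +=  bb >> 8;
   bb +=  bb >> 16;
   bb +=  bb >> 32;
   return (int)(bb & 0x7f);
}

Everything stays in one general purpose register there, which is why it should beat
both the split 32-bit version and the mmx round trip.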
A sample routine to count attack squares with the following weighting factors:
1 1 1 1 1 1 1 1
1 2 2 2 2 2 2 1
1 2 3 3 3 3 2 1
1 2 3 4 4 3 2 1
1 2 3 4 4 3 2 1
1 2 3 3 3 3 2 1
1 2 2 2 2 2 2 1
1 1 1 1 1 1 1 1
I count the center / extended center squares in one mmx register and pass an
additional bitboard, e.g. with the king's neighbouring squares or the stop squares
of passers.
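In plain C the weighting boils down to four overlapping masks plus the extra board,
roughly like this sketch (for comparison only, untested; popCount, EXTENDED_CENTER
and CENTER are placeholder names here, NOTBOARDER is the mask from the constants
further down):

static int WeightedBitCountC (BitBoard weightbb, BitBoard additional)
{
   return popCount(weightbb)                    // every square counts once
        + popCount(weightbb & NOTBOARDER)       // inner 6x6 squares: +1
        + popCount(weightbb & EXTENDED_CENTER)  // c3..f6: +1
        + popCount(weightbb & CENTER)           // d4,e4,d5,e5: +1
        + popCount(additional);                 // additional bitboard: +1 per square
}

The mmx routine below computes exactly this sum; the center and extended center
terms are packed into the two dwords of one register via the shift/mirror trick, so
all the running counts fit into the eight mmx registers, two registers per count.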
static
int WeightedBitCount (BitBoard weightbb, BitBoard additional)
{
__asm
{
movd mm0, dword ptr weightbb
punpckldq mm0, dword ptr weightbb + 4
movd mm6, dword ptr additional
punpckldq mm6, dword ptr additional + 4
lea eax, [BitCountConsts]
movq mm2,mm0 ; bb
movq mm4,mm0 ; bb
movq mm7,mm6
psrlq mm2,16 ; shift extended center to low board
pand mm4, [eax].NOTBOARDER
punpckldq mm2,mm2 ; mirror it to high board
movq mm1,mm0 ; bb
pand mm2, [eax].CENTER_MASK
; mask mirrored center as well as extended center
movq mm5,mm4
psrld mm0,1
movq mm3,mm2
psrld mm2,1
psrld mm4,1
pand mm0, [eax].C55
psrld mm6,1
pand mm2, [eax].C55
pand mm4, [eax].C55
psubd mm1,mm0
pand mm6, [eax].C55
psubd mm3,mm2
psubd mm5,mm4
movq mm0,mm1
psubd mm7,mm6
movq mm2,mm3
movq mm4,mm5
psrld mm1,2
movq mm6,mm7
psrld mm3,2
psrld mm5,2
pand mm0,[eax].C33
pand mm1,[eax].C33
psrld mm7,2
pand mm2,[eax].C33
pand mm3,[eax].C33
pand mm4,[eax].C33
pand mm5,[eax].C33
paddd mm0,mm1
pand mm6,[eax].C33
pand mm7,[eax].C33
paddd mm2,mm3
paddd mm4,mm5
movq mm1,mm0
paddd mm6,mm7
movq mm3,mm2
movq mm5,mm4
psrld mm0,4
movq mm7,mm6
psrld mm2,4
psrld mm4,4
paddd mm0,mm1
psrld mm6,4
paddd mm2,mm3
paddd mm4,mm5
pxor mm1,mm1
pand mm0,[eax].C0F
paddd mm6,mm7
pand mm2,[eax].C0F
pand mm4,[eax].C0F
psadbw mm0,mm1 ; mm1 = 0 -> horizontal sum of the byte counts
pand mm6,[eax].C0F
psadbw mm2,mm1
psadbw mm4,mm1
paddd mm0,mm2
psadbw mm6,mm1
paddd mm0,mm4
paddd mm0,mm6
movd eax,mm0
}
}
const SBitCountConsts CACHE_ALIGN BitCountConsts =
{
0x3333333333333333, // C33
0x5555555555555555, // C55
0x0f0f0f0f0f0f0f0f, // C0F
0x007e7e7e7e7e7e00, // NOTBOARDER
0x3c3c3c3c00181800, // CENTER_MASK
// (center mask in the low dword, extended center mask in the high dword,
//  both applied to the board shifted right by 16 and mirrored into both dwords)
};
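The SBitCountConsts struct itself isn't shown; from the field order of the
initializer and the names used in the asm, it would look roughly like this sketch
(assuming BitBoard is a 64-bit type and CACHE_ALIGN the usual alignment macro):

typedef struct
{
   BitBoard C33;         // 0x3333... two-bit sum mask
   BitBoard C55;         // 0x5555... odd-bit mask
   BitBoard C0F;         // 0x0f0f... nibble sum mask
   BitBoard NOTBOARDER;  // inner 6x6 squares (weight >= 2)
   BitBoard CENTER_MASK; // packed center / extended center masks
} SBitCountConsts;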
Cheers,
Gerd