Author: Gerd Isenberg
Date: 03:50:53 04/18/03
On April 18, 2003 at 05:16:04, Tim Foden wrote:

>On April 18, 2003 at 04:51:06, Gerd Isenberg wrote:
>
>>On April 18, 2003 at 03:17:52, Gerd Isenberg wrote:
>>
>>>On April 17, 2003 at 17:03:00, Anthony Cozzie wrote:
>>>
>>>>I ask because
>>>>
>>>>1) mmx instructions have a 2-cycle latency on the Athlon,
>>>>2) getting the data over to the MMX pipe and back takes at *least* 6 cycles,
>>>>3) this is not really a 64-bit operation,
>>>>4) I tried AMD's 'optimized' version and it turned out to be much slower
>>>>than the simple C hack.
>>>>
>>>>I'd be very interested in any performance numbers you have.
>>>>
>>>>anthony
>>>
>>>Hi anthony,
>>>
>>>Yes, that may be true for single popcounts. I use this single one very
>>>rarely, but a lot of non-inlined parallel versions to count the bits of up
>>>to four bitboards simultaneously (e.g. counting center/king-area weighted
>>>attacks), where all 8 mmx registers are used and the instructions are
>>>scheduled in a proper way to break dependency chains. That gains something
>>>(also only one final, dead slow vector-path movd eax, mm0).
>>>
>>>For bitboards with a low population probability I often use some inlines
>>>like isBitCountGreaterOne, or an assembler loop version below.
>>>
>>>Of course, general purpose register instructions are faster on x86-32, but
>>>if you don't have enough of these registers ;-)
>>>
>>>Gerd
>>
>>I compared the single mmx routine with the C routine below - and the inlined
>>mmx one seems to be faster in IsiChess (~1% on an Athlon XP2.1+), at least in
>>some test positions I tried. Maybe due to code size and cache effects, or the
>>general lack of registers.
>>
>>Gerd
>>
>>__forceinline
>>int BitCount (BitBoard bb)
>>{
>>#ifdef USE_C_BITCOUNT
>>   unsigned int l = LOWBOARD(bb);
>>   unsigned int h = HIGHBOARD(bb);
>>   l -= ((l >> 1) & 0x55555555);
>>   h -= ((h >> 1) & 0x55555555);
>>   l = (((l >> 2) & 0x33333333) + (l & 0x33333333));
>>   h = (((h >> 2) & 0x33333333) + (h & 0x33333333));
>>   l = (((l >> 4) + l) & 0x0f0f0f0f);
>>   h = (((h >> 4) + h) & 0x0f0f0f0f);
>>   l += (l >> 8);
>>   h += (h >> 8);
>>   l += (l >> 16);
>>   h += (h >> 16);
>>   return (l & 0x0000003f) + (h & 0x0000003f);
>
>Hey Gerd,
>
>Ever tried changing the above to something like this?
>
>>   unsigned int l = LOWBOARD(bb);
>>   unsigned int h = HIGHBOARD(bb);
>>   l -= ((l >> 1) & 0x55555555);
>>   h -= ((h >> 1) & 0x55555555);
>>   l = (((l >> 2) & 0x33333333) + (l & 0x33333333));
>>   h = (((h >> 2) & 0x33333333) + (h & 0x33333333));
>>   l = (((l >> 4) + l) & 0x0f0f0f0f) + (((h >> 4) + h) & 0x0f0f0f0f);
>>   l += (l >> 8);
>>   l += (l >> 16);
>>   return (l & 0x0000007f);
>
>Or even this?
>
>>   unsigned int l = LOWBOARD(bb);
>>   unsigned int h = HIGHBOARD(bb);
>>   l -= ((l >> 1) & 0x55555555);
>>   h -= ((h >> 1) & 0x55555555);
>>   l = (((l >> 2) & 0x33333333) + (l & 0x33333333)) +
>>       (((h >> 2) & 0x33333333) + (h & 0x33333333));
>>   l = (((l >> 4) & 0x0f0f0f0f) + (l & 0x0f0f0f0f));
>>   l += (l >> 8);
>>   l += (l >> 16);
>>   return (l & 0x0000007f);
>
>Take care... I just edited them... I haven't tested the changes at all, so
>they may have bugs in them... but I guess you see what I'm getting at.
>
>Cheers, Tim.

Hi Tim,

Yes, one may save some instructions in some clever way - but one may also
introduce a few more dependencies. As I already mentioned, I'm rather happy
with the mmx stuff, especially with the parallelized versions that count
multiple bitboards. So I don't want to put too much energy into this popcount
stuff, for lack of time - and x86-64 is near the horizon.
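Two illustrative sketches of the idioms mentioned above - untested, and not the
actual IsiChess routines, just the usual shapes of such code. isBitCountGreaterOne
is assumed to be the standard "clear the lowest set bit" test, and BitCount64 is
Tim's folding carried out in a single 64-bit register, the way it could look once
64-bit GPRs are available:

// Untested sketches, not the IsiChess routines themselves.

// More than one bit set? Clearing the lowest set bit leaves something
// behind exactly when the population count is at least two.
__forceinline
int isBitCountGreaterOne (BitBoard bb)
{
   return (bb & (bb - 1)) != 0;
}

// The same folding scheme as the 32-bit routine above, but in one
// 64-bit register (assumes BitBoard is an unsigned 64-bit type).
__forceinline
int BitCount64 (BitBoard bb)
{
   bb -=  (bb >> 1) & 0x5555555555555555;    // 2-bit fields hold bit counts
   bb  = ((bb >> 2) & 0x3333333333333333) + (bb & 0x3333333333333333);
   bb  = ((bb >> 4) + bb) & 0x0f0f0f0f0f0f0f0f;
   bb += bb >> 8;                            // sum the eight byte counts
   bb += bb >> 16;
   bb += bb >> 32;
   return (int)(bb & 0x7f);
}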
I guess using 64-bit general purpose registers is unbeatable then ;-)

A sample routine to count attack squares with the following weighting factors:

1 1 1 1 1 1 1 1
1 2 2 2 2 2 2 1
1 2 3 3 3 3 2 1
1 2 3 4 4 3 2 1
1 2 3 4 4 3 2 1
1 2 3 3 3 3 2 1
1 2 2 2 2 2 2 1
1 1 1 1 1 1 1 1

I count the center / extended-center squares in one mmx register and pass an
additional bitboard, e.g. with the neighbouring king squares or the stop
squares of passers.

static int WeightedBitCount (BitBoard weightbb, BitBoard additional)
{
  __asm
  {
    movd      mm0, dword ptr weightbb      ; load low dword of weightbb
    punpckldq mm0, dword ptr weightbb + 4  ; combine with high dword
    movd      mm6, dword ptr additional
    punpckldq mm6, dword ptr additional + 4
    lea       eax, [BitCountConsts]
    movq      mm2, mm0                     ; bb
    movq      mm4, mm0                     ; bb
    movq      mm7, mm6
    psrlq     mm2, 16                      ; shift extended center to low board
    pand      mm4, [eax].NOTBOARDER
    punpckldq mm2, mm2                     ; mirror it to high board
    movq      mm1, mm0                     ; bb
    pand      mm2, [eax].CENTER_MASK       ; mask mirrored center as well as extended center
    movq      mm5, mm4
    psrld     mm0, 1
    movq      mm3, mm2
    psrld     mm2, 1
    psrld     mm4, 1
    pand      mm0, [eax].C55
    psrld     mm6, 1
    pand      mm2, [eax].C55
    pand      mm4, [eax].C55
    psubd     mm1, mm0                     ; 2-bit fields hold bit counts
    pand      mm6, [eax].C55
    psubd     mm3, mm2
    psubd     mm5, mm4
    movq      mm0, mm1
    psubd     mm7, mm6
    movq      mm2, mm3
    movq      mm4, mm5
    psrld     mm1, 2
    movq      mm6, mm7
    psrld     mm3, 2
    psrld     mm5, 2
    pand      mm0, [eax].C33
    pand      mm1, [eax].C33
    psrld     mm7, 2
    pand      mm2, [eax].C33
    pand      mm3, [eax].C33
    pand      mm4, [eax].C33
    pand      mm5, [eax].C33
    paddd     mm0, mm1                     ; 4-bit fields
    pand      mm6, [eax].C33
    pand      mm7, [eax].C33
    paddd     mm2, mm3
    paddd     mm4, mm5
    movq      mm1, mm0
    paddd     mm6, mm7
    movq      mm3, mm2
    movq      mm5, mm4
    psrld     mm0, 4
    movq      mm7, mm6
    psrld     mm2, 4
    psrld     mm4, 4
    paddd     mm0, mm1
    psrld     mm6, 4
    paddd     mm2, mm3
    paddd     mm4, mm5
    pxor      mm1, mm1
    pand      mm0, [eax].C0F               ; byte fields hold counts
    paddd     mm6, mm7
    pand      mm2, [eax].C0F
    pand      mm4, [eax].C0F
    psadbw    mm0, mm1                     ; sum the eight byte counts
    pand      mm6, [eax].C0F
    psadbw    mm2, mm1
    psadbw    mm4, mm1
    paddd     mm0, mm2
    psadbw    mm6, mm1
    paddd     mm0, mm4
    paddd     mm0, mm6
    movd      eax, mm0                     ; weighted count returned in eax
  }
}

const SBitCountConsts CACHE_ALIGN BitCountConsts =
{
  0x3333333333333333, // C33
  0x5555555555555555, // C55
  0x0f0f0f0f0f0f0f0f, // C0F
  0x007e7e7e7e7e7e00, // NOTBOARDER
  0x3c3c3c3c00181800  // CENTER_MASK
                      // (shifted and high/low mirrored extended center and center mask)
};

Cheers, Gerd
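For comparison, the weighting above can also be expressed with plain popcounts:
each square's weight 1..4 is the number of nested rings it lies in, so four
overlapping masks counted once each give the same total. An untested C sketch
follows - the ring masks are inferred from the weight table and from the
NOTBOARDER/CENTER_MASK constants, and BitCount is the routine quoted earlier in
the thread, so none of this is the actual IsiChess code:

// Untested sketch of a portable equivalent, not the IsiChess code.
// Mask values inferred from the weight table above.
static const BitBoard INNER_6X6  = 0x007e7e7e7e7e7e00; // b2..g7 (NOTBOARDER)
static const BitBoard EXT_CENTER = 0x00003c3c3c3c0000; // c3..f6
static const BitBoard CENTER     = 0x0000001818000000; // d4, e4, d5, e5

static int WeightedBitCountC (BitBoard weightbb, BitBoard additional)
{
   return BitCount(weightbb)               // every square counts 1
        + BitCount(weightbb & INNER_6X6)   // +1 inside the 6x6 ring
        + BitCount(weightbb & EXT_CENTER)  // +1 inside the extended center
        + BitCount(weightbb & CENTER)      // +1 inside the center
        + BitCount(additional);            // king area / stop squares, weight 1
}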