Author: Gerd Isenberg
Date: 01:51:06 04/18/03
On April 18, 2003 at 03:17:52, Gerd Isenberg wrote:

>On April 17, 2003 at 17:03:00, Anthony Cozzie wrote:
>
>>I ask because
>>
>>1) mmx instructions have a 2-cycle latency on the athlon,
>>2) getting the data over to the MMX pipe and back takes at *least* 6 cycles
>>3) this is not really a 64 bit operation
>>4) I tried AMD's 'optimized' version and it turned out to be much slower than the
>>simple C hack
>>
>>I'd be very interested in any performance numbers you have.
>>
>>anthony
>
>Hi anthony,
>
>Yes, that may be true for single popcounts. I use this single one very rarely,
>but a lot of non-inlined parallel versions to count the bits of up to four
>bitboards simultaneously (e.g. counting center/king area weighted attacks), where
>all 8 mmx registers are used and the instructions are scheduled in a proper way
>to break dependency chains. That gains something
>(also only one final, dead slow vector path movd eax, mm0).
>
>For bitboards with low population probability I often use some inlines like
>isBitCountGreaterOne or an assembler loop version below.
>
>Of course, general purpose register instructions are faster on x86-32, but if
>you don't have enough of these registers ;-)
>
>Gerd
>
>

I compared the single mmx routine with the C routine below, and the inlined mmx one seems to be faster in IsiChess (~1% on Athlon XP2.1+), at least in some test positions I tried. That may be due to code size and cache effects, or to the general lack of registers.
Gerd

__forceinline int BitCount (BitBoard bb)
{
#ifdef USE_C_BITCOUNT
    unsigned int l = LOWBOARD(bb);
    unsigned int h = HIGHBOARD(bb);
    l -= ((l >> 1) & 0x55555555);                       // 2-bit counts
    h -= ((h >> 1) & 0x55555555);
    l = (((l >> 2) & 0x33333333) + (l & 0x33333333));   // 4-bit counts
    h = (((h >> 2) & 0x33333333) + (h & 0x33333333));
    l = (((l >> 4) + l) & 0x0f0f0f0f);                  // byte counts
    h = (((h >> 4) + h) & 0x0f0f0f0f);
    l += (l >> 8);                                      // add up the four bytes
    h += (h >> 8);
    l += (l >> 16);
    h += (h >> 16);
    return (l & 0x0000003f) + (h & 0x0000003f);
#else
    __asm
    {
        movd        mm0, word ptr bb        // low dword of bb
        punpckldq   mm0, word ptr bb + 4    // high dword into the upper half
        lea         eax, [BitCountConsts]
        movq        mm1, mm0
        psrld       mm0, 1
        pand        mm0, [eax].C55
        psubd       mm1, mm0                // 2-bit counts
        movq        mm0, mm1
        psrld       mm1, 2
        pand        mm0, [eax].C33
        pand        mm1, [eax].C33
        paddd       mm0, mm1                // 4-bit counts
        movq        mm1, mm0
        psrld       mm0, 4
        paddd       mm0, mm1
        pand        mm0, [eax].C0F          // byte counts
        pxor        mm1, mm1
        psadbw      (mm0,mm1)               // sum of all eight bytes
        movd        eax, mm0                // result is returned in eax
    }
#endif
}
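For anyone who wants to try the C branch outside the engine, here is a minimal portable sketch. It is an assumption-laden stand-in, not the IsiChess code: BitBoard is taken to be a plain uint64_t, LOWBOARD/HIGHBOARD are replaced by a shift and a cast, and the names count32, popcount_swar, popcount_sparse, more_than_one and self_check are made up for illustration. The loop version and the "more than one bit" test are only one plausible reading of the sparse-bitboard helpers mentioned in the quoted post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t BitBoard;  /* assumption: a bitboard is a plain 64-bit integer */

/* Same reduction steps as the C branch of BitCount, for one 32-bit half. */
static unsigned count32(uint32_t x)
{
    x -= (x >> 1) & 0x55555555u;                        /* 2-bit counts */
    x  = ((x >> 2) & 0x33333333u) + (x & 0x33333333u);  /* 4-bit counts */
    x  = ((x >> 4) + x) & 0x0f0f0f0fu;                  /* byte counts  */
    x += x >> 8;                                        /* add up the four bytes */
    x += x >> 16;
    return x & 0x3fu;
}

/* Count both halves separately and add, as the posted routine does. */
static unsigned popcount_swar(BitBoard bb)
{
    return count32((uint32_t)bb) + count32((uint32_t)(bb >> 32));
}

/* Loop version: one iteration per set bit, so cheap when few bits are set. */
static unsigned popcount_sparse(BitBoard bb)
{
    unsigned n = 0;
    while (bb) {
        bb &= bb - 1;   /* clear the lowest set bit */
        ++n;
    }
    return n;
}

/* Non-zero if at least two bits are set (hypothetical isBitCountGreaterOne). */
static int more_than_one(BitBoard bb)
{
    return (bb & (bb - 1)) != 0;
}

/* Cross-check the three helpers against each other on a few samples. */
static void self_check(void)
{
    static const BitBoard samples[] = {
        0, 1, 0x8000000000000000ull, 0x00000000FFFFFFFFull,
        0x0123456789ABCDEFull, 0xFFFFFFFFFFFFFFFFull
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        BitBoard bb = samples[i];
        assert(popcount_swar(bb) == popcount_sparse(bb));
        assert(more_than_one(bb) == (popcount_sparse(bb) > 1));
    }
}
```

Calling self_check() once at startup is enough to catch a mistyped mask. Which variant is actually faster depends on the CPU and on how many bits are typically set, so it has to be measured in the engine itself.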
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.