Author: Tim Foden
Date: 02:16:04 04/18/03
On April 18, 2003 at 04:51:06, Gerd Isenberg wrote:

>On April 18, 2003 at 03:17:52, Gerd Isenberg wrote:
>
>>On April 17, 2003 at 17:03:00, Anthony Cozzie wrote:
>>
>>>I ask because
>>>
>>>1) mmx instructions have a 2-cycle latency on the athlon,
>>>2) getting the data over to the MMX pipe and back takes at *least* 6 cycles
>>>3) this is not really a 64 bit operation
>>>4) I tried AMD's 'optimized' version and it turned out to be much slower than the
>>>simple C hack
>>>
>>>I'd be very interested in any performance numbers you have.
>>>
>>>anthony
>>
>>Hi anthony,
>>
>>Yes, that may be true for single popcounts. I use this single one very rarely
>>but a lot of non-inlined parallel versions to count the bits of up to four
>>bitboards simultaneously (e.g. counting center/king area weighted attacks), where
>>all 8 mmx registers are used and the instructions are scheduled in a proper way,
>>to break dependency chains. That gains something
>>(also only one final, dead slow vector path movd eax, mm0).
>>
>>For bitboards with low population probability I often use some inlines like
>>isBitCountGreaterOne or an assembler loop version below.
>>
>>Of course, general purpose register instructions are faster on x86-32, but if
>>you don't have enough of these registers ;-)
>>
>>Gerd
>>
>>
>
>I compared the single mmx-routine with the C-routine below - and the inlined mmx
>one seems to be faster in IsiChess (~1% Athlon XP2.1+), at least in some
>test positions I tried. Maybe due to code size and cache effects or the general
>lack of registers.
>
>Gerd
>
>
>__forceinline
>int BitCount (BitBoard bb)
>{
>#ifdef USE_C_BITCOUNT
>	unsigned int l = LOWBOARD(bb);
>	unsigned int h = HIGHBOARD(bb);
>	l -= ((l >> 1) & 0x55555555);
>	h -= ((h >> 1) & 0x55555555);
>	l = (((l >> 2) & 0x33333333) + (l & 0x33333333));
>	h = (((h >> 2) & 0x33333333) + (h & 0x33333333));
>	l = (((l >> 4) + l) & 0x0f0f0f0f);
>	h = (((h >> 4) + h) & 0x0f0f0f0f);
>	l += (l >> 8);
>	h += (h >> 8);
>	l += (l >> 16);
>	h += (h >> 16);
>	return(l & 0x0000003f) + (h & 0x0000003f);

Hey Gerd,

Ever tried changing the above to something like this?

>	unsigned int l = LOWBOARD(bb);
>	unsigned int h = HIGHBOARD(bb);
>	l -= ((l >> 1) & 0x55555555);
>	h -= ((h >> 1) & 0x55555555);
>	l = (((l >> 2) & 0x33333333) + (l & 0x33333333));
>	h = (((h >> 2) & 0x33333333) + (h & 0x33333333));
>	l = (((l >> 4) + l) & 0x0f0f0f0f) + (((h >> 4) + h) & 0x0f0f0f0f);
>	l += (l >> 8);
>	l += (l >> 16);
>	return (l & 0x0000007f);

Or even this?

>	unsigned int l = LOWBOARD(bb);
>	unsigned int h = HIGHBOARD(bb);
>	l -= ((l >> 1) & 0x55555555);
>	h -= ((h >> 1) & 0x55555555);
>	l = (((l >> 2) & 0x33333333) + (l & 0x33333333)) +
>	    (((h >> 2) & 0x33333333) + (h & 0x33333333));
>	l = (((l >> 4) & 0x0f0f0f0f) + (l & 0x0f0f0f0f));
>	l += (l >> 8);
>	l += (l >> 16);
>	return (l & 0x0000007f);

Take care... I just edited them... I haven't tested the changes at all, so they
may have bugs in... but I guess you see what I'm getting at.

Cheers, Tim.
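
Since the two merged variants above are posted untested, one way to check them is to compare each against a naive bit-clearing loop. The sketch below is only an illustration under assumptions: it guesses that BitBoard is a 64-bit unsigned integer and that LOWBOARD/HIGHBOARD simply select the 32-bit halves (their real definitions are not shown in the thread), and the function names bitcount_naive, bitcount_merge_bytes, bitcount_merge_nibbles and variants_agree are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a bitboard is a plain 64-bit unsigned integer. */
typedef uint64_t BitBoard;

/* Assumption: these macros just pick out the two 32-bit halves. */
#define LOWBOARD(bb)  ((uint32_t)((bb) & 0xffffffffu))
#define HIGHBOARD(bb) ((uint32_t)((bb) >> 32))

/* Reference count: clear the lowest set bit until nothing is left. */
static int bitcount_naive(BitBoard bb)
{
    int n = 0;
    while (bb) {
        bb &= bb - 1;
        n++;
    }
    return n;
}

/* First variant: the halves are merged once each byte holds a count 0..8,
   so every byte of the sum holds at most 16. */
static int bitcount_merge_bytes(BitBoard bb)
{
    uint32_t l = LOWBOARD(bb);
    uint32_t h = HIGHBOARD(bb);
    l -= ((l >> 1) & 0x55555555);
    h -= ((h >> 1) & 0x55555555);
    l = (((l >> 2) & 0x33333333) + (l & 0x33333333));
    h = (((h >> 2) & 0x33333333) + (h & 0x33333333));
    l = (((l >> 4) + l) & 0x0f0f0f0f) + (((h >> 4) + h) & 0x0f0f0f0f);
    l += (l >> 8);
    l += (l >> 16);
    return (int)(l & 0x0000007f);
}

/* Second variant: the halves are merged one step earlier, while each nibble
   holds 0..4, so a merged nibble holds at most 8 and still fits in 4 bits.
   The next step masks before adding, so no carry between nibbles is lost. */
static int bitcount_merge_nibbles(BitBoard bb)
{
    uint32_t l = LOWBOARD(bb);
    uint32_t h = HIGHBOARD(bb);
    l -= ((l >> 1) & 0x55555555);
    h -= ((h >> 1) & 0x55555555);
    l = (((l >> 2) & 0x33333333) + (l & 0x33333333)) +
        (((h >> 2) & 0x33333333) + (h & 0x33333333));
    l = (((l >> 4) & 0x0f0f0f0f) + (l & 0x0f0f0f0f));
    l += (l >> 8);
    l += (l >> 16);
    return (int)(l & 0x0000007f);
}

/* Returns 1 if both variants match the naive count on the edge cases,
   every single-bit board, and a run of pseudo-random boards. */
static int variants_agree(void)
{
    BitBoard x = 0x0123456789abcdefULL;   /* xorshift64 seed, arbitrary */
    int i;

    if (bitcount_merge_bytes(0) != 0 || bitcount_merge_nibbles(0) != 0)
        return 0;
    if (bitcount_merge_bytes(~(BitBoard)0) != 64 ||
        bitcount_merge_nibbles(~(BitBoard)0) != 64)
        return 0;

    for (i = 0; i < 64; i++) {
        BitBoard one = (BitBoard)1 << i;
        if (bitcount_merge_bytes(one) != 1 || bitcount_merge_nibbles(one) != 1)
            return 0;
    }

    for (i = 0; i < 100000; i++) {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        if (bitcount_merge_bytes(x) != bitcount_naive(x) ||
            bitcount_merge_nibbles(x) != bitcount_naive(x))
            return 0;
    }
    return 1;
}
```

Calling variants_agree() from a small test driver and checking that it returns 1 gives some confidence in the arithmetic. It says nothing about speed, which would still have to be measured inside the engine.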