Author: Gerd Isenberg
Date: 00:17:52 04/18/03
On April 17, 2003 at 17:03:00, Anthony Cozzie wrote:

>I ask because
>
>1) mmx instructions have a 2-cycle latency on the athlon,
>2) getting the data over to the MMX pipe and back takes at *least* 6 cycles
>3) this is not really a 64 bit operation
>4) I tried AMDs 'optimized' version and it turned out to be much slower than the
>simple C hack
>
>I'd be very interested in any performance numbers you have.
>
>anthony

Hi anthony,

Yes, that may be true for single popcounts. I use the single version very rarely, but I do use a lot of non-inlined parallel versions that count the bits of up to four bitboards simultaneously (e.g. counting center/king area weighted attacks). In those, all eight MMX registers are used and the instructions are scheduled properly to break dependency chains. That gains something (also because there is only one final, dead slow vector path movd eax, mm0).

For bitboards with low population probability I often use inlines like isBitCountGreaterOne, or the assembler loop version below.

Of course, general purpose register instructions are faster on x86-32, but if you don't have enough of these registers ;-)

Gerd

static __forceinline int _fastcall BitCount8 (BitBoard bb)
{
    __asm
    {
        mov   ecx, dword ptr bb      ; low 32 bits of the bitboard
        xor   eax, eax               ; eax = bit count, also the return value
        test  ecx, ecx
        jz    l1
    l0: lea   edx, [ecx-1]
        inc   eax
        and   ecx, edx               ; clear the lowest set bit
        jnz   l0
    l1: mov   ecx, dword ptr bb+4    ; high 32 bits of the bitboard
        test  ecx, ecx
        jz    l3
    l2: lea   edx, [ecx-1]
        inc   eax
        and   ecx, edx               ; clear the lowest set bit
        jnz   l2
    l3:
    }
}
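
For readers who cannot build MSVC inline assembly, here is a minimal portable C sketch of the same idea as the assembler loop: clear the lowest set bit once per iteration, first in the low and then in the high 32-bit half. The names bit_count_sparse32, bit_count_sparse and is_bit_count_greater_one are hypothetical. The BitBoard typedef is an assumption (a 64-bit unsigned integer). The post names isBitCountGreaterOne but does not show its body, so the last function is only one plausible reading of it.

```c
#include <stdint.h>

/* Assumption: a bitboard is a 64-bit unsigned integer. */
typedef uint64_t BitBoard;

/* Count set bits in one 32-bit half by clearing the lowest set bit
   until the word is empty. One iteration per set bit, so it is cheap
   only when few bits are set. */
static int bit_count_sparse32(uint32_t w)
{
    int n = 0;
    while (w != 0) {
        w &= w - 1;   /* same effect as: lea edx,[ecx-1] / and ecx,edx */
        ++n;
    }
    return n;
}

/* Process the low half, then the high half, as the assembler loop does. */
static int bit_count_sparse(BitBoard bb)
{
    return bit_count_sparse32((uint32_t)bb)
         + bit_count_sparse32((uint32_t)(bb >> 32));
}

/* Hypothetical reading of isBitCountGreaterOne: after clearing the
   lowest set bit, anything left means at least two bits were set. */
static int is_bit_count_greater_one(BitBoard bb)
{
    return (bb & (bb - 1)) != 0;
}
```

The loop runs once per set bit, which is why it suits bitboards with low population probability; for dense bitboards a fixed-cost popcount would likely be the better choice.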
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.