Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: mmx-PopCount

Author: Dann Corbit

Date: 21:26:47 02/15/02

Go up one level in this thread


On February 15, 2002 at 13:15:03, Gerd Isenberg wrote:

>My first trials using mmx-registers were very frustrating.
>And i simply missed some opcodes, like shifting bytewise.
>
>So the one and only piece of mmx-code in my chess program was the well known
>mmx-PopCount from AMD.
>Currently i'm thinking about a "mmx-parallized" BitScan-routine, to traverse a
>BitBoard, finding the bit-indizies simultaniously for all ranks. It looks
>promising, not only because of SingleInstructionMultipleData, but also of less
>memory traffic, because of using mmx-registers in inner loops. The mmx-extension
>opcode for K7, "PMOVMSKB reg32, mmreg" is very usefull here.
>
>But today my experience with AMD's mmx-PopCount:
>
>The original PopCount from AMD's x86 Code optimazation guide:
>
>#include "amd3d.h"
>__declspec (naked)unsigned int __stdcall popcount64
>(unsigned __int64 v)
>{
>static const __int64 C55 =0x5555555555555555;
>static const __int64 C33 =0x3333333333333333;
>static const __int64 C0F =0x0F0F0F0F0F0F0F0F;
>__asm {
>  MOVD MM0,[ESP+4 ];v_low
>  PUNPCKLDQ MM0,[ESP+8 ];v
>  MOVQ MM1,MM0 ;v
>  PSRLD MM0,1 ;v >>1
>  PAND MM0,[C55 ] ;;(v >>1)&0x55555555
>  PSUBD MM1,MM0 ;w =v -((v >>1)& 0x55555555)
>  MOVQ MM0,MM1 ;w
>  PSRLD MM1,2 ;w >>2
>  PAND MM0,[C33 ] ;w &0x33333333
>  PAND MM1,[C33 ] ;(w >>2)&0x33333333
>  PADDD MM0,MM1 ;x =(w &0x33333333)+ ((w >>2)&0x33333333)
>  MOVQ MM1,MM0 ;x
>  PSRLD MM0,4 ;x >>4
>  PADDD MM0,MM1 ;x +(x >>4)
>  PAND MM0,[C0F ] ;y =(x +(x >>4)&;0x0F0F0F0F)
>  PXOR MM1,MM1 ;0
>  PSADBW (MM0,MM1);sum across all 8 bytes
>  MOVD EAX,MM0 ;result in EAX per calling convention
>  FEMMS ;clear MMX state
>  RET 8 ;pop 8-byte argument off
>  ;stack and return
> }
>}
>
>Few days ago i modified it by using a kind of
> "movq mmmreg, immediate32:immediate32"
>instead of reading 64Bit constants (aligned) from memory.
>
>To load a highlow-equal 64Bit constant into one mmx-register,
>three opcodes are neccessary:
>
>  mov  eax, 0x33333333
>  movd mm2, eax
>  punpckldq mm2, mm2
>
>I mixed these sequences inside the other opcodes to reduce dependency chains.
>And it seems to be a slighly faster for me (in NPS), but always slightly or
>significant better solution times on AMD K7 1,4GHz 256MB SDRAM. (I don't have
>any profiler information about relative duration of PopCount, but i use it quite
>often in eval, specially the combined as mentioned below).
>
>My inlined version:
>
>#ifdef ATHLON_MMX
>#include <amd3dx.h>
>
>#pragma warning(disable:4035) // no return
>
>__forceinline int BitCount (BitBoard bb)
>{
>	__asm
>	{
>		movd mm0, word ptr bb;
>		mov	 eax, 0x55555555
>		punpckldq mm0, word ptr bb + 4;
>		movd mm2, eax
>
>		movq mm1,mm0;
>		mov  eax, 0x33333333
>		punpckldq mm2, mm2
>		psrld mm0,1;
>		pand mm0,mm2;
>		psubd mm1,mm0;
>		movd mm2, eax
>		movq mm0,mm1
>		mov  eax, 0x0f0f0f0f
>		punpckldq mm2, mm2
>		psrld mm1, 2
>		pand mm0,mm2
>		pand mm1,mm2;
>		paddd mm0,mm1;
>		movd mm2, eax
>		movq mm1,mm0;
>		psrld mm0,4;
>		punpckldq mm2, mm2
>		paddd mm0,mm1;
>		pand mm0,mm2;
>		pxor mm1,mm1;
>		psadbw (mm0,mm1);
>		movd eax,mm0;
>//		femms;
>	}
>}
>...

I gave it a try on Athlon.  If there is any difference, it isn't dramatic.

I can't get away with leaving the femms out, since Beowulf does some floating
point.
[snip]



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.