Author: Dann Corbit
Date: 21:26:47 02/15/02
Go up one level in this thread
On February 15, 2002 at 13:15:03, Gerd Isenberg wrote:
>My first trials using mmx-registers were very frustrating.
>And i simply missed some opcodes, like shifting bytewise.
>
>So the one and only piece of mmx-code in my chess program was the well known
>mmx-PopCount from AMD.
>Currently i'm thinking about a "mmx-parallized" BitScan-routine, to traverse a
>BitBoard, finding the bit-indizies simultaniously for all ranks. It looks
>promising, not only because of SingleInstructionMultipleData, but also of less
>memory traffic, because of using mmx-registers in inner loops. The mmx-extension
>opcode for K7, "PMOVMSKB reg32, mmreg" is very usefull here.
>
>But today my experience with AMD's mmx-PopCount:
>
>The original PopCount from AMD's x86 Code optimazation guide:
>
>#include "amd3d.h"
>__declspec (naked)unsigned int __stdcall popcount64
>(unsigned __int64 v)
>{
>static const __int64 C55 =0x5555555555555555;
>static const __int64 C33 =0x3333333333333333;
>static const __int64 C0F =0x0F0F0F0F0F0F0F0F;
>__asm {
> MOVD MM0,[ESP+4 ];v_low
> PUNPCKLDQ MM0,[ESP+8 ];v
> MOVQ MM1,MM0 ;v
> PSRLD MM0,1 ;v >>1
> PAND MM0,[C55 ] ;;(v >>1)&0x55555555
> PSUBD MM1,MM0 ;w =v -((v >>1)& 0x55555555)
> MOVQ MM0,MM1 ;w
> PSRLD MM1,2 ;w >>2
> PAND MM0,[C33 ] ;w &0x33333333
> PAND MM1,[C33 ] ;(w >>2)&0x33333333
> PADDD MM0,MM1 ;x =(w &0x33333333)+ ((w >>2)&0x33333333)
> MOVQ MM1,MM0 ;x
> PSRLD MM0,4 ;x >>4
> PADDD MM0,MM1 ;x +(x >>4)
> PAND MM0,[C0F ] ;y =(x +(x >>4)&;0x0F0F0F0F)
> PXOR MM1,MM1 ;0
> PSADBW (MM0,MM1);sum across all 8 bytes
> MOVD EAX,MM0 ;result in EAX per calling convention
> FEMMS ;clear MMX state
> RET 8 ;pop 8-byte argument off
> ;stack and return
> }
>}
>
>Few days ago i modified it by using a kind of
> "movq mmmreg, immediate32:immediate32"
>instead of reading 64Bit constants (aligned) from memory.
>
>To load a highlow-equal 64Bit constant into one mmx-register,
>three opcodes are neccessary:
>
> mov eax, 0x33333333
> movd mm2, eax
> punpckldq mm2, mm2
>
>I mixed these sequences inside the other opcodes to reduce dependency chains.
>And it seems to be a slighly faster for me (in NPS), but always slightly or
>significant better solution times on AMD K7 1,4GHz 256MB SDRAM. (I don't have
>any profiler information about relative duration of PopCount, but i use it quite
>often in eval, specially the combined as mentioned below).
>
>My inlined version:
>
>#ifdef ATHLON_MMX
>#include <amd3dx.h>
>
>#pragma warning(disable:4035) // no return
>
>__forceinline int BitCount (BitBoard bb)
>{
> __asm
> {
> movd mm0, word ptr bb;
> mov eax, 0x55555555
> punpckldq mm0, word ptr bb + 4;
> movd mm2, eax
>
> movq mm1,mm0;
> mov eax, 0x33333333
> punpckldq mm2, mm2
> psrld mm0,1;
> pand mm0,mm2;
> psubd mm1,mm0;
> movd mm2, eax
> movq mm0,mm1
> mov eax, 0x0f0f0f0f
> punpckldq mm2, mm2
> psrld mm1, 2
> pand mm0,mm2
> pand mm1,mm2;
> paddd mm0,mm1;
> movd mm2, eax
> movq mm1,mm0;
> psrld mm0,4;
> punpckldq mm2, mm2
> paddd mm0,mm1;
> pand mm0,mm2;
> pxor mm1,mm1;
> psadbw (mm0,mm1);
> movd eax,mm0;
>// femms;
> }
>}
>...
I gave it a try on Athlon. If there is any difference, it isn't dramatic.
I can't get away with leaving the femms out, since Beowulf does some floating
point.
[snip]
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.