Author: Gerd Isenberg
Date: 10:15:03 02/15/02
My first trials using mmx-registers were very frustrating.
And i simply missed some opcodes, like shifting bytewise.
So the one and only piece of mmx-code in my chess program was the well known
mmx-PopCount from AMD.
Currently i'm thinking about a "mmx-parallized" BitScan-routine, to traverse a
BitBoard, finding the bit-indizies simultaniously for all ranks. It looks
promising, not only because of SingleInstructionMultipleData, but also of less
memory traffic, because of using mmx-registers in inner loops. The mmx-extension
opcode for K7, "PMOVMSKB reg32, mmreg" is very usefull here.
But today my experience with AMD's mmx-PopCount:
The original PopCount from AMD's x86 Code optimazation guide:
#include "amd3d.h"
__declspec (naked)unsigned int __stdcall popcount64
(unsigned __int64 v)
{
static const __int64 C55 =0x5555555555555555;
static const __int64 C33 =0x3333333333333333;
static const __int64 C0F =0x0F0F0F0F0F0F0F0F;
__asm {
MOVD MM0,[ESP+4 ];v_low
PUNPCKLDQ MM0,[ESP+8 ];v
MOVQ MM1,MM0 ;v
PSRLD MM0,1 ;v >>1
PAND MM0,[C55 ] ;;(v >>1)&0x55555555
PSUBD MM1,MM0 ;w =v -((v >>1)& 0x55555555)
MOVQ MM0,MM1 ;w
PSRLD MM1,2 ;w >>2
PAND MM0,[C33 ] ;w &0x33333333
PAND MM1,[C33 ] ;(w >>2)&0x33333333
PADDD MM0,MM1 ;x =(w &0x33333333)+ ((w >>2)&0x33333333)
MOVQ MM1,MM0 ;x
PSRLD MM0,4 ;x >>4
PADDD MM0,MM1 ;x +(x >>4)
PAND MM0,[C0F ] ;y =(x +(x >>4)&;0x0F0F0F0F)
PXOR MM1,MM1 ;0
PSADBW (MM0,MM1);sum across all 8 bytes
MOVD EAX,MM0 ;result in EAX per calling convention
FEMMS ;clear MMX state
RET 8 ;pop 8-byte argument off
;stack and return
}
}
Few days ago i modified it by using a kind of
"movq mmmreg, immediate32:immediate32"
instead of reading 64Bit constants (aligned) from memory.
To load a highlow-equal 64Bit constant into one mmx-register,
three opcodes are neccessary:
mov eax, 0x33333333
movd mm2, eax
punpckldq mm2, mm2
I mixed these sequences inside the other opcodes to reduce dependency chains.
And it seems to be a slighly faster for me (in NPS), but always slightly or
significant better solution times on AMD K7 1,4GHz 256MB SDRAM. (I don't have
any profiler information about relative duration of PopCount, but i use it quite
often in eval, specially the combined as mentioned below).
My inlined version:
#ifdef ATHLON_MMX
#include <amd3dx.h>
#pragma warning(disable:4035) // no return
__forceinline int BitCount (BitBoard bb)
{
__asm
{
movd mm0, word ptr bb;
mov eax, 0x55555555
punpckldq mm0, word ptr bb + 4;
movd mm2, eax
movq mm1,mm0;
mov eax, 0x33333333
punpckldq mm2, mm2
psrld mm0,1;
pand mm0,mm2;
psubd mm1,mm0;
movd mm2, eax
movq mm0,mm1
mov eax, 0x0f0f0f0f
punpckldq mm2, mm2
psrld mm1, 2
pand mm0,mm2
pand mm1,mm2;
paddd mm0,mm1;
movd mm2, eax
movq mm1,mm0;
psrld mm0,4;
punpckldq mm2, mm2
paddd mm0,mm1;
pand mm0,mm2;
pxor mm1,mm1;
psadbw (mm0,mm1);
movd eax,mm0;
// femms;
}
}
...
I use similar routines, to count up to four Bitboards simultaniously, e.g. to
determine center weighted piece controls. The orininal piece attack BitBoard in
one mmx-register is copied into three others. These three copies are masked
consecutively. The first excludes boarders, the second includes only the outer
center (16 center squares) the third only the inner center ( 4 center squares).
Then the above algorithm is done four times, using all eight mmx-registers and
mixing independend opcodes together. Performance is clearly better than doing it
four times with a single PopCount.
My questions to all using similar AMD PopCount. Is it faster to avoid the memory
reads in your program, too? Can you confirm my experience, or is it more a
random effect?
cheers,
Gerd
This page took 0.01 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.