Computer Chess Club Archives

Search

Terms

Messages

Subject: mmx-PopCount

Author: Gerd Isenberg

Date: 10:15:03 02/15/02

My first trials using mmx-registers were very frustrating.
And i simply missed some opcodes, like shifting bytewise.

So the one and only piece of mmx-code in my chess program was the well known
mmx-PopCount from AMD.
Currently i'm thinking about a "mmx-parallized" BitScan-routine, to traverse a
BitBoard, finding the bit-indizies simultaniously for all ranks. It looks
promising, not only because of SingleInstructionMultipleData, but also of less
memory traffic, because of using mmx-registers in inner loops. The mmx-extension
opcode for K7, "PMOVMSKB reg32, mmreg" is very usefull here.

But today my experience with AMD's mmx-PopCount:

The original PopCount from AMD's x86 Code optimazation guide:

#include "amd3d.h"
__declspec (naked)unsigned int __stdcall popcount64
(unsigned __int64 v)
{
static const __int64 C55 =0x5555555555555555;
static const __int64 C33 =0x3333333333333333;
static const __int64 C0F =0x0F0F0F0F0F0F0F0F;
__asm {
  MOVD MM0,[ESP+4 ];v_low
  PUNPCKLDQ MM0,[ESP+8 ];v
  MOVQ MM1,MM0 ;v
  PSRLD MM0,1 ;v >>1
  PAND MM0,[C55 ] ;;(v >>1)&0x55555555
  PSUBD MM1,MM0 ;w =v -((v >>1)& 0x55555555)
  MOVQ MM0,MM1 ;w
  PSRLD MM1,2 ;w >>2
  PAND MM0,[C33 ] ;w &0x33333333
  PAND MM1,[C33 ] ;(w >>2)&0x33333333
  PADDD MM0,MM1 ;x =(w &0x33333333)+ ((w >>2)&0x33333333)
  MOVQ MM1,MM0 ;x
  PSRLD MM0,4 ;x >>4
  PADDD MM0,MM1 ;x +(x >>4)
  PAND MM0,[C0F ] ;y =(x +(x >>4)&;0x0F0F0F0F)
  PXOR MM1,MM1 ;0
  PSADBW (MM0,MM1);sum across all 8 bytes
  MOVD EAX,MM0 ;result in EAX per calling convention
  FEMMS ;clear MMX state
  RET 8 ;pop 8-byte argument off
  ;stack and return
 }
}

Few days ago i modified it by using a kind of
 "movq mmmreg, immediate32:immediate32"
instead of reading 64Bit constants (aligned) from memory.

To load a highlow-equal 64Bit constant into one mmx-register,
three opcodes are neccessary:

  mov  eax, 0x33333333
  movd mm2, eax
  punpckldq mm2, mm2

I mixed these sequences inside the other opcodes to reduce dependency chains.
And it seems to be a slighly faster for me (in NPS), but always slightly or
significant better solution times on AMD K7 1,4GHz 256MB SDRAM. (I don't have
any profiler information about relative duration of PopCount, but i use it quite
often in eval, specially the combined as mentioned below).

My inlined version:

#ifdef ATHLON_MMX
#include <amd3dx.h>

#pragma warning(disable:4035) // no return

__forceinline int BitCount (BitBoard bb)
{
	__asm
	{
		movd mm0, word ptr bb;
		mov	 eax, 0x55555555
		punpckldq mm0, word ptr bb + 4;
		movd mm2, eax

		movq mm1,mm0;
		mov  eax, 0x33333333
		punpckldq mm2, mm2
		psrld mm0,1;
		pand mm0,mm2;
		psubd mm1,mm0;
		movd mm2, eax
		movq mm0,mm1
		mov  eax, 0x0f0f0f0f
		punpckldq mm2, mm2
		psrld mm1, 2
		pand mm0,mm2
		pand mm1,mm2;
		paddd mm0,mm1;
		movd mm2, eax
		movq mm1,mm0;
		psrld mm0,4;
		punpckldq mm2, mm2
		paddd mm0,mm1;
		pand mm0,mm2;
		pxor mm1,mm1;
		psadbw (mm0,mm1);
		movd eax,mm0;
//		femms;
	}
}
...

I use similar routines, to count up to four Bitboards simultaniously, e.g. to
determine center weighted piece controls. The orininal piece attack BitBoard in
one mmx-register is copied into three others. These three copies are masked
consecutively. The first excludes boarders, the second includes only the outer
center (16 center squares) the third only the inner center ( 4 center squares).
Then the above algorithm is done four times, using all eight mmx-registers and
mixing independend opcodes together. Performance is clearly better than doing it
four times with a single PopCount.

My questions to all using similar AMD PopCount. Is it faster to avoid the memory
reads in your program, too? Can you confirm my experience, or is it more a
random effect?

cheers,
Gerd

Re: mmx-PopCount Dann Corbit 21:26:47 02/15/02
- Re: mmx-PopCount maria clara benedicto 00:17:18 02/17/02
  - Re: mmx-PopCount Gian-Carlo Pascutto 01:03:58 02/17/02
    - Re: mmx-PopCount thanks gian carlo p. n.text maria clara benedicto 07:19:58 02/18/02

This page took 0.01 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.