Computer Chess Club Archives




Subject: On SSE2-Intrinsics

Author: Gerd Isenberg

Date: 05:16:33 03/28/05

Hi all,

Some notes on using Streaming SIMD Extensions 2 (SSE2) for computer chess,
especially on SSE2-intrinsics:

Without stalls the SSE2-integer instruction throughput is about one instruction
per cycle (AMD64). That is because most SSE2-instructions have a latency of two
cycles and usually perform two 64-bit alu operations.

AMD64 32/64-bit general purpose instructions perform much better on single ints
or __int64 - up to four instructions per cycle.

OTOH, when vectors are involved, SSE2 becomes relatively faster the smaller the
SIMD data type is (Bitboard[2], int[4], short[8] or byte[16] per xmm-register).
SSE2 also has some powerful instructions here and there without an easy
gp-instruction equivalent.
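
To make the data-type point concrete, a minimal sketch (not taken from any
measurement in this post): one paddb performs 16 byte additions, while one
paddq performs only two 64-bit additions. The helper names add16bytes and
add2quads are made up for illustration, and unaligned loads/stores are used so
the arrays need no special alignment.

```c
#include <emmintrin.h>
#include <stdint.h>

/* hypothetical helper: 16 independent byte additions (paddb), wrapping mod 256 */
void add16bytes(uint8_t dst[16], const uint8_t a[16], const uint8_t b[16])
{
    __m128i x = _mm_loadu_si128((const __m128i*) a);
    __m128i y = _mm_loadu_si128((const __m128i*) b);
    _mm_storeu_si128((__m128i*) dst, _mm_add_epi8(x, y));
}

/* hypothetical helper: only two 64-bit additions (paddq) per instruction */
void add2quads(uint64_t dst[2], const uint64_t a[2], const uint64_t b[2])
{
    __m128i x = _mm_loadu_si128((const __m128i*) a);
    __m128i y = _mm_loadu_si128((const __m128i*) b);
    _mm_storeu_si128((__m128i*) dst, _mm_add_epi64(x, y));
}
```

Both routines cost about the same number of instructions, but the byte version
does eight times as many element operations.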

Another aspect - SSE2-code reduces gp-register pressure and may execute in
"parallel" with independent gp-code. Future processors may also have 128-bit
alus and faster SSE-execution (let's say one cycle latency ;-) as well.

IMHO there are some "nice" bitboard applications using SSE2 gems.

Three options to use SSE2 for vectors of integers with
8/16 (P4,Centrino/AMD64) 128-bit xmm-registers:

1. (Intel) compiler supports automatic vectorization.
2. (deprecated) inline assembly with msvc.
3. SSE2-Intrinsics (including C++ wrappers for vector classes)

As mentioned recently by Aart J.C. Bik, the Intel compiler is capable of
automatic vectorization of loops over arrays of byte, short, int and __int64
with appropriate SSE2-instructions. The results are promising, but some
applications remain an assembler domain, for instance the
bit[64]*byte[64]-dot-product (see application 1 below).
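
As an illustration of the kind of loop meant here, a sketch of a simple
element-wise loop over short arrays. The function name addShorts is
hypothetical; whether a given compiler actually vectorizes it (e.g. with paddw)
depends on the compiler and its options and should be checked in the generated
code.

```c
#include <stddef.h>

/* hypothetical example: independent iterations over non-overlapping arrays,
   the shape of loop an auto-vectorizer can map to 8 shorts per xmm-register */
void addShorts(short * restrict dst, const short * restrict a,
               const short * restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (short)(a[i] + b[i]);
}
```

The restrict qualifiers (C99) only promise the compiler that the arrays do not
overlap; they do not force vectorization.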

Inline assembly is no longer supported by msc for AMD64 in 64-bit mode. One
"wintel-portable" alternative is to use SSE2-intrinsics, which work for P4 and
Centrino as well as for AMD64 and its Intel clone with 64-bit extensions.

Using intrinsics is not really a win in abstraction - it is more like a
C-level assembler, with the advantage that the compiler allocates registers and
schedules instructions.

For me, being rather familiar with SSE2-assembly mnemonics in the meantime, the
way to port routines to SSE2-intrinsics is first to write (inline) assembly and
then to perform a more or less automatic "one to one"-replacement of SSE2
assembler-instructions with the appropriate intrinsics. Since the prototype of
an SSE2-intrinsic is often __m128i = _mm_xxxx_si128 (__m128i, __m128i);
with a result separate from both operands, one may skip some
movdqa xmm-reg, xmm-reg.
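
A small sketch of such a replacement. The three-instruction assembly sequence
and the function name andThenXor are made up for illustration; the point is
only that the register copy has no counterpart in the intrinsic version.

```c
#include <emmintrin.h>

/* assembly being ported (pand destroys its first operand, so xmm0 has to be
   copied first to stay alive):
       movdqa xmm1, xmm0
       pand   xmm1, xmm2
       pxor   xmm0, xmm1
*/
__m128i andThenXor(__m128i xmm0, __m128i xmm2)
{
    __m128i xmm1 = _mm_and_si128(xmm0, xmm2);  /* replaces movdqa + pand */
    return _mm_xor_si128(xmm0, xmm1);          /* replaces pxor */
}
```

Whether the compiler reintroduces a movdqa is its own register-allocation
decision.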

Also, in the first step I keep the identifiers of the xmm-registers as "usual"
__m128i register variables. Loading and storing data is a bit different, see
application 2 below.
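
A minimal sketch of the load/store side, since memory operands become explicit
calls with pointer casts. The helper names loadBitboard and storePair are
hypothetical, and stdint types stand in for __int64.

```c
#include <emmintrin.h>
#include <stdint.h>

/* hypothetical helper: load one 64-bit bitboard into the low half (movq);
   the upper half of the register is zeroed */
__m128i loadBitboard(const uint64_t *bb)
{
    return _mm_loadl_epi64((const __m128i*) bb);
}

/* hypothetical helper: store both halves to any address (movdqu);
   _mm_store_si128 (movdqa) would require a 16-byte aligned destination */
void storePair(uint64_t dst[2], __m128i x)
{
    _mm_storeu_si128((__m128i*) dst, x);
}
```

The aligned variants _mm_load_si128 and _mm_store_si128 are preferable when
the data is known to be 16-byte aligned.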

I tested a few routines using the ms visual C++ 2005 beta 32-bit compiler. The
generated code for the intrinsics looks good (no more unnecessary stores and
loads) and the routines run at about the same speed as the inline assembly.


Two applications:

1. The bit[64]*byte[64]-dot-product.

This is really an SSE2-domain!

#include <emmintrin.h>
typedef unsigned __int64 BitBoard;
#define XMM_ALIGN __declspec(align(16))

/* pure C version */

int dotProduct_C(BitBoard bb, BYTE weights[])
{
  int sum = 0;
  for (int i=0; i < 64; i++, bb >>= 1)
    if ( bb & 1 ) sum += weights[i];
  return sum;
}

/* intrinsic-version ~42 cycles */
/* for average weights < 64 !   */

int dotProduct(BitBoard bb, BYTE weights[] /* XMM_ALIGN */)
{
  /* one bit per byte, 0x01..0x80 in each half; value as implied by the
     unpack/compare sequence below */
  static const BitBoard XMM_ALIGN sbitmask[2] = {
    0x8040201008040201,
    0x8040201008040201
  };
  __m128i x0, x1, x2, x3, bm;
  bm = _mm_load_si128  ( (__m128i*) sbitmask);
  x0 = _mm_loadl_epi64 ( (__m128i*) &bb);
  // replicate each of the 8 bytes of bb eight times
  x0 = _mm_unpacklo_epi8  (x0, x0);
  x2 = _mm_unpackhi_epi16 (x0, x0);
  x0 = _mm_unpacklo_epi16 (x0, x0);
  x1 = _mm_unpackhi_epi32 (x0, x0);
  x0 = _mm_unpacklo_epi32 (x0, x0);
  x3 = _mm_unpackhi_epi32 (x2, x2);
  x2 = _mm_unpacklo_epi32 (x2, x2);
  // extend bits to bytes
  x0 = _mm_cmpeq_epi8 (_mm_and_si128  (x0, bm), bm);
  x1 = _mm_cmpeq_epi8 (_mm_and_si128  (x1, bm), bm);
  x2 = _mm_cmpeq_epi8 (_mm_and_si128  (x2, bm), bm);
  x3 = _mm_cmpeq_epi8 (_mm_and_si128  (x3, bm), bm);
  // multiply by "and" with -1 or 0
  __m128i* pw = (__m128i*) weights;
  x0 = _mm_and_si128  (x0, _mm_load_si128 (pw+0));
  x1 = _mm_and_si128  (x1, _mm_load_si128 (pw+1));
  x2 = _mm_and_si128  (x2, _mm_load_si128 (pw+2));
  x3 = _mm_and_si128  (x3, _mm_load_si128 (pw+3));
  // add all bytes (with saturation)
  x0 = _mm_adds_epu8  (x0, x1);
  x0 = _mm_adds_epu8  (x0, x2);
  x0 = _mm_adds_epu8  (x0, x3);
  // horizontal sum of 8 bytes into the low word of each half
  x0 = _mm_sad_epu8   (x0, _mm_setzero_si128 ());
  return _mm_extract_epi16 (x0, 0)
       + _mm_extract_epi16 (x0, 4);
}
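
The routine can be checked against the scalar loop with a portable sketch like
the following. The assumptions are mine, not part of the original routines:
stdint types instead of __int64 and BYTE, unaligned loads so that weights needs
no 16-byte alignment, the names dotProductRef and dotProductSSE2, and the bit
mask 0x8040201008040201 (one bit per byte), which is the value the unpack and
compare sequence requires.

```c
#include <emmintrin.h>
#include <stdint.h>

/* scalar reference, same loop as the pure C version */
int dotProductRef(uint64_t bb, const uint8_t weights[64])
{
    int sum = 0;
    for (int i = 0; i < 64; i++, bb >>= 1)
        if (bb & 1) sum += weights[i];
    return sum;
}

/* SSE2 sketch; valid while no sum of four weights 16 apart exceeds 255 */
int dotProductSSE2(uint64_t bb, const uint8_t weights[64])
{
    /* assumed mask: bytes 0x01,0x02,...,0x80 in each 64-bit half */
    static const uint64_t sbitmask[2] = {
        0x8040201008040201ULL, 0x8040201008040201ULL
    };
    const __m128i *pw = (const __m128i*) weights;
    __m128i bm = _mm_loadu_si128((const __m128i*) sbitmask);
    __m128i x0 = _mm_loadl_epi64((const __m128i*) &bb);
    __m128i x1, x2, x3;
    /* replicate each byte of bb eight times: x0..x3 hold bytes 0-1 .. 6-7 */
    x0 = _mm_unpacklo_epi8 (x0, x0);
    x2 = _mm_unpackhi_epi16(x0, x0);
    x0 = _mm_unpacklo_epi16(x0, x0);
    x1 = _mm_unpackhi_epi32(x0, x0);
    x0 = _mm_unpacklo_epi32(x0, x0);
    x3 = _mm_unpackhi_epi32(x2, x2);
    x2 = _mm_unpacklo_epi32(x2, x2);
    /* bit set -> byte 0xFF, bit clear -> byte 0x00 */
    x0 = _mm_cmpeq_epi8(_mm_and_si128(x0, bm), bm);
    x1 = _mm_cmpeq_epi8(_mm_and_si128(x1, bm), bm);
    x2 = _mm_cmpeq_epi8(_mm_and_si128(x2, bm), bm);
    x3 = _mm_cmpeq_epi8(_mm_and_si128(x3, bm), bm);
    /* keep only the weights of set bits */
    x0 = _mm_and_si128(x0, _mm_loadu_si128(pw + 0));
    x1 = _mm_and_si128(x1, _mm_loadu_si128(pw + 1));
    x2 = _mm_and_si128(x2, _mm_loadu_si128(pw + 2));
    x3 = _mm_and_si128(x3, _mm_loadu_si128(pw + 3));
    /* vertical saturated adds, then horizontal sum per 64-bit half */
    x0 = _mm_adds_epu8(x0, x1);
    x0 = _mm_adds_epu8(x0, x2);
    x0 = _mm_adds_epu8(x0, x3);
    x0 = _mm_sad_epu8 (x0, _mm_setzero_si128());
    return _mm_extract_epi16(x0, 0) + _mm_extract_epi16(x0, 4);
}
```

Comparing both functions on single-bit boards with weights[i] = i is a quick
way to catch a wrong mask or a swapped unpack, since each board must then
return its own bit index.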

2. Expanding a quad-bitboard to 16-disjoint bitboards.

This is more a domain of 64-bit compilers using gp-registers,
but of course a clear win for an SSE2-capable x86-32.

/* pure C */
void quad2hexBB_C(BitBoard h[], const BitBoard q[])
{
  h[ 0] = ~q[3] & ~q[2] & ~q[1] & ~q[0];
  h[ 1] = ~q[3] & ~q[2] & ~q[1] &  q[0];
  h[ 2] = ~q[3] & ~q[2] &  q[1] & ~q[0];
  h[ 3] = ~q[3] & ~q[2] &  q[1] &  q[0];
  h[ 4] = ~q[3] &  q[2] & ~q[1] & ~q[0];
  h[ 5] = ~q[3] &  q[2] & ~q[1] &  q[0];
  h[ 6] = ~q[3] &  q[2] &  q[1] & ~q[0];
  h[ 7] = ~q[3] &  q[2] &  q[1] &  q[0];
  h[ 8] =  q[3] & ~q[2] & ~q[1] & ~q[0];
  h[ 9] =  q[3] & ~q[2] & ~q[1] &  q[0];
  h[10] =  q[3] & ~q[2] &  q[1] & ~q[0];
  h[11] =  q[3] & ~q[2] &  q[1] &  q[0];
  h[12] =  q[3] &  q[2] & ~q[1] & ~q[0];
  h[13] =  q[3] &  q[2] & ~q[1] &  q[0];
  h[14] =  q[3] &  q[2] &  q[1] & ~q[0];
  h[15] =  q[3] &  q[2] &  q[1] &  q[0];
}
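
What the sixteen lines compute can also be written as a loop, which may make
the pattern easier to see: h[i] holds exactly the squares whose four q-bits
spell the number i. This is a readability sketch only, with a hypothetical name
quad2hexLoop and stdint types; it is not meant to be fast.

```c
#include <stdint.h>

/* hypothetical loop form of the expansion: bit k of the index i selects
   q[k] (bit set) or ~q[k] (bit clear) */
void quad2hexLoop(uint64_t h[16], const uint64_t q[4])
{
    for (int i = 0; i < 16; i++) {
        uint64_t bb = ~(uint64_t) 0;
        for (int k = 0; k < 4; k++)
            bb &= ((i >> k) & 1) ? q[k] : ~q[k];
        h[i] = bb;
    }
}
```

The sixteen results are pairwise disjoint and their union is the full board,
which gives a cheap sanity check for any faster version.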

/* inline assembly ~38 cycles */

void quad2hexBB_A(BitBoard h[], const BitBoard q[])
{
  __asm
  {
    mov         eax,  [q]
    pcmpeqd     xmm7, xmm7        ; -1
    movdqa      xmm0, [eax+0*16]  ;       q1    :          q0
    movdqa      xmm4, [eax+1*16]  ; q3          :    q2
    mov         eax,  [h]
    movdqa      xmm5, xmm0
    movdqa      xmm1, xmm0
    por         xmm0, xmm4
    pand        xmm5, xmm4        ; q3 &  q1    :    q2 &  q0
    pxor        xmm0, xmm7        ;~q3 & ~q1    :   ~q2 & ~q0
    pxor        xmm1, xmm5        ;~q3 &  q1    :   ~q2 &  q0
    pxor        xmm4, xmm5        ; q3 & ~q1    :    q2 & ~q0
    movdqa      xmm6, xmm0
    movdqa      xmm7, xmm4
    punpckhqdq  xmm0, xmm0        ;~q3 & ~q1    :~q3 & ~q1
    punpckhqdq  xmm4, xmm4        ; q3 & ~q1    : q3 & ~q1
    punpcklqdq  xmm6, xmm1        ;   ~q2 &  q0 :   ~q2 & ~q0
    punpcklqdq  xmm7, xmm5        ;    q2 &  q0 :    q2 & ~q0
    punpckhqdq  xmm1, xmm1        ;~q3 &  q1    :~q3 &  q1
    punpckhqdq  xmm5, xmm5        ; q3 &  q1    : q3 &  q1
    movdqa      xmm2, xmm0
    movdqa      xmm3, xmm1
    pand        xmm0, xmm6        ;~q3~q2~q1 q0 :~q3~q2~q1~q0
    pand        xmm1, xmm6        ;~q3~q2 q1 q0 :~q3~q2 q1~q0
    pand        xmm2, xmm7        ;~q3 q2~q1 q0 :~q3 q2~q1~q0
    pand        xmm3, xmm7        ;~q3 q2 q1 q0 :~q3 q2 q1~q0
    movdqa      [eax+0*16], xmm0  ;  1:0
    movdqa      [eax+1*16], xmm1  ;  3:2
    movdqa      [eax+2*16], xmm2  ;  5:4
    movdqa      [eax+3*16], xmm3  ;  7:6
    movdqa      xmm0, xmm4
    movdqa      xmm1, xmm5
    pand        xmm4, xmm6        ; q3~q2~q1 q0 : q3~q2~q1~q0
    pand        xmm5, xmm6        ; q3~q2 q1 q0 : q3~q2 q1~q0
    pand        xmm0, xmm7        ; q3 q2~q1 q0 : q3 q2~q1~q0
    pand        xmm1, xmm7        ; q3 q2 q1 q0 : q3 q2 q1~q0
    movdqa      [eax+4*16], xmm4  ;  9:8
    movdqa      [eax+5*16], xmm5  ;  B:A
    movdqa      [eax+6*16], xmm0  ;  D:C
    movdqa      [eax+7*16], xmm1  ;  F:E
  }
}

/* intrinsic-version ~34 cycles */

void quad2hexBB(BitBoard h[], const BitBoard q[])
{
  __m128i a, b, c, d, e, f;
  __m128i* p = (__m128i*) q;
  a     =  _mm_load_si128(p+0);   //       q1 :       q0
  c     =  _mm_load_si128(p+1);   // q3       : q2
  p = (__m128i*) h;
  b = d =  a;
  a     =  _mm_or_si128  (a, c);
  d     =  _mm_and_si128 (d, c);  // q3 &  q1 : q2 &  q0
  // f is still uninitialized here; cmpeq(f,f) only serves as -1 setter
  e = a =  _mm_xor_si128 (a, _mm_cmpeq_epi32 (f,f));
  b     =  _mm_xor_si128 (b, d);  //~q3 &  q1 :~q2 &  q0
  f = c =  _mm_xor_si128 (c, d);  // q3 & ~q1 : q2 & ~q0
  a = _mm_unpackhi_epi64 (a, a);
  c = _mm_unpackhi_epi64 (c, c);
  e = _mm_unpacklo_epi64 (e, b);
  f = _mm_unpacklo_epi64 (f, d);
  b = _mm_unpackhi_epi64 (b, b);
  d = _mm_unpackhi_epi64 (d, d);
  _mm_store_si128 (p+0, _mm_and_si128 (a, e));
  _mm_store_si128 (p+1, _mm_and_si128 (b, e));
  _mm_store_si128 (p+2, _mm_and_si128 (a, f));
  _mm_store_si128 (p+3, _mm_and_si128 (b, f));
  _mm_store_si128 (p+4, _mm_and_si128 (c, e));
  _mm_store_si128 (p+5, _mm_and_si128 (d, e));
  _mm_store_si128 (p+6, _mm_and_si128 (c, f));
  _mm_store_si128 (p+7, _mm_and_si128 (d, f));
}

There is one (minor) problem in the code above: _mm_cmpeq_epi32(f,f) is called
with an uninitialized variable as both arguments. It serves as a -1 setter,
forcing pcmpeqd xmmi,xmmi, where the initial value of xmmi does not matter. The
debug version fires a runtime exception, but the release version was a few
cycles faster using the uninitialized xmm-register variable. Unfortunately
there is no -1 setter like _mm_setzero_si128() for zero - at least I didn't
find one.
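
One way to avoid the uninitialized variable, as a sketch: _mm_set1_epi32(-1)
produces all ones from a defined value. The helper names allOnes and not128 are
hypothetical. Whether the compiler turns this into pcmpeqd xmmi,xmmi or loads a
constant from memory is the compiler's choice, so it is an assumption to verify
in the generated code that the few cycles are not lost again.

```c
#include <emmintrin.h>

/* hypothetical helper: all 128 bits set, without reading an
   uninitialized variable */
__m128i allOnes(void)
{
    return _mm_set1_epi32(-1);
}

/* 128-bit complement, i.e. the xor with -1 step of the routine above */
__m128i not128(__m128i x)
{
    return _mm_xor_si128(x, allOnes());
}
```

Another defined variant is to zero the variable first and compare it with
itself, at the price of one extra pxor.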

