Author: Gerd Isenberg
Date: 23:39:16 01/12/05
On January 12, 2005 at 23:31:34, Aart J.C. Bik wrote:
>Hi Folks,
>This is my first post to this forum. I work in the Intel Compiler Lab on
>automatic vectorization for multimedia extensions (see
>http://www.intelcompiler.com for some details), but in my free time I have
>started working on a chess engine that will be optimized for the
>Streaming-SIMD-extensions (SSE/SSE2/SSE3). Currently I am making my first steps
>on getting my own Universal Chess Interface implementation to work with my
>favorite chess environment: the Fritz interface. I noticed that some things
>seem to work a little differently in Fritz than in the formal description (is
>this a known deviation?), but I am pretty confident I will figure it out.
>Are there other ongoing research projects that try to exploit the
>Streaming-SIMD-extensions? If so, I would like to hear about this. Also,
>although initially I want to focus on the engine, eventually I may also want to
>implement my own book and access to Nalimov tablebases during the search. Stefan
>Meyer-Kahlen kindly pointed me to the Crafty download to figure out how the
>tablebases can be accessed. Will incorporating some Crafty source into my own
>engine have any impact on licensing?
>Nice "meeting you folks" and looking forward to some nice discussions!
>Aart Bik
>http://www.aartbik.com
Hi Aart,
very interesting. I am doing some, let's say, private "research" on SIMD
instructions in my bitboard-based chess program; my current program uses MMX
inline assembly. But I am working on an x86-64 approach using SSE2 with my own
C++ wrapper classes: a base class for the memory layout and two derived classes,
one using SSE2 intrinsics and one SWAR version in plain C++. Both derived
classes have most operators overloaded.
The main target of the SIMD instructions is fill-based attack generation using
parallel-prefix Kogge-Stone fills, introduced here some time ago by Steffan
Westcott (you may use the CCC search engine for further information).
For instance, I can write Kogge-Stone fills in C++ with template type <XMM> or
<GP>, which controls the register file used for the local variables of each
fill direction. So one can mix fills for different, disjoint directions across
different register files.
template <class T> __forceinline
void leftAttacks(sTarget* pTarget, const sSource* pSource)
{
    T gl(pSource->rooks);
    T pl(pSource->occup);
    pl  = (~pl).notH();
    gl |= pl & (gl>>1);
    pl &= pl>>1;
    gl |= pl & (gl>>2);
    pl &= pl>>2;
    gl |= pl & (gl>>4);
    (gl>>1).notH().store(&pTarget->le);
}
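To make the template above concrete, here is a minimal sketch of what a plain C++ <GP> type could look like. Everything in it is an assumption for illustration only: the names Gp64, Source, Target and leftAttacks64 are hypothetical, the wrapper holds a single 64-bit board (the real classes presumably hold 128 bits), and the square mapping is assumed to be a1 = bit 0, h1 = bit 7, so that ">>1" moves toward the a-file and notH() removes the wraps onto the h-file.

```cpp
#include <cstdint>

// Hypothetical 64-bit stand-in for the <GP> wrapper; not the original class.
struct Gp64 {
    std::uint64_t v;
    explicit Gp64(std::uint64_t x) : v(x) {}
    Gp64 operator~() const { return Gp64(~v); }
    Gp64 operator>>(int s) const { return Gp64(v >> s); }
    Gp64 operator&(Gp64 o) const { return Gp64(v & o.v); }
    Gp64& operator|=(Gp64 o) { v |= o.v; return *this; }
    Gp64& operator&=(Gp64 o) { v &= o.v; return *this; }
    // Clear the h-file (assumes a1 = bit 0, h1 = bit 7).
    Gp64 notH() const { return Gp64(v & 0x7f7f7f7f7f7f7f7fULL); }
    void store(std::uint64_t* p) const { *p = v; }
};

// Hypothetical source/target layouts, reduced to the fields used here.
struct Source { std::uint64_t rooks, occup; };
struct Target { std::uint64_t le; };

// Same Kogge-Stone steps as in the post: gl is the generator, pl the propagator.
template <class T>
inline void leftAttacksSketch(Target* pTarget, const Source* pSource) {
    T gl(pSource->rooks);
    T pl(pSource->occup);
    pl  = (~pl).notH();        // empty squares, minus the h-file (no wraps)
    gl |= pl & (gl >> 1);      // fill 1 square
    pl &= pl >> 1;
    gl |= pl & (gl >> 2);      // fill up to 3 squares
    pl &= pl >> 2;
    gl |= pl & (gl >> 4);      // fill up to 7 squares
    (gl >> 1).notH().store(&pTarget->le);  // one more step yields the attacks
}

// Convenience wrapper for trying it out with plain integers.
inline std::uint64_t leftAttacks64(std::uint64_t rooks, std::uint64_t occup) {
    Source s{rooks, occup};
    Target t{0};
    leftAttacksSketch<Gp64>(&t, &s);
    return t.le;
}
```

With a rook on d1 and a blocker on b1, leftAttacks64(0x08, 0x0A) gives 0x06, i.e. c1 and the blocker square b1. An SSE2 flavour would keep the same member names and only swap the implementation for intrinsics on a 128-bit register.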
This is not really a domain of SSE2, but SSE2 may be used here to gain some
additional throughput in conjunction with independent 2*64-bit gp instruction
chains.
Another interesting application of SSE2 in chess programs is IMHO dot products
of bit vectors of 32/64 bits with vectors of 32/64 bytes or short ints. That
might be useful for evaluation purposes, e.g. to do a weighted population count
of several attack bitboards. Something like this with inline asm:
int dotProduct(BitBoard bb, BYTE weightsOfMax0x3f[] /* XMM_ALIGN */)
{
    static const BitBoard XMM_ALIGN bits[4] =
        {0x8040201008040201,0x8040201008040201, 0, 0};
    __asm
    {
        movq       xmm0, [bb]       ; 00000000000000008040201008040201
        punpcklbw  xmm0, xmm0       ; 80804040202010100808040402020101
        lea        edx, [bits]
        mov        eax, [weightsOfMax0x3f]
        movdqa     xmm2, xmm0
        punpcklwd  xmm0, xmm0       ; 08080808040404040202020201010101
        punpckhwd  xmm2, xmm2       ; 80808080404040402020202010101010
        movdqa     xmm1, xmm0
        movdqa     xmm3, xmm2
        punpckldq  xmm0, xmm0       ; 02020202020202020101010101010101
        punpckhdq  xmm1, xmm1       ; 08080808080808080404040404040404
        punpckldq  xmm2, xmm2       ; 20202020202020201010101010101010
        punpckhdq  xmm3, xmm3       ; 80808080808080804040404040404040
        pand       xmm0, [edx]      ; mask the bits
        pand       xmm1, [edx]
        pand       xmm2, [edx]
        pand       xmm3, [edx]
        pcmpeqb    xmm0, [edx]      ; extend bits to bytes
        pcmpeqb    xmm1, [edx]
        pcmpeqb    xmm2, [edx]
        pcmpeqb    xmm3, [edx]
        pand       xmm0, [eax+0*16] ; multiply by "and" with -1 or 0
        pand       xmm1, [eax+1*16]
        pand       xmm2, [eax+2*16]
        pand       xmm3, [eax+3*16]
        paddusb    xmm0, xmm1       ; add all bytes (with saturation)
        paddusb    xmm2, xmm3       ; saturation is used to avoid "accidental" overflows
        paddusb    xmm0, xmm2       ; eg. 64+64+64+64
        psadbw     xmm0, [edx+16]   ; horizontal add 2 * 8 byte
        pextrw     edx, xmm0, 4     ; extract both intermediate sums to gp
        pextrw     eax, xmm0, 0
        add        eax, edx         ; final add
    }
    // no return statement: the result is left in eax (MSVC inline asm convention)
}
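For readers without an inline assembler at hand, the same sequence can be sketched with SSE2 intrinsics. This is only an approximation of the routine above, not a tested replacement: the names dotProductSse2 and weightedPopCountRef are hypothetical, weights[i] is assumed to belong to bit i of the bitboard, each weight is assumed to be at most 0x3f, and unaligned loads are used so the weight table needs no special alignment. A scalar reference is included to state what is being computed.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Scalar reference: sum of weights[i] over all set bits i of bb.
inline int weightedPopCountRef(std::uint64_t bb, const std::uint8_t weights[64]) {
    int sum = 0;
    for (int i = 0; i < 64; ++i)
        if ((bb >> i) & 1) sum += weights[i];
    return sum;
}

// SSE2 sketch following the same steps as the inline asm (assumes weights <= 0x3f).
inline int dotProductSse2(std::uint64_t bb, const std::uint8_t weights[64]) {
    // Byte j of each 64-bit half selects bit j of the byte it is ANDed with.
    const __m128i bits =
        _mm_set1_epi64x(static_cast<long long>(0x8040201008040201ULL));
    const __m128i* w = reinterpret_cast<const __m128i*>(weights);

    // Replicate every byte of bb eight times: 64 bytes in four registers.
    __m128i x  = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&bb));
    x          = _mm_unpacklo_epi8(x, x);
    __m128i lo = _mm_unpacklo_epi16(x, x);    // bytes 0..3 of bb, 4 copies each
    __m128i hi = _mm_unpackhi_epi16(x, x);    // bytes 4..7 of bb, 4 copies each
    __m128i x0 = _mm_unpacklo_epi32(lo, lo);  // bytes 0 and 1, 8 copies each
    __m128i x1 = _mm_unpackhi_epi32(lo, lo);  // bytes 2 and 3
    __m128i x2 = _mm_unpacklo_epi32(hi, hi);  // bytes 4 and 5
    __m128i x3 = _mm_unpackhi_epi32(hi, hi);  // bytes 6 and 7

    // Mask one bit per byte, then widen it to 0xFF (bit set) or 0x00 (clear).
    x0 = _mm_cmpeq_epi8(_mm_and_si128(x0, bits), bits);
    x1 = _mm_cmpeq_epi8(_mm_and_si128(x1, bits), bits);
    x2 = _mm_cmpeq_epi8(_mm_and_si128(x2, bits), bits);
    x3 = _mm_cmpeq_epi8(_mm_and_si128(x3, bits), bits);

    // "Multiply" by ANDing the 0xFF/0x00 masks with the weights.
    x0 = _mm_and_si128(x0, _mm_loadu_si128(w + 0));
    x1 = _mm_and_si128(x1, _mm_loadu_si128(w + 1));
    x2 = _mm_and_si128(x2, _mm_loadu_si128(w + 2));
    x3 = _mm_and_si128(x3, _mm_loadu_si128(w + 3));

    // Vertical saturating byte adds, then horizontal add of 2 * 8 bytes.
    __m128i sum = _mm_adds_epu8(_mm_adds_epu8(x0, x1), _mm_adds_epu8(x2, x3));
    sum = _mm_sad_epu8(sum, _mm_setzero_si128());
    return _mm_cvtsi128_si32(sum) + _mm_extract_epi16(sum, 4);
}
```

With weights limited to 0x3f, four of them summed in one byte lane stay at or below 252, so the saturating adds never actually clamp and both functions should agree for every bitboard.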
It would be interesting to see how the Intel compiler would generate such code.
Cheers,
Gerd
Last modified: Thu, 15 Apr 21 08:11:13 -0700