Author: Gerd Isenberg
Date: 15:33:36 07/23/04
Is your getNextMove O(N**2)?
Some chess-programs assign score values to generated moves, based on SEE,
history, piece square tables and other heuristics. In upper search regions, far
from the leaves, where move sorting is not a hot spot, one may even use a more
sophisticated scoring for best branching factor.
If a program generates all captures (in quiescense) and all moves near the
leaves and scores it, the getNextMove-routine usually traverses the remaining
moves, looking for the one with the best score.
Even if generating and scoring is more expensive, than picking up the moves in
best order, there are even implementations sorting moves by qsort or similar, to
avoid further overhead. Since early cutoffs may occur, that seems not to be a
good idea, copying moves around for nothing. But at some frontier all-nodes,
comparing scores during getNextMove, one may notice the O(N**2) behaviour due to
N/2*(N-1) compares with some memory traffic. "Random" score compares may lead to
some miss-predictions as well, oups there is a better one.
OTOH there are all-Nodes with only "bad" moves, where sorting doesn't matter at
all. And there are approaches to generate moves in stages, e.g. hashmove and
winning captures and checks...
Anyway, inspired by the recent max-discussion and my SIMD obsession i like to
share some brain storming, specially intended for fast bean counters, where
profiling indicates a few percent in picking up the best move from a sorted move
list ;-)
------------------------------------------------------------------------------
SIMD to the max or how to get getNextMove O(N) ?
Simply by a "brute force" SSE2-SIMD approach with k*N*64 for N <= 64 compares
allways, instead of K*N/2*(N-1), but with k << K.
Is that really worth?
If you prefer portability and no assembly or intrinsics at all - no, stop
reading! If you plan to write a dedicated program for architectures with vector
instructions, like SSE2, AltiVec and Itanium - probably yes, at least some fun.
Following routine (P4,AMD64) finds the maximum signed 16-bit number (short int)
in an array of 64, rather fast. Thanks to the PMAXSW - Packed Maximum Signed
Words - instruction. Something like 9.5ns best case (21 cycles or a few more) on
my Athlon64 2.2GHz.
int getMaxOf64(short int s64[] /* aligned 16 */ )
{
__asm
{
mov eax, [s64] ; or rax in 64-bit mode
movdqa xmm1, [eax+0*16] ; short 0- 7
movdqa xmm0, [eax+1*16] ; short 8-15
pmaxsw xmm1, [eax+2*16] ; short 16-23
pmaxsw xmm0, [eax+3*16] ; short 24-31
pmaxsw xmm1, [eax+4*16] ; short 32-39
pmaxsw xmm0, [eax+5*16] ; short 40-47
pmaxsw xmm1, [eax+6*16] ; short 48-55
pmaxsw xmm0, [eax+7*16] ; short 56-63
pmaxsw xmm0, xmm1
; horizontal max of all eight words
movdqa xmm1, xmm0
psrldq xmm0, 2 ; one word right
pmaxsw xmm0, xmm1
movdqa xmm1, xmm0
psrldq xmm0, 4 ; two words right
pmaxsw xmm0, xmm1
movdqa xmm1, xmm0
psrldq xmm0, 8 ; four words right
pmaxsw xmm0, xmm1
pextrw eax, xmm0, 0 ; extract the maximum word
cwde ; sign extend 16 to 32 bits
}
}
What about the idea, to get the best index from up to 64 generated and scored
(SEE, history, psqt...) moves. The move list of this approach consists of two
arrays, allocated on the heap or global ones. One contain the moves (16..128bit
per move, depending on the implementation), the other the associated move-score
and bookholding information (signed 16 bit).
Each node contains an at least 16-bit aligned start "nextScore" pointer or an
appropriate index inside the move list, to index all generated move-scores and
associated moves of this node.
The upper ten bits are the signed move score with a range of -512..+511, since
the six remaining lower bits contain the index itself. Therefore, with two equal
scores the one with the "higher" index becoms best.
Each time after moves are generated, the score array gap to 64 is preinitialized
with 0x8000, taking care whether 64 moves are enough, and eventually
"allocating" a bigger range. Thus 0x8000 == Int16min, indicates no generated
move or move is already made.
If the max function leaves a best score greater Int16min, the correspondent
local index (0..63) of the move inside the move list is available in the lower 6
bits of the best score.
To get the idea, see following code fragments,
which hides the implementation as well:
while ( (move = getNextMove() ).isMove() )
{
makeMove(move);
....
}
with
TMOVE getNextMove()
{
if ( hasMoreThan64 )
{ // special handling very seldom
return getNextMoveMoreThan64();
}
// default case <= 64 moves most often predicted correctly
int idx = getMaxOf64(&moveScores[nextScore]);
if ( idx > -512*64 )
{ // move found with max score
moveScores[nextScore + (idx&63)] = -512*64; // tag as done
return moveList[nextMove + (idx&63)];
}
return nilMove;
}
As you can see, there is no bookholding overhead, other than storing int16min
inside it's score array element. No further move swapping is necessary.
Cheers,
Gerd
This page took 0.01 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.