Author: Gerd Isenberg
Date: 11:04:07 12/06/02
Hi All,
Some brainstorming, inspired by the recent postings of Miguel A. Ballicora, to
serialize a bitboard on block, producing a terminated string of target squares.
http://www.talkchess.com/forums/1/message.html?268658
Another point is the kind of traversing bitboards during move generation,
already discussed recently by the bitboard connection. The nested while loops
over piece sets in the outer and attacking target squares in the inner loop.
With Floodfill/Kogge-Stone Attack generation the attacked getter routines need a
bitboard (isolated by b&-b, due i need disjoint target sets, even if there are
possibly smarter approaches by Steffan Westcott) as parameter, rather than a
square index as integer. Is it necessary to do bitscans at all?
As Sune and Russell already mentioned, one need to address the zobrist and
history tables. If one don't use from/to square coordinates, one need a hell
fast hash function (at least faster than two bitscans) that map two isolated
bitboards (difference?) to a quite short value range, in the manner of Walter
Faxon's magig LSB_64.
http://www.talkchess.com/forums/1/message.html?268531
I tried that mod5811, but two parallized mmx-bitscans are clearly faster.
Miguel's algorithm already acts as a kind of move-generator, if we provide some
approriate set of target squares and some preininialized move structure template
(most lokely 32 bits with 6 empty bits in a row). But Miguel needs two nested
file-rank loops, a lot of conditional stuff. The runtime behaviour may be
dependent on the bit positions.
--
How about this MMX,3DNow! (Athlon only) approach, which is able to do a lot in
parallel and avoids most, if not all conditional jumps. The drawback are
possible multiple writes to the same address.
What is more expensive, a conditional jump with the risk of branch misprediction
or a additional write with the same index during writing an array?
Following steps:
--
0. some register initialations
// mm0 is the bitboards to serialize
// mm7 5f:7f constant to subtract bit0-offset
// ecx move list int32-pointer
1. Isolating the bits in both 32-bit dwords.
// mm0 is the bitboards to serialize
doNscans:
pxor mm1, mm1 ; 0
psubd mm1, mm0 ; -bb 32bitwise
pand mm1, mm0 ; bb & -bb
pxor mm0, mm1 ; reset isolated bits
2. Two parallel bitscans with base two logarithms via PI2FD.
If no bit is left anymore in one dword we produce
a negative result there, where most upper bits set.
PI2FD mm1, mm1 ; 0, 3f8..,400..
pslld mm1, 1 ; clear sign bit
psrld mm1, 24 ; shift 3f8... to 7f (bit 0)
psubd mm1, mm7 ; - 5f:7f, considers +32 in highword
3. Both dwords in mm1 contain signed square coordinates if >= 0 each.
Both dwords are moved to memory via register pointer ecx unconditionally.
But the pointer is incremented by (sq >= 0) rather than one.
PMOVMSKB eax,mm1 ; get sign bits 123,567 !vector path instr.!
// time to (p)or some other information (source square) into mm1
not al ; complement sign bits
mov edx, eax
and al, 0x04 ; 0x04*(ldword >= 0) considers sizeof(int32)
and dl, 0x40 ; 0x40*(hdword >= 0)
movd [ecx], mm1 ; write ldword
add ecx, eax ; moves += (ldword >= 0);
punpckhdq mm1, mm1 ; hdword, like >>= 32
shr edx, 4 ; 0x04*(hdword >= 0) considers sizeof(int32)
movd [ecx], mm1 ; write hdword
add ecx, edx ; moves += (hdword >= 0);
4. [optional performance step - loop unrolling]
To gain more parallelism und and to solve the dependency chains,
repeat the above sequence one to three times (2-4 in total),
with mm2...mm5 instead of mm1, and do some clever instruction mixing.
5. If any bits left in mm0, goto 1.)
pxor mm1, mm1 ; 0
pcmpeqd mm1, mm0 ; == 0
PMOVMSKB eax,mm1 ; get dwords to nibble
cmp eax, 0xff ; all zero ?
jne doNscans
This step is even not necessary in a specialized version of this routine for
rook and bishop captures, if step 4 containes 3 unrolled repetitions.
This test may also be done at the beginning of the loop.
6. Implicite popcount (return (ecx - initial pointer)/sizeof(int) ), so no
terminator necessary.
--
There are two sources of parallelism. The first one the parallel SIMD processing
of two dwords per mmx-register. The performance decreases, if both 32bit dwords
have different popcounts. That implies even more dummy writes.
The second one is the use of multiple mmx registers after resetting the previous
isolated one. But this sucks if the target sets are sparely populated. The
relative performance is obviosly depending on the populations count.
Best case with four mmx-registers is finding eight squares per iteration and no
dummy write. Worst case (assume != 0) is only finding one square, but seven
dummy writes. No idea about the penaly of multiple writes to the same address.
Isn't it depending on cache strategy (write through, or dirty tag)?
Due to parallel processing of low and high dwords, the generated target square
order may alternate between the two board halfs.
Of course, i will try it in IsiChess. Bitboard parameter may be passed through
mmx-register from a previous mmx-attack getter. But in IsiChess during capture
generation, target sets are often rarely populated, due to generating captures
to one kind of piece per time. Have to think about...
Cheers,
Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.