Author: Anthony Cozzie
Date: 13:34:09 12/06/02
On December 06, 2002 at 14:04:07, Gerd Isenberg wrote:
>Hi All,
>
>Some brainstorming, inspired by the recent postings of Miguel A. Ballicora, on
>serializing a bitboard en bloc, producing a terminated string of target squares.
>
>http://www.talkchess.com/forums/1/message.html?268658
>
>Another point is the way bitboards are traversed during move generation,
>already discussed recently in connection with bitboards: nested while loops,
>over piece sets in the outer loop and attacked target squares in the inner one.
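>
>In C, that familiar pattern is roughly the following sketch (u64 an unsigned
>64-bit type; bitscan(), attacks_from() and emit_move() are hypothetical
>helpers):
>
>    u64 pieces = knights;               // outer set: pieces of one type
>    while (pieces) {
>        int from = bitscan(pieces);     // index of the lowest set bit
>        pieces &= pieces - 1;           // clear that bit
>        u64 targets = attacks_from(from) & ~own;
>        while (targets) {
>            int to = bitscan(targets);  // one more bitscan per move
>            targets &= targets - 1;
>            emit_move(from, to);
>        }
>    }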
>
>With floodfill/Kogge-Stone attack generation, the attack getter routines need a
>bitboard as parameter (isolated by b&-b, since I need disjoint target sets, even
>if there are possibly smarter approaches by Steffan Westcott), rather than a
>square index as integer. Is it necessary to do bitscans at all?
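>
>For illustration, such an index-free traversal might look like this in C
>(kogge_stone_attacks() is a hypothetical getter taking a one-bit piece set
>plus the empty squares):
>
>    u64 b = sliders;
>    while (b) {
>        u64 lsb = b & (0 - b);    // isolate the lowest set bit
>        b ^= lsb;                 // remove it from the set
>        u64 att = kogge_stone_attacks(lsb, empty);
>        // ... att can be used without ever turning lsb into an index
>    }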
>
>As Sune and Russell already mentioned, one needs to address the zobrist and
>history tables. If one doesn't use from/to square coordinates, one needs a
>hellishly fast hash function (at least faster than two bitscans) that maps two
>isolated bitboards (their difference?) to a quite short value range, in the
>manner of Walter Faxon's magic LSB_64.
>
>http://www.talkchess.com/forums/1/message.html?268531
>
>I tried that mod5811, but two parallelized mmx-bitscans are clearly faster.
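>
>For reference, one well-known branch-free way to map a single isolated bit to
>its square index is a de Bruijn multiplication; this is a different trick than
>Faxon's folding, sketched here only as a point of comparison:
>
>    static const int index64[64] = {
>         0,  1, 48,  2, 57, 49, 28,  3,
>        61, 58, 50, 42, 38, 29, 17,  4,
>        62, 55, 59, 36, 53, 51, 43, 22,
>        45, 39, 33, 30, 24, 18, 12,  5,
>        63, 47, 56, 27, 60, 41, 37, 16,
>        54, 35, 52, 21, 44, 32, 23, 11,
>        46, 26, 40, 15, 34, 20, 31, 10,
>        25, 14, 19,  9, 13,  8,  7,  6
>    };
>
>    int lsb_index(u64 isolated)   // isolated == b & -b, b != 0
>    {
>        return index64[(isolated * 0x03f79d71b4cb0a89ULL) >> 58];
>    }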
>
>Miguel's algorithm already acts as a kind of move generator, if we provide an
>appropriate set of target squares and a preinitialized move structure template
>(most likely 32 bits with 6 empty bits in a row). But Miguel needs two nested
>file-rank loops and a lot of conditional code. The runtime behaviour may
>depend on the bit positions.
>
>--
>
>How about this MMX/3DNow! (Athlon only) approach, which is able to do a lot in
>parallel and avoids most, if not all, conditional jumps. The drawback is the
>possibility of multiple writes to the same address.
>
>Which is more expensive, a conditional jump with the risk of branch
>misprediction, or an additional write to the same index while filling an array?
>
>
>The following steps:
>--
>0. Some register initializations
> // mm0 is the bitboards to serialize
> // mm7 5f:7f constant to subtract bit0-offset
> // ecx move list int32-pointer
>
>1. Isolate the lowest set bit in each of the two 32-bit dwords.
> // mm0 is the bitboards to serialize
>doNscans:
> pxor mm1, mm1 ; 0
> psubd mm1, mm0 ; -bb 32bitwise
> pand mm1, mm0 ; bb & -bb
> pxor mm0, mm1 ; reset isolated bits
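>
>In C terms, this step is just, per 32-bit half:
>
>    u32 iso = bb & (0 - bb);   // lowest set bit, or 0 if bb == 0
>    bb ^= iso;                 // remove it from the working set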
>
>2. Two parallel bitscans via base-two logarithms, using PI2FD.
> If no bit is left in one dword, we produce a negative result
> there, with the most significant bits set.
>
> PI2FD mm1, mm1 ; 0, 3f8..,400..
> pslld mm1, 1 ; clear sign bit
> psrld mm1, 24 ; shift 3f8... to 7f (bit 0)
> psubd mm1, mm7 ; - 5f:7f, considers +32 in highword
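>
>A C model of the same exponent trick, assuming IEEE-754 single precision
>(one 32-bit half shown; the 5f constant for the high half folds the +32
>square offset into the bias subtraction):
>
>    union { float f; u32 u; } v;
>    v.f = (float)iso;                        // like PI2FD: 2^k -> exponent 127+k
>    int sq = (int)((v.u << 1) >> 24) - 127;  // k, or negative if iso == 0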
>
>3. Both dwords in mm1 now contain signed values, which are valid square
> coordinates where >= 0. Both dwords are written to memory via register
> pointer ecx unconditionally, but the pointer is advanced by one int32
> only where the square is valid (sq >= 0).
>
> PMOVMSKB eax,mm1 ; get sign bits 123,567 !vector path instr.!
> // time to (p)or some other information (source square) into mm1
> not al ; complement sign bits
> mov edx, eax
> and al, 0x04 ; 0x04*(ldword >= 0) considers sizeof(int32)
> and dl, 0x40 ; 0x40*(hdword >= 0)
> movd [ecx], mm1 ; write ldword
> add ecx, eax ; moves += (ldword >= 0);
> punpckhdq mm1, mm1 ; hdword, like >>= 32
> shr edx, 4 ; 0x04*(hdword >= 0) considers sizeof(int32)
> movd [ecx], mm1 ; write hdword
> add ecx, edx ; moves += (hdword >= 0);
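>
>In C, the "store always, advance conditionally" idea of this step (together
>with the implicit count of step 6 below) can be modelled as:
>
>    int emit_pair(int *out0, int sq_lo, int sq_hi)
>    {
>        int *out = out0;
>        *out = sq_lo;              // unconditional store...
>        out += (sq_lo >= 0);       // ...kept only if the square was valid
>        *out = sq_hi;
>        out += (sq_hi >= 0);
>        return (int)(out - out0);  // squares actually emitted
>    }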
>
>4. [optional performance step - loop unrolling]
> To gain more parallelism and to break the dependency chains,
> repeat the above sequence one to three times (2-4 in total),
> with mm2...mm5 instead of mm1, and do some clever instruction mixing.
>
>5. If any bits are left in mm0, go back to step 1.
>
> pxor mm1, mm1 ; 0
> pcmpeqd mm1, mm0 ; == 0
> PMOVMSKB eax,mm1 ; byte sign mask, one nibble per dword
> cmp eax, 0xff ; all zero ?
> jne doNscans
>
>This step is not even necessary in a version of this routine specialized for
>rook and bishop captures, if step 4 contains 3 unrolled repetitions.
>This test may also be done at the beginning of the loop.
>
>6. Implicit popcount (return (ecx - initial pointer)/sizeof(int)), so no
>terminator is necessary.
>
>--
>
>There are two sources of parallelism. The first is the parallel SIMD processing
>of two dwords per mmx-register. Performance decreases if the two 32-bit dwords
>have different popcounts, since that implies even more dummy writes.
>
>The second is the use of multiple mmx-registers after resetting the previously
>isolated bits. But this suffers if the target sets are sparsely populated. The
>relative performance obviously depends on the population count.
>
>The best case with four mmx-registers is finding eight squares per iteration
>and no dummy write. The worst case (assuming != 0) is finding only one square,
>but doing seven dummy writes. I have no idea about the penalty of multiple
>writes to the same address. Doesn't it depend on the cache strategy
>(write-through, or write-back with dirty tags)?
>
>Due to the parallel processing of low and high dwords, the generated target
>square order may alternate between the two board halves.
>
>Of course, I will try it in IsiChess. The bitboard parameter may be passed in
>an mmx-register from a previous mmx attack getter. But in IsiChess, during
>capture generation, target sets are often sparsely populated, due to generating
>captures of one kind of piece at a time. I have to think about it...
>
>Cheers,
>Gerd
Well, I had a similar idea and implemented it with the following code (based on
Walter's algorithm):
void add_to_move_buffer(bitboard a, int initial_square, zmove *mbuf,
                        int *buff_pos)
{
    bitboard t1, t2;
    zappa_u32 a32, b32, unroll = 0, cont = a.l[0] | a.l[1];
    zmove m;
    if (cont == 0)
        return;
    m.m.prom_to = 0;
    m.m.in_loc = initial_square;
    while (cont)
    {
        // copy the move template into the next two slots
        mbuf[*buff_pos].i = m.i;
        mbuf[*buff_pos + 1].i = m.i;
        // pull off two trailing (lowest) ones, sequentially
        t1.i = a.i - 1;
        a.i &= t1.i;               // clear the lowest set bit
        t1.i ^= a.i;               // mask of the bits below it
        a32 = t1.l[1] ^ t1.l[0];   // fold to 32 bits: first bit
        unroll = a.l[1] | a.l[0];  // computed but never read
        t2.i = a.i - 1;
        a.i &= t2.i;
        t2.i ^= a.i;
        b32 = t2.l[1] ^ t2.l[0];   // second bit
        cont = a.l[1] | a.l[0];
        // interleave Faxon-style folding and table lookup
        a32 ^= LSB_64_magic;
        b32 ^= LSB_64_magic;
        a32 += a32 >> 16;
        b32 += b32 >> 16;
        a32 -= a32 >> 8;
        b32 -= b32 >> 8;
        a32 = LSB_64_table[LSB_64_adj + (zappa_u8)a32];
        b32 = LSB_64_table[LSB_64_adj + (zappa_u8)b32];
        // write data; with an odd popcount the second slot holds
        // a dummy entry, yet *buff_pos still advances by 2
        mbuf[*buff_pos].m.f_loc = a32;
        mbuf[*buff_pos + 1].m.f_loc = b32;
        *buff_pos += 2;
    }
}
This reduced Zappa's NPS by about 5%.

As far as I can tell, the problem is that the bitboard is so sparse. For
example, a knight will have maybe 3 or 4 moves in the middlegame, a king 2-3,
etc. Also, there is a good chance that they will all be in the same half of the
bitboard, so I think that doing the upper and lower halves of the board in
parallel will probably not make much of a difference.
What I am going to try next (after I finish my 5 projects due by Thursday) is
doing two of these move generations in parallel, that is, doing the
bitboard->movelist conversion for two knights at once, etc.
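A rough sketch of what that might look like (lsb_square() stands in for the
folding and table lookup above and must tolerate a zero argument; the from/to
packing here is made up):

typedef unsigned long long u64;

extern int lsb_square(u64 isolated);   // hypothetical: folding + lookup

void serialize_two(u64 a, int from_a, u64 b, int from_b,
                   int *moves, int *count)
{
    int *out = moves + *count;
    while (a | b) {
        u64 ia = a & (0 - a);    // isolated bits (0 if a set is empty)
        u64 ib = b & (0 - b);
        a ^= ia;
        b ^= ib;
        *out = (from_a << 6) | lsb_square(ia);  // dummy if ia == 0...
        out += (ia != 0);                       // ...then not kept
        *out = (from_b << 6) | lsb_square(ib);
        out += (ib != 0);
    }
    *count = (int)(out - moves);
}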
anthony