Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Bitboard Serialization

Author: Gerd Isenberg

Date: 14:10:30 12/06/02

Go up one level in this thread


On December 06, 2002 at 16:34:09, Anthony Cozzie wrote:

>On December 06, 2002 at 14:04:07, Gerd Isenberg wrote:
>
>>Hi All,
>>
>>Some brainstorming, inspired by the recent postings of Miguel A. Ballicora, to
>>serialize a bitboard on block, producing a terminated string of target squares.
>>
>>http://www.talkchess.com/forums/1/message.html?268658
>>
>>Another point is the kind of traversing bitboards during move generation,
>>already discussed recently by the bitboard connection. The nested while loops
>>over piece sets in the outer and attacking target squares in the inner loop.
>>
>>With Floodfill/Kogge-Stone Attack generation the attacked getter routines need a
>>bitboard (isolated by b&-b, due i need disjoint target sets, even if there are
>>possibly smarter approaches by Steffan Westcott) as parameter, rather than a
>>square index as integer. Is it necessary to do bitscans at all?
>>
>>As Sune and Russell already mentioned, one need to address the zobrist and
>>history tables. If one don't use from/to square coordinates, one need a hell
>>fast hash function (at least faster than two bitscans) that map two isolated
>>bitboards (difference?) to a quite short value range, in the manner of Walter
>>Faxon's magig LSB_64.
>>
>>http://www.talkchess.com/forums/1/message.html?268531
>>
>>I tried that mod5811, but two parallized mmx-bitscans are clearly faster.
>>
>>Miguel's algorithm already acts as a kind of move-generator, if we provide some
>>approriate set of target squares and some preininialized move structure template
>>(most lokely 32 bits with 6 empty bits in a row). But Miguel needs two nested
>>file-rank loops, a lot of conditional stuff. The runtime behaviour may be
>>dependent on the bit positions.
>>
>>--
>>
>>How about this MMX,3DNow! (Athlon only) approach, which is able to do a lot in
>>parallel and avoids most, if not all conditional jumps. The drawback are
>>possible multiple writes to the same address.
>>
>>What is more expensive, a conditional jump with the risk of branch misprediction
>>or a additional write with the same index during writing an array?
>>
>>
>>Following steps:
>>--
>>0. some register initialations
>>        // mm0 is the bitboards to serialize
>>	// mm7 5f:7f constant to subtract bit0-offset
>>        // ecx move list int32-pointer
>>
>>1. Isolating the bits in both 32-bit dwords.
>>        // mm0 is the bitboards to serialize
>>doNscans:
>>	pxor	mm1, mm1	; 0
>>	psubd	mm1, mm0	; -bb 32bitwise
>>	pand	mm1, mm0	; bb & -bb
>>	pxor	mm0, mm1	; reset isolated bits
>>
>>2. Two parallel bitscans with base two logarithms via PI2FD.
>>   If no bit is left anymore in one dword we produce
>>   a negative result there, where most upper bits set.
>>
>>	PI2FD	mm1, mm1	; 0, 3f8..,400..
>>	pslld	mm1, 1		; clear sign bit
>>	psrld	mm1, 24		; shift 3f8... to 7f (bit 0)
>>	psubd	mm1, mm7	; - 5f:7f, considers +32 in highword
>>
>>3. Both dwords in mm1 contain signed square coordinates if >= 0 each.
>>   Both dwords are moved to memory via register pointer ecx unconditionally.
>>   But the pointer is incremented by (sq >= 0) rather than one.
>>
>>	PMOVMSKB eax,mm1	; get sign bits 123,567 !vector path instr.!
>>        // time to (p)or some other information (source square) into mm1
>>	not	al		; complement sign bits
>>	mov	edx, eax
>>	and	al, 0x04	; 0x04*(ldword >= 0) considers sizeof(int32)
>>	and	dl, 0x40	; 0x40*(hdword >= 0)
>>	movd	[ecx], mm1	; write ldword
>>	add	ecx, eax	; moves += (ldword >= 0);
>>	punpckhdq mm1, mm1	; hdword, like >>= 32
>>	shr	edx, 4		; 0x04*(hdword >= 0) considers sizeof(int32)
>>	movd	[ecx], mm1	; write hdword
>>	add	ecx, edx	; moves += (hdword >= 0);
>>
>>4. [optional performance step - loop unrolling]
>>   To gain more parallelism und and to solve the dependency chains,
>>   repeat the above sequence one to three times (2-4 in total),
>>   with mm2...mm5 instead of mm1, and do some clever instruction mixing.
>>
>>5. If any bits left in mm0, goto 1.)
>>
>>	pxor	mm1, mm1	; 0
>>	pcmpeqd	mm1, mm0	; == 0
>>	PMOVMSKB eax,mm1	; get dwords to nibble
>>	cmp	eax, 0xff	; all zero ?
>>	jne	doNscans
>>
>>This step is even not necessary in a specialized version of this routine for
>>rook and bishop captures, if step 4 containes 3 unrolled repetitions.
>>This test may also be done at the beginning of the loop.
>>
>>6. Implicite popcount (return (ecx - initial pointer)/sizeof(int) ), so no
>>terminator necessary.
>>
>>--
>>
>>There are two sources of parallelism. The first one the parallel SIMD processing
>>of two dwords per mmx-register. The performance decreases, if both 32bit dwords
>>have different popcounts. That implies even more dummy writes.
>>
>>The second one is the use of multiple mmx registers after resetting the previous
>>isolated one. But this sucks if the target sets are sparely populated. The
>>relative performance is obviosly depending on the populations count.
>>
>>Best case with four mmx-registers is finding eight squares per iteration and no
>>dummy write. Worst case (assume != 0) is only finding one square, but seven
>>dummy writes. No idea about the penaly of multiple writes to the same address.
>>Isn't it depending on cache strategy (write through, or dirty tag)?
>>
>>Due to parallel processing of low and high dwords, the generated target square
>>order may alternate between the two board halfs.
>>
>>Of course, i will try it in IsiChess. Bitboard parameter may be passed through
>>mmx-register from a previous mmx-attack getter. But in IsiChess during capture
>>generation, target sets are often rarely populated, due to generating captures
>>to one kind of piece per time. Have to think about...
>>
>>Cheers,
>>Gerd
>
>
>well, I had a similar idea and implemented it with the following code (based on
>Walter's algorithm):
>
>void add_to_move_buffer(bitboard a, int initial_square, zmove *mbuf, int
>*buff_pos)
>{
>  bitboard t1, t2;
>  zappa_u32 a32, b32, unroll = 0, cont = a.l[0] | a.l[1];
>  zmove m;
>
>  if(cont == 0)
>    return;
>
>  m.m.prom_to = 0;
>  m.m.in_loc = initial_square;
>
>  while(cont)
>  {
>    //copy space
>    mbuf[*buff_pos].i = m.i;
>    mbuf[*buff_pos + 1].i = m.i;
>
>    //pull off two leading ones (sequential)
>    t1.i = a.i - 1;
>    a.i &= t1.i;                      //remove high bit
>    t1.i ^= a.i;                      //isolate high bit
>    a32 = t1.l[1] ^ t1.l[0]; //first bit
>    unroll = a.l[1] | a.l[0];
>
>    t2.i = a.i - 1;
>    a.i &= t2.i;
>    t2.i ^= a.i;
>    b32 = t2.l[1] ^ t2.l[0]; //second bit
>    cont = a.l[1] | a.l[0];
>
>    //interleave folding and tb lookup
>    a32 ^= LSB_64_magic;
>    b32 ^= LSB_64_magic;
>    a32 += a32 >> 16;
>    b32 += b32 >> 16;
>    a32 -= a32 >> 8;
>    b32 -= b32 >> 8;
>    a32 = LSB_64_table[LSB_64_adj + (zappa_u8)a32];
>    b32 = LSB_64_table[LSB_64_adj + (zappa_u8)b32];
>
>    //write data
>    mbuf[*buff_pos].m.f_loc = a32;
>    mbuf[*buff_pos+1].m.f_loc = b32;
>    *buff_pos += 2;
>  }
>}
>
>This reduced zappa's NPS by about 5%.
>
>As far as I can tell, the problem is that the array is so sparse.  For example,
>a Knight will have maybe 3 or 4 moves in the middlegame, a King 2-3, etc.  Also,
>there is a good chance that they will all be in the same half of the bitboard,
>so I think that doing the upper and lower halves of the board in parallel will
>probably not make much of a difference.
>
>What I am going to try next (after I finish my 5 projects due by thursday) is
>doing two of these move generations in parallel, that is doing the
>bitboard->movelist for two knights in parallel, etc.
>
>anthony

Hi Anthony,

That's interesting, Walter's routine is really great.

Yes the problems are the sparely populated bitboards.

The mmx-stuff makes high- and low-board in parallel due to SIMD-instructions.
The Athlon(XP?) makes up to four register independent mmx-instruction
simultaniously. Therefor step 4.) is really required.

Another idea is to do some BxQ and BxR in parallel without any conditional
jumps, except an initial "if ( bb != 0 )", or may be not even that.

Similar SSE2-instructions are also available for P4 and Hammer of course.
But Athlon is about twice as fast as P4 with this independent mmx-stuff.

btw. maximum iteration count on a halfboard per piece:
            (captures)
Rook    10  (4)
Bishop   7  (4)
Queen   17  (8)
Knight   6  (6)
King     8  (8!)

Regards,
Gerd





This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.