Computer Chess Club Archives



Subject: Re: Bitboard Serialization

Author: Anthony Cozzie

Date: 13:34:09 12/06/02



On December 06, 2002 at 14:04:07, Gerd Isenberg wrote:

>Hi All,
>
>Some brainstorming, inspired by the recent postings of Miguel A. Ballicora, about
>serializing a bitboard in one block, producing a terminated string of target squares.
>
>http://www.talkchess.com/forums/1/message.html?268658
>
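For reference, the plain scalar loop this is meant to replace looks something like
this in C; the naive bitscan() here is just a stand-in for whatever LSB routine the
engine actually uses:

#include <stdint.h>

/* naive stand-in; a real engine uses BSF, a de Bruijn multiply, etc. */
static int bitscan(uint64_t isolated)
{
    int i = 0;
    while (isolated >>= 1) i++;
    return i;
}

/* serialize bb into a terminated list of target squares */
static int serialize(uint64_t bb, int squares[])
{
    int n = 0;
    while (bb) {
        squares[n++] = bitscan(bb & -bb);  /* index of the lowest set bit */
        bb &= bb - 1;                      /* clear the lowest set bit */
    }
    squares[n] = -1;                       /* terminator */
    return n;
}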
>Another point is the way bitboards are traversed during move generation, as
>discussed recently in the bitboard threads: the nested while loops, with piece
>sets in the outer loop and attacked target squares in the inner loop.
>
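That is, in C, roughly the familiar pattern below (knights, knight_attacks[], own
and add_move are hypothetical engine names, bitscan() as above):

uint64_t pieces = knights & own;
while (pieces) {                               /* outer loop: piece set */
    int from = bitscan(pieces & -pieces);      /* one knight */
    uint64_t targets = knight_attacks[from] & ~own;
    while (targets) {                          /* inner loop: its targets */
        int to = bitscan(targets & -targets);
        add_move(from, to);
        targets &= targets - 1;
    }
    pieces &= pieces - 1;
}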
>With flood-fill/Kogge-Stone attack generation, the attack getter routines need a
>bitboard as a parameter (isolated by b&-b, since I need disjoint target sets,
>even if there are possibly smarter approaches by Steffan Westcott), rather than
>a square index as an integer. Is it necessary to do bitscans at all?
>
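For readers who haven't seen that family of routines: the standard Kogge-Stone
south fill is sketched below. The generator can be a whole piece set or a single
isolated bit, which is why the getter naturally takes a bitboard rather than a
square index:

#include <stdint.h>

/* Kogge-Stone occluded fill, south direction: gen = sliding pieces,
   pro = empty squares; three rounds instead of up to seven single steps */
uint64_t south_occluded(uint64_t gen, uint64_t pro)
{
    gen |= pro & (gen >>  8);
    pro &=        pro >>  8;
    gen |= pro & (gen >> 16);
    pro &=        pro >> 16;
    gen |= pro & (gen >> 32);
    return gen;
}

/* attacked squares: shift the fill one more step onto the blockers */
uint64_t south_attacks(uint64_t sliders, uint64_t empty)
{
    return south_occluded(sliders, empty) >> 8;
}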
>As Sune and Russell already mentioned, one needs to address the zobrist and
>history tables. If one doesn't use from/to square coordinates, one needs a
>hellishly fast hash function (at least faster than two bitscans) that maps two
>isolated bitboards (their difference?) to a quite short value range, in the
>manner of Walter Faxon's magic LSB_64.
>
>http://www.talkchess.com/forums/1/message.html?268531
>
>I tried that mod5811 approach, but two parallelized MMX bitscans are clearly faster.
>
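Another perfect hash in the same family is the de Bruijn multiplication, which
maps an isolated bit into 0..63 with one multiply and one shift. A sketch, with
the table filled at startup so the constant is the only magic:

#include <stdint.h>

static int index64[64];
static const uint64_t debruijn64 = 0x03f79d71b4cb0a89ULL;

/* the top 6 bits of bit * debruijn64 are distinct for all 64 isolated bits */
void init_bitscan(void)
{
    for (int sq = 0; sq < 64; sq++)
        index64[(debruijn64 << sq) >> 58] = sq;
}

/* b must be an isolated bit, i.e. b == x & -x for some nonzero x */
int debruijn_bitscan(uint64_t b)
{
    return index64[(b * debruijn64) >> 58];
}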
>Miguel's algorithm already acts as a kind of move generator, if we provide an
>appropriate set of target squares and a preinitialized move structure template
>(most likely 32 bits with 6 empty bits in a row). But Miguel needs two nested
>file-rank loops and a lot of conditional code. The runtime behaviour may depend
>on the bit positions.
>
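For illustration, such a template could be a 32-bit move with the 6-bit target
field left zero, so serialization only has to OR the square in. The layout below
is hypothetical (and bitfield order is compiler-dependent; shown for the usual
little-endian convention):

#include <stdint.h>

/* hypothetical 32-bit move with the to-square gap in the low 6 bits */
typedef union {
    struct {
        uint32_t to    : 6;   /* left empty in the template */
        uint32_t from  : 6;
        uint32_t piece : 4;
        uint32_t capt  : 4;
        uint32_t prom  : 4;
        uint32_t flags : 8;
    } m;
    uint32_t i;
} move32;

/* tmpl.i has from/piece/flags preset once per piece, to == 0;
   per serialized square: move.i = tmpl.i | to_square */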
>--
>
>How about this MMX/3DNow! (Athlon only) approach, which is able to do a lot in
>parallel and avoids most, if not all, conditional jumps? The drawback is
>possible multiple writes to the same address.
>
>What is more expensive: a conditional jump with the risk of branch
>misprediction, or an additional write to the same index while filling an array?
>
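In C terms, the two alternatives being weighed are:

/* branchy: one store, but a jump that can mispredict */
if (sq >= 0)
    *list++ = move;

/* branchless: always store, advance only on a valid square;
   an invalid entry is simply overwritten by the next store */
*list = move;
list += (sq >= 0);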
>
>The steps are as follows:
>--
>0. Some register initializations
>        // mm0 is the bitboards to serialize
>	// mm7 5f:7f constant to subtract bit0-offset
>        // ecx move list int32-pointer
>
>1. Isolate the lowest set bit in each 32-bit dword.
>        // mm0 is the bitboards to serialize
>doNscans:
>	pxor	mm1, mm1	; 0
>	psubd	mm1, mm0	; -bb 32bitwise
>	pand	mm1, mm0	; bb & -bb
>	pxor	mm0, mm1	; reset isolated bits
>
>2. Two parallel bitscans via base-two logarithms with PI2FD.
>   If no bit is left in one dword, we produce a negative
>   result there, with the uppermost bits set.
>
>	PI2FD	mm1, mm1	; 0, 3f8..,400..
>	pslld	mm1, 1		; clear sign bit
>	psrld	mm1, 24		; shift 3f8... to 7f (bit 0)
>	psubd	mm1, mm7	; - 5f:7f, considers +32 in highword
>
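The same trick in C, for reference: an isolated bit converted to an IEEE-754
single carries its index in the exponent field, which is what PI2FD exploits
here:

#include <stdint.h>
#include <string.h>

/* bit index of an isolated 32-bit bit via the float exponent;
   returns -127 for input 0, i.e. a negative "no bit" marker */
int float_bitscan(uint32_t b)
{
    float f = (float)b;
    uint32_t u;
    memcpy(&u, &f, sizeof u);     /* reinterpret the float's bits */
    return (int)(u >> 23) - 127;  /* exponent field minus the bias */
}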
>3. Both dwords in mm1 now contain signed square coordinates, valid if >= 0.
>   Both dwords are moved to memory via the register pointer ecx unconditionally,
>   but the pointer is incremented by (sq >= 0) rather than by one.
>
>	PMOVMSKB eax,mm1	; get sign bits 123,567 !vector path instr.!
>        // time to (p)or some other information (source square) into mm1
>	not	al		; complement sign bits
>	mov	edx, eax
>	and	al, 0x04	; 0x04*(ldword >= 0) considers sizeof(int32)
>	and	dl, 0x40	; 0x40*(hdword >= 0)
>	movd	[ecx], mm1	; write ldword
>	add	ecx, eax	; moves += (ldword >= 0);
>	punpckhdq mm1, mm1	; hdword, like >>= 32
>	shr	edx, 4		; 0x04*(hdword >= 0) considers sizeof(int32)
>	movd	[ecx], mm1	; write hdword
>	add	ecx, edx	; moves += (hdword >= 0);
>
>4. [optional performance step - loop unrolling]
>   To gain more parallelism and to break the dependency chains,
>   repeat the above sequence one to three times (2-4 in total),
>   with mm2...mm5 instead of mm1, and do some clever instruction mixing.
>
>5. If any bits are left in mm0, goto 1.
>
>	pxor	mm1, mm1	; 0
>	pcmpeqd	mm1, mm0	; == 0
>	PMOVMSKB eax,mm1	; get dwords to nibble
>	cmp	eax, 0xff	; all zero ?
>	jne	doNscans
>
>This step is not even necessary in a specialized version of this routine for
>rook and bishop captures, if step 4 contains 3 unrolled repetitions.
>This test may also be done at the beginning of the loop.
>
>6. Implicit popcount (return (ecx - initial pointer)/sizeof(int)), so no
>terminator is necessary.
>
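Putting steps 1-6 together, a scalar C model of the two-lane control flow might
look like the sketch below (tmpl is the preinitialized move template,
float_bitscan() as sketched under step 2; the MMX version runs the two lanes in
lockstep rather than sequentially):

#include <stdint.h>

/* isolate, scan, store unconditionally, advance only on a hit,
   count implicitly via the pointer difference */
int serialize_branchless(uint32_t lo, uint32_t hi, int32_t *moves, int32_t tmpl)
{
    int32_t *start = moves;
    while (lo | hi) {
        uint32_t bl = lo & -lo;           /* step 1: isolate per dword */
        uint32_t bh = hi & -hi;
        lo ^= bl;  hi ^= bh;              /* reset the isolated bits */
        int sl = float_bitscan(bl);       /* step 2: negative if lane empty */
        int sh = float_bitscan(bh) + 32;  /* high-dword offset, as mm7 does */
        *moves = tmpl | sl;               /* step 3: unconditional store */
        moves += (sl >= 0);               /* dummy write if lane was empty */
        *moves = tmpl | sh;
        moves += (sh >= 0);               /* -127 + 32 is still negative */
    }
    return (int)(moves - start);          /* step 6: implicit popcount */
}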
>--
>
>There are two sources of parallelism. The first is the SIMD processing of two
>dwords per MMX register. Performance decreases if the two 32-bit dwords have
>different popcounts, since that implies even more dummy writes.
>
>The second is the use of multiple MMX registers after resetting the previously
>isolated bits. But this suffers if the target sets are sparsely populated. The
>relative performance obviously depends on the population count.
>
>The best case with four MMX registers is finding eight squares per iteration
>and no dummy write. The worst case (assuming != 0) is finding only one square,
>but seven dummy writes. I have no idea about the penalty of multiple writes to
>the same address. Doesn't it depend on the cache strategy (write-through, or
>write-back with dirty tags)?
>
>Due to the parallel processing of the low and high dwords, the generated target
>square order may alternate between the two board halves.
>
>Of course, I will try it in IsiChess. The bitboard parameter may be passed in
>an MMX register from a previous MMX attack getter. But in IsiChess, target sets
>during capture generation are often sparsely populated, because captures are
>generated toward one kind of piece at a time. Have to think about it...
>
>Cheers,
>Gerd


Well, I had a similar idea and implemented it with the following code (based on
Walter's algorithm):

void add_to_move_buffer(bitboard a, int initial_square, zmove *mbuf,
                        int *buff_pos)
{
  bitboard t1, t2;
  zappa_u32 a32, b32, unroll = 0, cont = a.l[0] | a.l[1];
  zmove m;

  if(cont == 0)
    return;

  m.m.prom_to = 0;
  m.m.in_loc = initial_square;

  while(cont)
  {
    //copy the preinitialized move template into the next two slots
    mbuf[*buff_pos].i = m.i;
    mbuf[*buff_pos + 1].i = m.i;

    //pull off the two lowest set bits (sequential dependency chain)
    t1.i = a.i - 1;
    a.i &= t1.i;                      //clear the lowest set bit
    t1.i ^= a.i;                      //t1 = mask of the bits below it
    a32 = t1.l[1] ^ t1.l[0];          //fold the mask to 32 bits
    unroll = a.l[1] | a.l[0];         //nonzero if a second bit remains

    t2.i = a.i - 1;                   //on an empty set this folds to
    a.i &= t2.i;                      //b32 == 0, i.e. a junk square
    t2.i ^= a.i;
    b32 = t2.l[1] ^ t2.l[0];          //fold the second mask to 32 bits
    cont = a.l[1] | a.l[0];

    //interleave the two LSB_64 foldings and table lookups
    a32 ^= LSB_64_magic;
    b32 ^= LSB_64_magic;
    a32 += a32 >> 16;
    b32 += b32 >> 16;
    a32 -= a32 >> 8;
    b32 -= b32 >> 8;
    a32 = LSB_64_table[LSB_64_adj + (zappa_u8)a32];
    b32 = LSB_64_table[LSB_64_adj + (zappa_u8)b32];

    //write both squares; count the second slot only if a second bit
    //existed, so an odd popcount leaves the junk entry uncounted
    mbuf[*buff_pos].m.f_loc = a32;
    mbuf[*buff_pos + 1].m.f_loc = b32;
    *buff_pos += 1 + (unroll != 0);
  }
}

This reduced Zappa's NPS by about 5%.

As far as I can tell, the problem is that the target bitboards are so sparse.
For example, a Knight will have maybe 3 or 4 moves in the middlegame, a King
2-3, etc.  Also, there is a good chance that they will all be in the same half
of the bitboard, so I think that doing the upper and lower halves of the board
in parallel will probably not make much of a difference.

What I am going to try next (after I finish my 5 projects due by Thursday) is
doing two of these move generations in parallel, that is, doing the
bitboard->movelist conversion for two knights in parallel, etc.
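
Roughly this shape, so the two x &= x - 1 dependency chains can overlap
(tmpl1/tmpl2 are the preset 32-bit move templates for the two pieces,
lsb_index() any isolated-bit scan; all names hypothetical):

/* interleave two bitboard->movelist serializations in one loop;
   an empty set produces a junk entry that is never counted and is
   overwritten by the next valid store */
while (t1 | t2) {
    uint64_t b1 = t1 & -t1, b2 = t2 & -t2;
    t1 &= t1 - 1;  t2 &= t2 - 1;
    mbuf[n].i = tmpl1 | lsb_index(b1);
    n += (b1 != 0);
    mbuf[n].i = tmpl2 | lsb_index(b2);
    n += (b2 != 0);
}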

anthony


