Computer Chess Club Archives

Search

Terms

Messages

Subject: Bitboard Serialization

Author: Gerd Isenberg

Date: 11:04:07 12/06/02

Hi All,

Some brainstorming, inspired by the recent postings of Miguel A. Ballicora, to
serialize a bitboard on block, producing a terminated string of target squares.

http://www.talkchess.com/forums/1/message.html?268658

Another point is the kind of traversing bitboards during move generation,
already discussed recently by the bitboard connection. The nested while loops
over piece sets in the outer and attacking target squares in the inner loop.

With Floodfill/Kogge-Stone Attack generation the attacked getter routines need a
bitboard (isolated by b&-b, due i need disjoint target sets, even if there are
possibly smarter approaches by Steffan Westcott) as parameter, rather than a
square index as integer. Is it necessary to do bitscans at all?

As Sune and Russell already mentioned, one need to address the zobrist and
history tables. If one don't use from/to square coordinates, one need a hell
fast hash function (at least faster than two bitscans) that map two isolated
bitboards (difference?) to a quite short value range, in the manner of Walter
Faxon's magig LSB_64.

http://www.talkchess.com/forums/1/message.html?268531

I tried that mod5811, but two parallized mmx-bitscans are clearly faster.

Miguel's algorithm already acts as a kind of move-generator, if we provide some
approriate set of target squares and some preininialized move structure template
(most lokely 32 bits with 6 empty bits in a row). But Miguel needs two nested
file-rank loops, a lot of conditional stuff. The runtime behaviour may be
dependent on the bit positions.

--

How about this MMX,3DNow! (Athlon only) approach, which is able to do a lot in
parallel and avoids most, if not all conditional jumps. The drawback are
possible multiple writes to the same address.

What is more expensive, a conditional jump with the risk of branch misprediction
or a additional write with the same index during writing an array?


Following steps:
--
0. some register initialations
        // mm0 is the bitboards to serialize
	// mm7 5f:7f constant to subtract bit0-offset
        // ecx move list int32-pointer

1. Isolating the bits in both 32-bit dwords.
        // mm0 is the bitboards to serialize
doNscans:
	pxor	mm1, mm1	; 0
	psubd	mm1, mm0	; -bb 32bitwise
	pand	mm1, mm0	; bb & -bb
	pxor	mm0, mm1	; reset isolated bits

2. Two parallel bitscans with base two logarithms via PI2FD.
   If no bit is left anymore in one dword we produce
   a negative result there, where most upper bits set.

	PI2FD	mm1, mm1	; 0, 3f8..,400..
	pslld	mm1, 1		; clear sign bit
	psrld	mm1, 24		; shift 3f8... to 7f (bit 0)
	psubd	mm1, mm7	; - 5f:7f, considers +32 in highword

3. Both dwords in mm1 contain signed square coordinates if >= 0 each.
   Both dwords are moved to memory via register pointer ecx unconditionally.
   But the pointer is incremented by (sq >= 0) rather than one.

	PMOVMSKB eax,mm1	; get sign bits 123,567 !vector path instr.!
        // time to (p)or some other information (source square) into mm1
	not	al		; complement sign bits
	mov	edx, eax
	and	al, 0x04	; 0x04*(ldword >= 0) considers sizeof(int32)
	and	dl, 0x40	; 0x40*(hdword >= 0)
	movd	[ecx], mm1	; write ldword
	add	ecx, eax	; moves += (ldword >= 0);
	punpckhdq mm1, mm1	; hdword, like >>= 32
	shr	edx, 4		; 0x04*(hdword >= 0) considers sizeof(int32)
	movd	[ecx], mm1	; write hdword
	add	ecx, edx	; moves += (hdword >= 0);

4. [optional performance step - loop unrolling]
   To gain more parallelism und and to solve the dependency chains,
   repeat the above sequence one to three times (2-4 in total),
   with mm2...mm5 instead of mm1, and do some clever instruction mixing.

5. If any bits left in mm0, goto 1.)

	pxor	mm1, mm1	; 0
	pcmpeqd	mm1, mm0	; == 0
	PMOVMSKB eax,mm1	; get dwords to nibble
	cmp	eax, 0xff	; all zero ?
	jne	doNscans

This step is even not necessary in a specialized version of this routine for
rook and bishop captures, if step 4 containes 3 unrolled repetitions.
This test may also be done at the beginning of the loop.

6. Implicite popcount (return (ecx - initial pointer)/sizeof(int) ), so no
terminator necessary.

--

There are two sources of parallelism. The first one the parallel SIMD processing
of two dwords per mmx-register. The performance decreases, if both 32bit dwords
have different popcounts. That implies even more dummy writes.

The second one is the use of multiple mmx registers after resetting the previous
isolated one. But this sucks if the target sets are sparely populated. The
relative performance is obviosly depending on the populations count.

Best case with four mmx-registers is finding eight squares per iteration and no
dummy write. Worst case (assume != 0) is only finding one square, but seven
dummy writes. No idea about the penaly of multiple writes to the same address.
Isn't it depending on cache strategy (write through, or dirty tag)?

Due to parallel processing of low and high dwords, the generated target square
order may alternate between the two board halfs.

Of course, i will try it in IsiChess. Bitboard parameter may be passed through
mmx-register from a previous mmx-attack getter. But in IsiChess during capture
generation, target sets are often rarely populated, due to generating captures
to one kind of piece per time. Have to think about...

Cheers,
Gerd

Hash-it-all! Walter Faxon 13:55:55 12/07/02
Re: Bitboard Serialization Anthony Cozzie 13:34:09 12/06/02
- Re: Bitboard Serialization Gerd Isenberg 14:10:30 12/06/02
  - Re: Bitboard Serialization Anthony Cozzie 15:07:10 12/06/02
- Re: Bitboard Serialization Matt Taylor 13:58:20 12/06/02
What kind of memory do i have on the heap or stack? Gerd Isenberg 13:33:24 12/06/02

This page took 0.01 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.