Author: Matt Taylor
Date: 13:58:20 12/06/02
On December 06, 2002 at 16:34:09, Anthony Cozzie wrote:
>On December 06, 2002 at 14:04:07, Gerd Isenberg wrote:
>
>>Hi All,
>>
>>Some brainstorming, inspired by the recent postings of Miguel A. Ballicora, to
>>serialize a bitboard in one block, producing a terminated string of target squares.
>>
>>http://www.talkchess.com/forums/1/message.html?268658
>>
>>Another point is how bitboards are traversed during move generation, already
>>discussed recently in connection with bitboards: the nested while loops, with
>>piece sets in the outer loop and attacked target squares in the inner loop.
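>>
>>For reference, a minimal C sketch of that nested-loop pattern (lsb(), the
>>attacks[] table and the move packing are illustrative placeholders, not engine
>>code):
>>
>>  #include <stdint.h>
>>
>>  static int lsb(uint64_t b)                 /* index of the lowest set bit */
>>  {
>>      int sq = 0;
>>      while (!(b & 1)) { b >>= 1; ++sq; }
>>      return sq;
>>  }
>>
>>  int gen_knight_moves(uint64_t knights, const uint64_t attacks[64], int *movelist)
>>  {
>>      int *m = movelist;
>>      while (knights) {                          /* outer loop: piece set    */
>>          int from = lsb(knights & -knights);    /* b & -b isolates the LS1B */
>>          uint64_t targets = attacks[from];
>>          while (targets) {                      /* inner loop: target set   */
>>              int to = lsb(targets & -targets);
>>              *m++ = (from << 6) | to;           /* packing scheme assumed   */
>>              targets &= targets - 1;            /* clear lowest target bit  */
>>          }
>>          knights &= knights - 1;                /* clear lowest piece bit   */
>>      }
>>      return (int)(m - movelist);                /* number of moves emitted  */
>>  }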
>>
>>With Floodfill/Kogge-Stone attack generation, the attack getter routines take a
>>bitboard as parameter (isolated by b & -b, since I need disjoint target sets,
>>even if there are possibly smarter approaches by Steffan Westcott), rather than
>>a square index as an integer. Is it necessary to do bitscans at all?
>>
>>As Sune and Russell already mentioned, one needs to address the zobrist and
>>history tables. If one doesn't use from/to square coordinates, one needs a very
>>fast hash function (at least faster than two bitscans) that maps two isolated
>>bitboards (their difference?) to a quite short value range, in the manner of
>>Walter Faxon's magic LSB_64.
>>
>>http://www.talkchess.com/forums/1/message.html?268531
>>
>>I tried that mod5811, but two parallelized mmx-bitscans are clearly faster.
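>>
>>For comparison, a C sketch of one such hash-style scan, using the well-known
>>mod-67 perfect hash to map a single isolated bit to its square index (purely an
>>illustration; these are neither Walter Faxon's constants nor the mod5811
>>variant):
>>
>>  #include <stdint.h>
>>
>>  /* 2 is a primitive root mod 67, so the 64 powers of two all leave
>>     distinct nonzero remainders; invert that mapping once at startup. */
>>  static int mod67_to_sq[67];
>>
>>  void init_hash_scan(void)
>>  {
>>      for (int sq = 0; sq < 64; ++sq)
>>          mod67_to_sq[(1ULL << sq) % 67] = sq;
>>  }
>>
>>  int hash_scan(uint64_t isolated_bit)       /* isolated_bit == b & -b */
>>  {
>>      return mod67_to_sq[isolated_bit % 67];
>>  }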
>>
>>Miguel's algorithm already acts as a kind of move generator, if we provide some
>>appropriate set of target squares and some preinitialized move structure
>>template (most likely 32 bits with 6 empty bits in a row). But Miguel needs two
>>nested file-rank loops and a lot of conditional code. The runtime behaviour may
>>depend on the bit positions.
>>
>>--
>>
>>How about this MMX/3DNow! (Athlon only) approach, which is able to do a lot in
>>parallel and avoids most, if not all, conditional jumps? The drawback is
>>possible multiple writes to the same address.
>>
>>Which is more expensive: a conditional jump with the risk of branch
>>misprediction, or an additional write to the same index while filling an array?
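>>
>>In scalar C terms, the trade-off is roughly the difference between these two
>>forms (the function names are illustrative only; sq >= 0 marks a valid square):
>>
>>  /* store only valid squares: a conditional jump that may mispredict */
>>  static int *emit_branchy(int *movelist, int sq)
>>  {
>>      if (sq >= 0)
>>          *movelist++ = sq;
>>      return movelist;
>>  }
>>
>>  /* always store, advance only for a hit: a possible dummy write that
>>     the next store simply overwrites, but no branch at all            */
>>  static int *emit_branchless(int *movelist, int sq)
>>  {
>>      *movelist = sq;
>>      movelist += (sq >= 0);
>>      return movelist;
>>  }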
>>
>>
>>The steps are as follows:
>>--
>>0. Some register initializations
>> // mm0 is the bitboards to serialize
>> // mm7 5f:7f constant to subtract bit0-offset
>> // ecx move list int32-pointer
>>
>>1. Isolating the lowest set bit in each of the two 32-bit dwords.
>> // mm0 is the bitboards to serialize
>>doNscans:
>> pxor mm1, mm1 ; 0
>> psubd mm1, mm0 ; -bb 32bitwise
>> pand mm1, mm0 ; bb & -bb
>> pxor mm0, mm1 ; reset isolated bits
>>
>>2. Two parallel bitscans via base-two logarithms with PI2FD.
>>   If no bit is left in one dword, we produce a negative
>>   result there, with the uppermost bits set.
>>
>> PI2FD mm1, mm1 ; 0, 3f8..,400..
>> pslld mm1, 1 ; clear sign bit
>> psrld mm1, 24 ; shift 3f8... to 7f (bit 0)
>> psubd mm1, mm7 ; - 5f:7f, considers +32 in highword
>>
>>3. Both dwords in mm1 now contain signed square coordinates, each valid if >= 0.
>>   Both dwords are written to memory via the register pointer ecx unconditionally,
>>   but the pointer is advanced by sizeof(int32)*(sq >= 0) rather than always by one slot.
>>
>> PMOVMSKB eax,mm1 ; get sign bits 123,567 !vector path instr.!
>> // time to (p)or some other information (source square) into mm1
>> not al ; complement sign bits
>> mov edx, eax
>> and al, 0x04 ; 0x04*(ldword >= 0) considers sizeof(int32)
>> and dl, 0x40 ; 0x40*(hdword >= 0)
>> movd [ecx], mm1 ; write ldword
>> add ecx, eax ; moves += (ldword >= 0);
>> punpckhdq mm1, mm1 ; hdword, like >>= 32
>> shr edx, 4 ; 0x04*(hdword >= 0) considers sizeof(int32)
>> movd [ecx], mm1 ; write hdword
>> add ecx, edx ; moves += (hdword >= 0);
>>
>>4. [optional performance step - loop unrolling]
>>   To gain more parallelism and to break the dependency chains,
>>   repeat the above sequence one to three more times (2-4 in total),
>>   with mm2...mm5 instead of mm1, and do some clever instruction mixing.
>>
>>5. If any bits are left in mm0, go to 1.
>>
>> pxor mm1, mm1 ; 0
>> pcmpeqd mm1, mm0 ; == 0
>> PMOVMSKB eax,mm1 ; get dwords to nibble
>> cmp eax, 0xff ; all zero ?
>> jne doNscans
>>
>>This step is not even necessary in a specialized version of this routine for
>>rook and bishop captures, if step 4 contains 3 unrolled repetitions.
>>This test may also be done at the beginning of the loop.
>>
>>6. Implicit popcount (return (ecx - initial pointer)/sizeof(int)), so no
>>terminator is necessary.
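>>
>>Read back into plain C, the sequence computes roughly the following (a scalar
>>sketch of one reading of the asm, without the SIMD parallelism; bitscan32() is
>>an illustrative stand-in for the PI2FD exponent trick, and the move list needs
>>one slot of slack for the trailing dummy write):
>>
>>  #include <stdint.h>
>>
>>  static int bitscan32(uint32_t b)     /* index of the single set bit, -1 if 0 */
>>  {
>>      int sq = 0;
>>      if (b == 0) return -1;
>>      while (!(b & 1)) { b >>= 1; ++sq; }
>>      return sq;
>>  }
>>
>>  int serialize(uint64_t bb, int *movelist)
>>  {
>>      int *m = movelist;
>>      uint32_t lo = (uint32_t)bb, hi = (uint32_t)(bb >> 32);
>>      while (lo | hi) {
>>          uint32_t l1 = lo & -lo, h1 = hi & -hi;   /* step 1: b & -b            */
>>          int lsq = bitscan32(l1);                 /* step 2: scans, -1 if none */
>>          int hsq = bitscan32(h1);
>>          lo ^= l1;  hi ^= h1;                     /* reset the isolated bits   */
>>          *m = lsq;       m += (lsq >= 0);         /* step 3: write always,     */
>>          *m = hsq + 32;  m += (hsq >= 0);         /*         advance on a hit  */
>>      }
>>      return (int)(m - movelist);                  /* step 6: implicit popcount */
>>  }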
>>
>>--
>>
>>There are two sources of parallelism. The first is the SIMD processing of two
>>dwords per mmx-register. The performance decreases if the two 32-bit dwords
>>have different popcounts, since that implies even more dummy writes.
>>
>>The second is the use of multiple mmx registers after resetting the previously
>>isolated bits. But this suffers if the target sets are sparsely populated. The
>>relative performance obviously depends on the population count.
>>
>>The best case with four mmx-registers is finding eight squares per iteration and
>>no dummy writes. The worst case (assuming a nonzero bitboard) is finding only one
>>square, at the cost of seven dummy writes. I have no idea about the penalty of
>>multiple writes to the same address. Doesn't it depend on the cache strategy
>>(write-through, or dirty tag)?
>>
>>Due to the parallel processing of low and high dwords, the generated target
>>square order may alternate between the two board halves.
>>
>>Of course, I will try it in IsiChess. The bitboard parameter may be passed in an
>>mmx-register from a previous mmx attack getter. But in IsiChess, target sets are
>>often sparsely populated during capture generation, since captures are generated
>>against one kind of piece at a time. I have to think about it...
>>
>>Cheers,
>>Gerd
>
>
>well, I had a similar idea and implemented it with the following code (based on
>Walter's algorithm):
>
>void add_to_move_buffer(bitboard a, int initial_square, zmove *mbuf, int *buff_pos)
>{
>    bitboard t1, t2;
>    zappa_u32 a32, b32, unroll = 0, cont = a.l[0] | a.l[1];
>    zmove m;
>
>    if(cont == 0)
>        return;
>
>    m.m.prom_to = 0;
>    m.m.in_loc = initial_square;
>
>    while(cont)
>    {
>        //copy the move template into the next two slots
>        mbuf[*buff_pos].i = m.i;
>        mbuf[*buff_pos + 1].i = m.i;
>
>        //pull off two low-order ones (sequential)
>        t1.i = a.i - 1;
>        a.i &= t1.i;             //clear lowest bit
>        t1.i ^= a.i;             //mask below the isolated bit
>        a32 = t1.l[1] ^ t1.l[0]; //first bit, folded to 32 bits
>        unroll = a.l[1] | a.l[0];
>
>        t2.i = a.i - 1;
>        a.i &= t2.i;
>        t2.i ^= a.i;
>        b32 = t2.l[1] ^ t2.l[0]; //second bit, folded to 32 bits
>        cont = a.l[1] | a.l[0];
>
>        //interleave folding and table lookup
>        a32 ^= LSB_64_magic;
>        b32 ^= LSB_64_magic;
>        a32 += a32 >> 16;
>        b32 += b32 >> 16;
>        a32 -= a32 >> 8;
>        b32 -= b32 >> 8;
>        a32 = LSB_64_table[LSB_64_adj + (zappa_u8)a32];
>        b32 = LSB_64_table[LSB_64_adj + (zappa_u8)b32];
>
>        //write data
>        mbuf[*buff_pos].m.f_loc = a32;
>        mbuf[*buff_pos+1].m.f_loc = b32;
>        *buff_pos += 2;
>    }
>}
>
>This reduced zappa's NPS by about 5%.
>
>As far as I can tell, the problem is that the bitboard is so sparse. For
>example, a Knight will have maybe 3 or 4 moves in the middlegame, a King 2-3,
>etc. Also, there is a good chance that they will all be in the same half of the
>bitboard, so I think that doing the upper and lower halves of the board in
>parallel will probably not make much of a difference.
>
>What I am going to try next (after I finish my 5 projects due by Thursday) is
>doing two of these move generations in parallel, that is, doing the
>bitboard->movelist conversion for two knights in parallel, etc.
>
>anthony
I am thinking that perhaps the bit search-and-reset solves the wrong problem for
this loop: you can be very good at pulling off the -first- bit, but you have to
go back and search again each time.
This loop only needs to generate an unordered list of moves; someone else
already brought up this point. If you allow for those wider constraints, you
should be able to find an even better solution.
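
One way to exploit that looser constraint, sketched purely as an illustration
(the table layout and names are hypothetical, not from anyone's engine), is to
serialize the board a byte at a time through a precomputed table, so the
repeated per-bit search disappears entirely:

  #include <stdint.h>

  /* For every byte value: the indices (0..7) of its set bits and how
     many there are.  Built once at startup. */
  static uint8_t byte_bits[256][8];
  static uint8_t byte_count[256];

  void init_byte_tables(void)
  {
      for (int v = 0; v < 256; ++v) {
          int n = 0;
          for (int i = 0; i < 8; ++i)
              if (v & (1 << i))
                  byte_bits[v][n++] = (uint8_t)i;
          byte_count[v] = (uint8_t)n;
      }
  }

  /* Walk the eight bytes of the board and copy their precomputed bit
     lists, offset by 8 * byte number; returns the number of squares. */
  int serialize_by_bytes(uint64_t bb, int *squares)
  {
      int n = 0;
      for (int r = 0; r < 8; ++r) {
          unsigned v = (unsigned)((bb >> (8 * r)) & 0xff);
          for (int i = 0; i < byte_count[v]; ++i)
              squares[n++] = 8 * r + byte_bits[v][i];
      }
      return n;
  }
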
-Matt