Author: Gerd Isenberg
Date: 14:10:30 12/06/02
Go up one level in this thread
On December 06, 2002 at 16:34:09, Anthony Cozzie wrote: >On December 06, 2002 at 14:04:07, Gerd Isenberg wrote: > >>Hi All, >> >>Some brainstorming, inspired by the recent postings of Miguel A. Ballicora, to >>serialize a bitboard on block, producing a terminated string of target squares. >> >>http://www.talkchess.com/forums/1/message.html?268658 >> >>Another point is the kind of traversing bitboards during move generation, >>already discussed recently by the bitboard connection. The nested while loops >>over piece sets in the outer and attacking target squares in the inner loop. >> >>With Floodfill/Kogge-Stone Attack generation the attacked getter routines need a >>bitboard (isolated by b&-b, due i need disjoint target sets, even if there are >>possibly smarter approaches by Steffan Westcott) as parameter, rather than a >>square index as integer. Is it necessary to do bitscans at all? >> >>As Sune and Russell already mentioned, one need to address the zobrist and >>history tables. If one don't use from/to square coordinates, one need a hell >>fast hash function (at least faster than two bitscans) that map two isolated >>bitboards (difference?) to a quite short value range, in the manner of Walter >>Faxon's magig LSB_64. >> >>http://www.talkchess.com/forums/1/message.html?268531 >> >>I tried that mod5811, but two parallized mmx-bitscans are clearly faster. >> >>Miguel's algorithm already acts as a kind of move-generator, if we provide some >>approriate set of target squares and some preininialized move structure template >>(most lokely 32 bits with 6 empty bits in a row). But Miguel needs two nested >>file-rank loops, a lot of conditional stuff. The runtime behaviour may be >>dependent on the bit positions. >> >>-- >> >>How about this MMX,3DNow! (Athlon only) approach, which is able to do a lot in >>parallel and avoids most, if not all conditional jumps. The drawback are >>possible multiple writes to the same address. >> >>What is more expensive, a conditional jump with the risk of branch misprediction >>or a additional write with the same index during writing an array? >> >> >>Following steps: >>-- >>0. some register initialations >> // mm0 is the bitboards to serialize >> // mm7 5f:7f constant to subtract bit0-offset >> // ecx move list int32-pointer >> >>1. Isolating the bits in both 32-bit dwords. >> // mm0 is the bitboards to serialize >>doNscans: >> pxor mm1, mm1 ; 0 >> psubd mm1, mm0 ; -bb 32bitwise >> pand mm1, mm0 ; bb & -bb >> pxor mm0, mm1 ; reset isolated bits >> >>2. Two parallel bitscans with base two logarithms via PI2FD. >> If no bit is left anymore in one dword we produce >> a negative result there, where most upper bits set. >> >> PI2FD mm1, mm1 ; 0, 3f8..,400.. >> pslld mm1, 1 ; clear sign bit >> psrld mm1, 24 ; shift 3f8... to 7f (bit 0) >> psubd mm1, mm7 ; - 5f:7f, considers +32 in highword >> >>3. Both dwords in mm1 contain signed square coordinates if >= 0 each. >> Both dwords are moved to memory via register pointer ecx unconditionally. >> But the pointer is incremented by (sq >= 0) rather than one. >> >> PMOVMSKB eax,mm1 ; get sign bits 123,567 !vector path instr.! >> // time to (p)or some other information (source square) into mm1 >> not al ; complement sign bits >> mov edx, eax >> and al, 0x04 ; 0x04*(ldword >= 0) considers sizeof(int32) >> and dl, 0x40 ; 0x40*(hdword >= 0) >> movd [ecx], mm1 ; write ldword >> add ecx, eax ; moves += (ldword >= 0); >> punpckhdq mm1, mm1 ; hdword, like >>= 32 >> shr edx, 4 ; 0x04*(hdword >= 0) considers sizeof(int32) >> movd [ecx], mm1 ; write hdword >> add ecx, edx ; moves += (hdword >= 0); >> >>4. [optional performance step - loop unrolling] >> To gain more parallelism und and to solve the dependency chains, >> repeat the above sequence one to three times (2-4 in total), >> with mm2...mm5 instead of mm1, and do some clever instruction mixing. >> >>5. If any bits left in mm0, goto 1.) >> >> pxor mm1, mm1 ; 0 >> pcmpeqd mm1, mm0 ; == 0 >> PMOVMSKB eax,mm1 ; get dwords to nibble >> cmp eax, 0xff ; all zero ? >> jne doNscans >> >>This step is even not necessary in a specialized version of this routine for >>rook and bishop captures, if step 4 containes 3 unrolled repetitions. >>This test may also be done at the beginning of the loop. >> >>6. Implicite popcount (return (ecx - initial pointer)/sizeof(int) ), so no >>terminator necessary. >> >>-- >> >>There are two sources of parallelism. The first one the parallel SIMD processing >>of two dwords per mmx-register. The performance decreases, if both 32bit dwords >>have different popcounts. That implies even more dummy writes. >> >>The second one is the use of multiple mmx registers after resetting the previous >>isolated one. But this sucks if the target sets are sparely populated. The >>relative performance is obviosly depending on the populations count. >> >>Best case with four mmx-registers is finding eight squares per iteration and no >>dummy write. Worst case (assume != 0) is only finding one square, but seven >>dummy writes. No idea about the penaly of multiple writes to the same address. >>Isn't it depending on cache strategy (write through, or dirty tag)? >> >>Due to parallel processing of low and high dwords, the generated target square >>order may alternate between the two board halfs. >> >>Of course, i will try it in IsiChess. Bitboard parameter may be passed through >>mmx-register from a previous mmx-attack getter. But in IsiChess during capture >>generation, target sets are often rarely populated, due to generating captures >>to one kind of piece per time. Have to think about... >> >>Cheers, >>Gerd > > >well, I had a similar idea and implemented it with the following code (based on >Walter's algorithm): > >void add_to_move_buffer(bitboard a, int initial_square, zmove *mbuf, int >*buff_pos) >{ > bitboard t1, t2; > zappa_u32 a32, b32, unroll = 0, cont = a.l[0] | a.l[1]; > zmove m; > > if(cont == 0) > return; > > m.m.prom_to = 0; > m.m.in_loc = initial_square; > > while(cont) > { > //copy space > mbuf[*buff_pos].i = m.i; > mbuf[*buff_pos + 1].i = m.i; > > //pull off two leading ones (sequential) > t1.i = a.i - 1; > a.i &= t1.i; //remove high bit > t1.i ^= a.i; //isolate high bit > a32 = t1.l[1] ^ t1.l[0]; //first bit > unroll = a.l[1] | a.l[0]; > > t2.i = a.i - 1; > a.i &= t2.i; > t2.i ^= a.i; > b32 = t2.l[1] ^ t2.l[0]; //second bit > cont = a.l[1] | a.l[0]; > > //interleave folding and tb lookup > a32 ^= LSB_64_magic; > b32 ^= LSB_64_magic; > a32 += a32 >> 16; > b32 += b32 >> 16; > a32 -= a32 >> 8; > b32 -= b32 >> 8; > a32 = LSB_64_table[LSB_64_adj + (zappa_u8)a32]; > b32 = LSB_64_table[LSB_64_adj + (zappa_u8)b32]; > > //write data > mbuf[*buff_pos].m.f_loc = a32; > mbuf[*buff_pos+1].m.f_loc = b32; > *buff_pos += 2; > } >} > >This reduced zappa's NPS by about 5%. > >As far as I can tell, the problem is that the array is so sparse. For example, >a Knight will have maybe 3 or 4 moves in the middlegame, a King 2-3, etc. Also, >there is a good chance that they will all be in the same half of the bitboard, >so I think that doing the upper and lower halves of the board in parallel will >probably not make much of a difference. > >What I am going to try next (after I finish my 5 projects due by thursday) is >doing two of these move generations in parallel, that is doing the >bitboard->movelist for two knights in parallel, etc. > >anthony Hi Anthony, That's interesting, Walter's routine is really great. Yes the problems are the sparely populated bitboards. The mmx-stuff makes high- and low-board in parallel due to SIMD-instructions. The Athlon(XP?) makes up to four register independent mmx-instruction simultaniously. Therefor step 4.) is really required. Another idea is to do some BxQ and BxR in parallel without any conditional jumps, except an initial "if ( bb != 0 )", or may be not even that. Similar SSE2-instructions are also available for P4 and Hammer of course. But Athlon is about twice as fast as P4 with this independent mmx-stuff. btw. maximum iteration count on a halfboard per piece: (captures) Rook 10 (4) Bishop 7 (4) Queen 17 (8) Knight 6 (6) King 8 (8!) Regards, Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.