Author: Vasik Rajlich
Date: 06:02:11 01/20/06
Go up one level in this thread
On January 20, 2006 at 02:51:01, Chan Rasjid wrote: >On January 19, 2006 at 14:26:11, Gerd Isenberg wrote: > >Thanks Gerd, > >As expected, we cannot trick the world. I did not suspect boolean may translate >to branching of 0/1, am still learning! > >Incidentally, I'll like to mention the bitboard vs offset debate. I can now see >a complete BB implentation as I have started from scratch and now have completed >a BB implementation except for eval() and 1 line trigering an assert(). So >everything's fine. It does not need a BB move generator! rather it is BB >incremental generation plus partial as we can easily extract 1 move at a time. > >In the past Christophe Theron argued against BB pointing out not a single BB >engine made it to to SSDF list. Now we have Rybka at the top so likely others >may think a little more about BB. > >1) BB_TO_SQ() call - I am not sure this was one of the main contentious issue. >I dont do rotated,just improvise. I doubt rotated will add much advantage as >there is the law of natural limitation in everything. For sliding pieces >and given a 2 bit BB of bbFrom, bbTo, I will have the fromSq and need to get >toSq. I try this. The sliding path space of 4 are +15, +17, +16, +1. > >signed i64 bbTo, bb2bit; > >bbTo -= bbFrom; >no_of_step = i = popcount(bbTo * (1 | bbTo >> 63) & dir_path);//may use abs() >toSq = fromSQ + i * (1 | bbTo >> 63) * positive_step; > >We may also use popcount((bb2bit & bb2bit - 1 - (bb2bit & -bb2bit))& dir_path); > >I have not tested if the above is better then BB_TO_SQ(), at least I'm happy not > having to shift by ranks with a table lookup. > >2) I think all BB engine don't have a generator, just incremental updates. >BB moves are basically the attack BB of all the pieces + 1 extra BB for pawns >push. N / K don't even need any update unless they move and I have this :- > >static const u64 kDzE4 = 0x0000382838000000; >static const u64 nDzE4 = 0x0028440044280000; > >static const u64 nMask[2] = {FILE_3TO8, FILE_1TO6}; >static const u64 kMask[2] = {FILE_2TO8, FILE_1TO7}; > >u64 getNMap(int sq){ > if (sq >= E4) > return (nDzE4 << (sq - E4)) & nMask[FILE64(sq) - 4 >> 31 & 1]; > return (nDzE4 >> (E4 - sq)) & nMask[FILE64(sq) - 4 >> 31 & 1]; >} > > >u64 getKMap(int sq){ > if (sq >= E4) > return (kDzE4 << (sq - E4)) & kMask[FILE64(sq) - 4 >> 31 & 1]; > return (kDzE4 >> (E4 - sq)) & kMask[FILE64(sq) - 4 >> 31 & 1]; >} > >Pawns are simple. > >We have :- >//pS is pointer to global stack for my non-recursive search. > > *(pS + 1)->bbMoveStack = *(pS - 1)->bbMoveStack; > doMove(board, pS, *pS->pCurrentMove, side); > if (genIncrementalBB(board, pS + 1, *pS->pCurrentMove, incheck, > pS->sqCaptureEP, side ^ 1 , side, *board->pAllBits)); > else{//capt King, tmp > undoMove(board, pS, *pS->pCurrentMove, side); > //etc.. > ++pS->pCurrentMove; > continue; > } > > genIncrementalBB(board, pS, *pS->pCurrentMove, incheck, > pS->sqCaptureEP, side, side, *board->pAllBits); > >At this point I have the complete BB attack-tables of the two sides and may >simply extract 1 move at a time and have complete info to select the best move >of each phase.With offset, I cannot see how incremental move generation may be >done. Fruit may have partial but not incremental generation. > I'm a little confused as to what exactly you're doing in the above code. For example, getting knight target squares is best done via simply: unsigned int64 knight_targets = knight_target_array [knight_sq] & ~same_color_pieces; Anyway, it's not clear that an incremental approach is faster - and it's a hell of a lot messier. I just compute everything when I need it. >3) QS move generation - In the past debate, it seemed no one mentioned this. > 90% of all nodes are QS nodes. Getting BB captures are the simplest. With >offset we still have to start from the targets and test by sliding thru empty >squares etc. So BB here may have a significantly influential advantage. Is this >the "secret of Rybka" ? > I don't know that it's really possible to compare board representations, and a few % speedup isn't all that important. Bitboards are certainly very elegant when it comes to various forms of "selective move generation". Vas >Best Regards, >Rasjid > > >January 19, 2006 at 12:45:10, Chan Rasjid wrote: >> >>> >>>When using bitboards, often we have arrays that require only indices 0 / 1. >>>eg if we have no underpromotions, then there is either 1/2 knights per color. >>>Sometimes ,in such situations, codes may be re-written without using the if () >>>statement. But whether such simple tricks can avoid the cost of branch >>>misprediction ( I only have a vague idea about this ). >>> >>>X[i] where i = 0 / 1; >>> >>>1) if ( a & b ){ >>> j = X[1]; >>> }else{ >>> j = X[0]; >>> } >>> //etc.. codes dependent on j. >>> >>>2) Now without if() :- >>> >>> j = X[(a & b)!= 0]; >>> //etc.. codes dependent on j. >> >>Hi Rasjid, >> >>whether a boolean statement works without branches depends on the compiler and >>architecture. Recent x86 compiler should do at least 2) branchless, using a >>setnz-instruction for (a & b)!= 0, for instance: >> >>xor edx, edx ; clear dword >>test rax, rbx ; get the zero flag >>setnz dl ; edx := (a & b) ? 1 : 0 >>mov edx, X[edx*4] ; assuming X is int array >> >>> >>>How (if ? ) can such a simple re-coding cause an improvement. Can there >>>be some situations related to the above with significanr misprediction >>>overheads ? >> >>If the pattern is hard to predict and/or you have so many branches, that the >>branch target buffer is likely to get trashed, i think it is a good idea to >>avoid conditional jumps. Specially with very small conditional bodies in small, >>but often used inlined functions or macros. >> >>With pattern like 99% of the time 00010001... (for instance 0=then case taken, >>1=else case taken) conditional jumps are fine. You may try to avoid else cases >>if possible as well - or leave that and possible branchless tricks to the >>"pogoable" compiler. >> >>While x86 branch replace instructions like setcc, cmovcc, sbb and adc are flag >>dependent, they don't allow a more dense instruction scheduling for multiple >>conditional assignments. That is not true for "sign flag" dependency after a >>subtraction, where the sign flag is also present as MSB inside a register. So >>one may use logical shift right 31 to get boolean {0,1}-values or arithmetical >>shift right 31 (or cdq) for {0,-1}-masks, if you are aware of the range of the >>signed or unsigned values to subtract, for instance: >> >>if ( a > b ) x += 7; >>if ( c > d ) x += 5; >> >>// use < 0 expressions - considering ranges! >>// this unsafe optimization can not be done by the compiler! >>if ( (signed)(b-a) < 0 ) x += 7; >>if ( (signed)(d-c) < 0 ) x += 5; >> >>// the branchless approach with >>// some prospects of parallel work >>x += ((signed)(b-a)>>31) & 7; >>x += ((signed)(d-c)>>31) & 5; >> >>possible assembly: >> >>sub ebx, eax ; b-a >>sub edx, ecx ; d-c >>sar ebx, 31 >>sar edx, 31 >>and ebx, 7 >>and edx, 5 >>add r08, ebx ; x += ((signed)(b-a)>>31) & 7 >>add r08, edx ; x += ((signed)(d-c)>>31) & 5 >> >>Pairs quite well, but P4 sucks due to slow shift. >>On amd that will take about the same number of cycles as one sequential >>conditional assignment. If processing vectors of this stuff, sse2 comes in mind. >> >>Gerd >> >>> >>>Thanks >>>Rasjid
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.