Author: Chan Rasjid
Date: 23:51:01 01/19/06
Go up one level in this thread
On January 19, 2006 at 14:26:11, Gerd Isenberg wrote:
Thanks Gerd,
As expected, we cannot trick the world. I did not suspect boolean may translate
to branching of 0/1, am still learning!
Incidentally, I'll like to mention the bitboard vs offset debate. I can now see
a complete BB implentation as I have started from scratch and now have completed
a BB implementation except for eval() and 1 line trigering an assert(). So
everything's fine. It does not need a BB move generator! rather it is BB
incremental generation plus partial as we can easily extract 1 move at a time.
In the past Christophe Theron argued against BB pointing out not a single BB
engine made it to to SSDF list. Now we have Rybka at the top so likely others
may think a little more about BB.
1) BB_TO_SQ() call - I am not sure this was one of the main contentious issue.
I dont do rotated,just improvise. I doubt rotated will add much advantage as
there is the law of natural limitation in everything. For sliding pieces
and given a 2 bit BB of bbFrom, bbTo, I will have the fromSq and need to get
toSq. I try this. The sliding path space of 4 are +15, +17, +16, +1.
signed i64 bbTo, bb2bit;
bbTo -= bbFrom;
no_of_step = i = popcount(bbTo * (1 | bbTo >> 63) & dir_path);//may use abs()
toSq = fromSQ + i * (1 | bbTo >> 63) * positive_step;
We may also use popcount((bb2bit & bb2bit - 1 - (bb2bit & -bb2bit))& dir_path);
I have not tested if the above is better then BB_TO_SQ(), at least I'm happy not
having to shift by ranks with a table lookup.
2) I think all BB engine don't have a generator, just incremental updates.
BB moves are basically the attack BB of all the pieces + 1 extra BB for pawns
push. N / K don't even need any update unless they move and I have this :-
static const u64 kDzE4 = 0x0000382838000000;
static const u64 nDzE4 = 0x0028440044280000;
static const u64 nMask[2] = {FILE_3TO8, FILE_1TO6};
static const u64 kMask[2] = {FILE_2TO8, FILE_1TO7};
u64 getNMap(int sq){
if (sq >= E4)
return (nDzE4 << (sq - E4)) & nMask[FILE64(sq) - 4 >> 31 & 1];
return (nDzE4 >> (E4 - sq)) & nMask[FILE64(sq) - 4 >> 31 & 1];
}
u64 getKMap(int sq){
if (sq >= E4)
return (kDzE4 << (sq - E4)) & kMask[FILE64(sq) - 4 >> 31 & 1];
return (kDzE4 >> (E4 - sq)) & kMask[FILE64(sq) - 4 >> 31 & 1];
}
Pawns are simple.
We have :-
//pS is pointer to global stack for my non-recursive search.
*(pS + 1)->bbMoveStack = *(pS - 1)->bbMoveStack;
doMove(board, pS, *pS->pCurrentMove, side);
if (genIncrementalBB(board, pS + 1, *pS->pCurrentMove, incheck,
pS->sqCaptureEP, side ^ 1 , side, *board->pAllBits));
else{//capt King, tmp
undoMove(board, pS, *pS->pCurrentMove, side);
//etc..
++pS->pCurrentMove;
continue;
}
genIncrementalBB(board, pS, *pS->pCurrentMove, incheck,
pS->sqCaptureEP, side, side, *board->pAllBits);
At this point I have the complete BB attack-tables of the two sides and may
simply extract 1 move at a time and have complete info to select the best move
of each phase.With offset, I cannot see how incremental move generation may be
done. Fruit may have partial but not incremental generation.
3) QS move generation - In the past debate, it seemed no one mentioned this.
90% of all nodes are QS nodes. Getting BB captures are the simplest. With
offset we still have to start from the targets and test by sliding thru empty
squares etc. So BB here may have a significantly influential advantage. Is this
the "secret of Rybka" ?
Best Regards,
Rasjid
January 19, 2006 at 12:45:10, Chan Rasjid wrote:
>
>>
>>When using bitboards, often we have arrays that require only indices 0 / 1.
>>eg if we have no underpromotions, then there is either 1/2 knights per color.
>>Sometimes ,in such situations, codes may be re-written without using the if ()
>>statement. But whether such simple tricks can avoid the cost of branch
>>misprediction ( I only have a vague idea about this ).
>>
>>X[i] where i = 0 / 1;
>>
>>1) if ( a & b ){
>> j = X[1];
>> }else{
>> j = X[0];
>> }
>> //etc.. codes dependent on j.
>>
>>2) Now without if() :-
>>
>> j = X[(a & b)!= 0];
>> //etc.. codes dependent on j.
>
>Hi Rasjid,
>
>whether a boolean statement works without branches depends on the compiler and
>architecture. Recent x86 compiler should do at least 2) branchless, using a
>setnz-instruction for (a & b)!= 0, for instance:
>
>xor edx, edx ; clear dword
>test rax, rbx ; get the zero flag
>setnz dl ; edx := (a & b) ? 1 : 0
>mov edx, X[edx*4] ; assuming X is int array
>
>>
>>How (if ? ) can such a simple re-coding cause an improvement. Can there
>>be some situations related to the above with significanr misprediction
>>overheads ?
>
>If the pattern is hard to predict and/or you have so many branches, that the
>branch target buffer is likely to get trashed, i think it is a good idea to
>avoid conditional jumps. Specially with very small conditional bodies in small,
>but often used inlined functions or macros.
>
>With pattern like 99% of the time 00010001... (for instance 0=then case taken,
>1=else case taken) conditional jumps are fine. You may try to avoid else cases
>if possible as well - or leave that and possible branchless tricks to the
>"pogoable" compiler.
>
>While x86 branch replace instructions like setcc, cmovcc, sbb and adc are flag
>dependent, they don't allow a more dense instruction scheduling for multiple
>conditional assignments. That is not true for "sign flag" dependency after a
>subtraction, where the sign flag is also present as MSB inside a register. So
>one may use logical shift right 31 to get boolean {0,1}-values or arithmetical
>shift right 31 (or cdq) for {0,-1}-masks, if you are aware of the range of the
>signed or unsigned values to subtract, for instance:
>
>if ( a > b ) x += 7;
>if ( c > d ) x += 5;
>
>// use < 0 expressions - considering ranges!
>// this unsafe optimization can not be done by the compiler!
>if ( (signed)(b-a) < 0 ) x += 7;
>if ( (signed)(d-c) < 0 ) x += 5;
>
>// the branchless approach with
>// some prospects of parallel work
>x += ((signed)(b-a)>>31) & 7;
>x += ((signed)(d-c)>>31) & 5;
>
>possible assembly:
>
>sub ebx, eax ; b-a
>sub edx, ecx ; d-c
>sar ebx, 31
>sar edx, 31
>and ebx, 7
>and edx, 5
>add r08, ebx ; x += ((signed)(b-a)>>31) & 7
>add r08, edx ; x += ((signed)(d-c)>>31) & 5
>
>Pairs quite well, but P4 sucks due to slow shift.
>On amd that will take about the same number of cycles as one sequential
>conditional assignment. If processing vectors of this stuff, sse2 comes in mind.
>
>Gerd
>
>>
>>Thanks
>>Rasjid
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.