Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: Bitboards, pre-computing moves, and blocked squares?

Author: Gerd Isenberg
Date: 09:16:02 12/03/04
On December 03, 2004 at 07:22:10, Vasik Rajlich wrote:

>On December 02, 2004 at 17:39:01, Gerd Isenberg wrote:
>
>>On December 02, 2004 at 14:20:40, Vasik Rajlich wrote:
>>
>>>On December 02, 2004 at 12:51:28, Gerd Isenberg wrote:
>>>
>>>><snip>
>>>>>>Hi Vas,
>>>>>>
>>>>>>yes, you are right. In a rotated bitboard environment the x^(x-2) trick doesn't
>>>>>>makes sense, because a lookup already provides attacks in both opposite
>>>>>>directions on one ray for one particular sliding piece.
>>>>>>
>>>>>>In my current mmx-implementation i don't use rotated attack lookups anymore but
>>>>>>fillbased direction wise attack getters (Kogge-Stone and dumb7fill).
>>>>>>As you can see, the fillbased attack getters work with bitboard parameters
>>>>>>instead of a singular square index, and are able to generate attacks of sets,
>>>>>>multiple pieces (rooks/queen for straight directions, bishops/queen for diagonal
>>>>>>directions).
>>>>>>
>>>>>>Those fill based routines are great to get sets of taboo squares for the
>>>>>>opposite king and to get pinned pieces - and to get something like a progessive
>>>>>>mobility, e.g. a set of squares rooks may reach in two or even more moves etc..
>>>>>>
>>>>>>I use the x^(x-2) trick to substitute following Kogge-Stone fill-routine, which
>>>>>>saves a few cycles for this particular direction:
>>>>>>
>>>>>>// Kogge-Stone implementation by Steffan Westcott
>>>>>>
>>>>>>BitBoard getRightRookAttacks (BitBoard rooks, BitBoard empty)
>>>>>>{
>>>>>>  return shiftRight(fillRightOccluded(rooks, empty));
>>>>>>}
>>>>>>
>>>>>>BitBoard shiftRight (BitBoard b)
>>>>>>{
>>>>>>  return (b<<1) & 0xfefefefefefefefe;
>>>>>>}
>>>>>>
>>>>>>BitBoard fillRightOccluded(BitBoard g, BitBoard p)
>>>>>>{
>>>>>>  p &= 0xfefefefefefefefe;
>>>>>>  g |= p & (g << 1);
>>>>>>  p &= (p << 1);
>>>>>>  g |= p & (g << 2);
>>>>>>  p &= (p << 2);
>>>>>>  return g |= p & (g << 4);
>>>>>>}
>>>>>
>>>>>Aha. So, you have also something like:
>>>>>
>>>>>BitBoard getALLRookAttacks (BitBoard rooks, BitBoard empty)
>>>>>{
>>>>>return
>>>>>      getRightRookAttacks (rooks, empty) |
>>>>>      getLeftRookAttacks (rooks, empty) |
>>>>>      getUpRookAttacks (rooks, empty) |
>>>>>      getDownRookAttacks (rooks, empty) ;
>>>>>}
>>>>>
>>>>>And then you can do things like:
>>>>>
>>>>>Bitboard two_moves_away = getALLRookAttacks (getALLRookAttacks (rooks, empty),
>>>>>empty);
>>>>
>>>>
>>>>That's the idea.
>>>>
>>>>>
>>>>>etc ...
>>>>>
>>>>>I'd guess this is slower than a lookup for a one-bit bitboard, but more
>>>>>efficient once the mask starts to become more populated.
>>>>
>>>>Yes. I even tried it with single pupulated pieces (1ULL<<sq) in my former
>>>>rotated lookup approach. Queen attacks (eight directions) is IIRC something like
>>>>100 cycles with mmx-assembly (branchless and up to four mmx-instructions
>>>>simultaniously).
>>>>
>>>>Huge 64-bit register files are necessary (AMD64, MMX, SSE2, Altivec, Itanium).
>>>>Of couse it sucks on x86-32 with the few 32-bit gp-registers.
>>>>But of course, 1/1 replacement of rotated attack getters with fill-routines is
>>>>not the smartest way...
>>>>
>>>
>>>As far as I can see, in order to generate moves, you'll need to use the (1<<sq)
>>>bitboards. Maybe it makes sense to keep the lookup tables for move gen, and the
>>>fill routines when you need to "fill" starting with >1 bit.
>>
>>
>>That is a very reasonable approach - a pure fill approach requires some redesign
>>to become efficient.
>>
>>With fill routines you are more in the pure bitboard centric world of sets
>>without the need of converting single bits into square metric by "dead slow"
>>bitscan ;-)
>>
>
>Hmmm. I follow the reference :) - and there is some truth to it. The approach of
>bitscanning, then looking up each square, "feels" kind of slow. I am not sure
>how you can avoid it for generating moves, though.
>
>>>
>>>>
>>>>>
>>>>>This might be an interesting tool to do some sort of "strategic reasoning". For
>>>>>example, pass (!pawns) into the second parameter, or even (! (very fixed
>>>>>pawns)), etc.
>>>>
>>>>Exactly - one may even "and" the "atacks in one set" with other taboo and
>>>>occupied by own sets...
>>>>
>>>
>>>Actually yes I can see a ton of ways to use this. The question is which ones are
>>>really helpful eval terms.
>>
>>
>>I guess there are a few conditional ones looking for trapped sliders and looking
>>for sliding trajectories against the opposite king or the ability to control far
>>stopp squares of passers in at least two moves etc.
>>
>>But yes, difficult stuff.
>
>I became convinced quite a long time ago that bitboards would shine once you
>start to make a really good evaluation - in this I agree with Bob Hyatt.
>
>Unfortunately at the moment Rybka follows an even more basic law: if your
>evaluation is bad, at least keep it cheap ;) There's nothing worse than a big
>mess of untested eval terms ...
>
>>
>>
>>>
>>>>>
>>>>>BTW I don't see how to get pins from this - unless you will manually unset bits
>>>>>in the "empty" bitboard ...
>>>>
>>>>No - you have only to consider the king as meta-slider as well and then "and"
>>>>opposite directions, e.g. for one of eight possible pin-directions:
>>>>
>>>>pinnedRem[RIGHT] = getRightRookAttacks (opprooks|oppqueens, empty)
>>>>                 & getLeftRookAttacks  (ownKing, empty);
>>>>
>>>>pinned[ownside][RIGHT] = pinnedRem[RIGHT] & ownPieces;
>>>>remChe[oppside][RIGHT] = pinnedRem[RIGHT] & oppPieces;
>>>>... all branchless stuff
>>>>
>>>
>>>Aha, that's nice. You even have the pinning direction.
>>>
>>>In fact you'd probably do:
>>>pinnedRem[RIGHT_LEFT] =
>>>   (getRightRookAttacks (opprooks|oppqueens, empty)
>>>  & getLeftRookAttacks  (ownKing, empty))
>>>|  (getLeftRookAttacks (opprooks|oppqueens, empty)
>>>  & getRightRookAttacks  (ownKing, empty))
>>>
>>>and ditto for UP_LEFT. I can't see a reason to keep RIGHT and LEFT separate.
>>>
>>
>>hmm... not sure, the disjoint directions are easier to handle to find the
>>sliding pinner later. The popularity is never greater one - simpler, branchless
>>coding.
>>
>>Of course in this LEFT_RIGHT case you can easily separate both direction wise
>>sets by masking with ownKBB-1 or ~(ownKbb-1). But extracting unified sets is not
>>often so simple, at least not so simple as to waste some memory keeping disjoint
>>sets and to unify by "or" on the fly if required.
>>
>
>Yes, that's true. I thought that the sliding pinner can be found just as easily
>for LEFT_RIGHT as for just LEFT, ie.
>
>Bitboard pinner = opp_rooks | getLeftRightRookAttacks (l_r_pinned_piece, empty)
>
>vs
>
>Bitboard pinner = opp_rooks | getRightRookAttacks (l_pinned_piece, empty)
>
>but actually getRightRookAttacks is cheaper than getLeftRightRookAttacks.
>
>It's also true that it's easy to | the masks - so maybe it's better to keep them
>as granular as possible.
>
>>
>>>>In case of a sliding check on this direction, pinnedRem[RIGHT] contains a set of
>>>>empty squares between checking piece and checked king, which also is usefull.
>>>>With fillalgos it is important to keep disjoint directions.
>>>>
>>>
>>>True.
>>>
>>>>In opposite to Steffan i like to keep separated attacks for each sliding piece
>>>>as well - therefore i'm playing with a quad-bitboard approach.
>>>>The idea is to pass a quad-bitboard to fill-routines and to distinguish between
>>>>attacks of 15 different pieces.
>>>
>>>Ok here you lost me :)
>>
>>hehe - i'm in the mood to explain it a bit,
>>also to reflect it by myself ;-)
>>
>>Imagine this stange position occurs during search:
>>
>>[D] 8/1R2rk2/8/3q4/8/8/r1R4K/6Q1 w - -
>>
>>After there is no early return in search or qsearch, due to repetition, hash-hit
>>or lazy leaf cut (depth <= 0 && lazyOrHashedEva >= beta ),
>>i need attack maps for pinned piece detection, eval and (legal) movegen
>>purposes.
>>
>>Two fills per direction ([RB]Q,K) for pinned piece detection.
>>Two additional fills (later for movegen) if we like to distinguish remove
>>checking moves.
>>
>>In eval it is nice to have all eight attack sets piece wise, to get trapped
>>pieces. Single fills per piece and direction. Like rotated, one has to traverse
>>all eight man, including queen-like meta-king attacks as part of a kingsafety
>>term and determing sliding check targets), but one may safe the bitscan and only
>>isolate least significant one-bit by (x & -x).
>>
>>The eval part may even be skipped as well, due to laziness or eval hash hit. The
>>relation of total nodes searched, lazy cut-nodes and full eval nodes is
>>interesting here.
>>
>>So conditionally eight fills per direction for eval.
>>
>>If no regulary eval leaf cut occurs, one has to traverse three white pieces for
>>sliding move generation. This might be implemented at once, or incrementally to
>>safe work in cut nodes.
>>
>>Up to more three fills per direction here.
>>
>>In best case there are two fills per dir with approach in mind, in worst case 8
>>or up to 2+8+2+3 = 15.
>>
>>Instead of passing 8 or even 15 bitboards to one Kogge-Stone-routine, processing
>>n-times the empty propagator, i consider one quadword as kogge-stone generator.
>>The eight pieces above and empty square are coded in four bits eg. in following
>>manner:
>>
>>0000 empty
>>0001 wR1
>>0010 wR2
>>0100 wQ
>>0111 wK
>>1001 bR1
>>1010 bR2
>>1100 bQ
>>1111 bK
>>
>>Those bits are arranged vertically in four bitboards,
>>the quad bitboard for straight sliders:
>>
>>     0123456789...........63
>>g0   00....0.1............0 // R1 or K
>>g1   00....0.0............0 // R2 or K
>>g2   00....1.0............0 // Q  or K
>>gb   00....0.1............0 // black
>>           | |
>>           | bK on a2
>>           wQ on g1
>>
>>Now one quad-bitboard kogge-stone fill routine looks like
>>
>>QuadBitBoard fillRightOccluded(QuadBitBoard g, BitBoard p)
>>{
>>  p    &= 0xfefefefefefefefe;
>>  g.g0 |= p & (g.g0 << 1);
>>  g.g1 |= p & (g.g1 << 1);
>>  g.g2 |= p & (g.g2 << 1);
>>  g.gb |= p & (g.gb << 1);
>>  p    &= (p << 1);
>>  g.g0 |= p & (g.g0 << 2);
>>  g.g1 |= p & (g.g1 << 2);
>>  g.g2 |= p & (g.g2 << 2);
>>  g.gb |= p & (g.gb << 2);
>>  p    &= (p << 2);
>>  g.g0 |= p & (g.g0 << 4);
>>  g.g1 |= p & (g.g1 << 4);
>>  g.g2 |= p & (g.g2 << 4);
>>  g.gb |= p & (g.gb << 4);
>>  return g;
>>}
>>
>>with some nice prospects of parallel execution, specially considering parallel
>>work of different directions in disjoint execution units like 64-bit gp and/or
>>mmx as well as SSE2.
>>
>>Disjoint piece attacks may easily be reconstrcuted on the fly with a few bitwise
>>operations.
>>
>>The result of those routines is probably stored in some kind of thread-global
>>memory and therefore with limited validity during a search routine.
>>
>>I even consider to do some parts of this stuff before probing the prefetched
>>main-tt to hide "huge" latencies and to have a better base to decide about lazy
>>eval and possible early mate or stalemate detection.
>>
>>I hope you got the idea - problem with more than two rooks or more than one
>>queen per side. Special handling required for those rare case, since there is no
>>unique target, source relationship anymore of the direction wise attacks.
>>
>>Diagonal "coding" is a bit easier since two bishops per side usually have
>>disjoint sets of squares.
>
>Ok, after quite some time :) I think I (somewhat) understand.
>
>IIUC - the "p" parameter to your quad bitboard routines truly must be the set of
>"empty" squares - no special eval tricks passing it (~pawns) are possible, since
>then the bit codes would start overwriting each other. Also, the g parameter
>truly must be the set of squares on which those pieces sit - you will no longer
>be able to nest calls to this routine, because again the bits will start
>overwriting each other.


You are right, all one bits of all four generator bitboards g must be not member
of empty board propagator. Therefore the intension of this routine is only to
get "pass one" attacks with none modified empty board propagator.


>
>For instance, no more of things like:
>
>quad-Bitboard two_move_attacks =
>fillRightOccluded(fillRightOccluded(majors_quad_bitboard, empty), empty)
>
>A few things are still a bit unclear:
>
>1) Why do you use 4 bits rather than 3 bits to encode the 8 piece combinations?

Because of the empty square code.

Only a binary "one" inside a generator bitboard "grows" inside a fillroutine in
the appropriate direction, if it becomes "zero" due to a blockading piece, the
fill process terminats on this ray.
Therefore the empty code of a quad bitboard must be 0000b.

Another reason is that i like to use SIMD bitboards, e.g. vectors of two
bitboards in 128-bit xmm registers.

The incomplete coding of 9/16 allows a rather efficient (i hope) synthesis and
analysis to/from a quad-bitboard:

synthesis before filling all straight directions:

kings  = wking   | bking;
rooks1 = wrook1  | brook1;
rooks2 = wrook2  | brook2;
queems = wqueens | bqueens;

g0 = wrook1  | kings;
g1 = wrooks2 | kings;
g2 = queens  | kings;
gb = bpieces;

analyse after filling:

AttacksOfkings  =  g0 &  g1;
AttacksOfrooks1 =  g0 & ~g1;
AttacksOfrooks2 = ~g0 &  g1;
AttacksOfqueens = ~g0 &  g2;

AttacksOfbking  = AttacksOfkings &  gb;
AttacksOfwking  = AttacksOfkings & ~gb;
...


>2) Do you really need a separate entry for white rook 1 and white rook 2? You
>could give both of them the same bit sequence - this would also get rid of the
>problems with >2 rooks or >1 queen.

I know, but there is one problem with multiple rooks, i like to avoid - to find
the source square, where a rook sits, from a given target square the rook
attacks, since more than one rook may share the same rank or file.

Steffan's solution is to do an additional fill in the opposite direction:

http://chessprogramming.org/cccsearch/ccc.php?art_id=266676

Therefore i like to get disjoint attacks for most common cases:

rook1  = rooks & -rooks;     // lsb
rooks2 = rooks & (rooks-1);  // most often popcount(rooks2) < 2

Only if popcount(rooks2) is greater one, i have to do it in Steffan's way.
Same for > 1 queen or > 1 equal colored bishops.

>
>Of course, then you won't know which rook attacks the given square. This is the
>part I don't really understand w.r.t. move generation - if you do
>FillRightOccluded (rooks, empty), you have target squares for rooks, but not
>which square is for which rook.

Ah yes, you gave the reason by yourself...

>
>Actually - IIUC - instead of a quad-bitboard you could use a six-bitboard
>combination, one bit for each of the six major pieces. Since there would be no
>overlap between the bits, you could nest the calls, etc, and maybe still get
>some parallel performance benefit.

That is of course possible, but register pressure becomes greater.
Currently the mentioned quad bitboard fill-routine works great with SSE2 and
eight xmm-registers without temporary store/load.

I currently use the ms 2005 Visual C++ beta compiler with SSE2 intrinsics and no
more inline assembly. I use wrapped bitboard pair classes with a common base and
a two conrete derived classes, one for SSE2 the other for gp-registers.

Polymorph Kogge-Stone fillers working with those wrapped vectors of bitboards
are void, but are working on dedicated structures (classes) via source and
target pointers. They have a class template <T> which can be GP or XMM.
So several directions may be implemented with different register sets, and the
ms compiler is able to nicely schedule interleaved instructions of different
inlined fill routines.

I'm not quite sure, what to keep finally in memory for reusing.
The quad-bitboard result for each of the eight directions should be temporary in
registers.

Maybe something like this:

              single directions         unified (ored) dirs
dir/piece(s)  ri rd do ld le lu up ru   hor ver di1 di2 str dia all nBBs
rook1         x     x     x     x       x   x           x       x   8
rooks2        x     x     x     x       x   x           x       x   8
allrooks      x     x     x     x       x   x           x       x   8
bishops          x     x     x     x            x   x       x   x   8
queens        x  x  x  x  x  x  x  x    x   x   x   x   x   x   x  15
metaking      x  x  x  x  x  x  x  x    x   x   x   x   x   x   x  15
                                                               sum 62 BBs

hmm... 62, let say 64 disjoint as well as unified bitboards for one color, that
makes 128 bitboards = 1024Byte = 1KByte = 16 cache lines (of 1024).

I guess the whole stuff only makes sense if that very often used memory is
"allways" in L1-data cache...

Gerd

>
>Vas
>
>>
>>
>>>
>>>But thanks for reminding me why I put bitboards into my program.
>>
>>No problem ;-)
>>
>>Cheers,
>>Gerd
>>
>>>
>>>Vas
>>>><snip>
Re: Bitboards, pre-computing moves, and blocked squares? Vasik Rajlich 12:13:51 12/03/04
- Re: Bitboards, pre-computing moves, and blocked squares? Gerd Isenberg 13:42:53 12/03/04
  - Re: Bitboards, pre-computing moves, and blocked squares? Vasik Rajlich 03:44:19 12/04/04
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.