Computer Chess Club Archives



Subject: Re: kogge-stone, rotated bitboards, "raytracing" - on 32 or 64-bit platf

Author: martin fierz

Date: 14:49:21 09/28/04



On September 27, 2004 at 17:27:45, Gerd Isenberg wrote:

>On September 25, 2004 at 06:21:13, martin fierz wrote:
>
>>aloha!
>
>oloha!
>
>a bit more specifically on your other questions.
>
><snip>
>>
>>here are my first 3 questions
>>
>>1) does anybody here have an idea how these 3 ways of generating attacks for
>>sliders compare speedwise for both 32-bit and 64-bit processors? (reason: if
>>kogge-stone will be comparable to rotated bitboards, i'd rather not bother to
>>implement rotated bitboards, and if both are not much better than my current
>>code, i'd also rather not bother with a rewrite).
>
>Comparing the pure attack getters for one sliding piece clearly favours
>rotated.
>
>Instead of 2*2*14 or 2*2*16 (2 directions, 2*32-bit) ALU instructions for
>Kogge-Stone, you have some rank and file fiddling to "shift, and" the rotated
>occupied board into the occupancy index, some address calculation, and one
>memory lookup. A drawback with 32-bit might be a function call for a variable
>64-bit shift.
>
>But on 32-bit architectures like x86-32, Kogge-Stone is too expensive due to
>stack traffic.
>
>Considering the huge caches, rotated is fastest on 64-bit too:
>the 64-bit shift is cheap there, so the whole thing is little more than a
>simple memory access.
>
>Incrementally updating three additional rotated bitboards or even rotated
>indices is not that expensive. A point in favour of fill algorithms is the use
>of bb & -bb instead of bitscanning - or, even better, processing all pieces of
>one side simultaneously. That requires keeping disjoint direction attacks in
>any case.
>
>Another point in favour of Kogge-Stone is the ability to process many
>directions or different disjoint sets in parallel - see my quad-bitboard
>Multi-Kogge-Stone approach.
>
>While rotated lookups, even from the L1 or L2 data cache, have a rather
>sequential nature, pipelined architectures with huge register files and
>several ALUs may process up to four or more pure register instructions in
>parallel.
>
>Even if it takes some once-allocated memory, keeping disjoint as well as
>combined attacks around for a while has some advantages. E.g. pinned-piece or
>checker-removal detection is much easier and branchless, unless sliding
>pieces are involved.
>
>The question is whether it is worth keeping direction-wise attacks even for
>disjoint pieces.
>
>>
>>2) in the code above, to make things more readable and compact i could write
>>functions like this (same for kogge-stone):
>>
>>int64 forwardattack(int64 org, int64 empty)
>>  {
>>  int64 attack = 0;
>>  int64 x = forward(org); // x traces a ray forward; forward is a macro:
>>                          // #define forward(x) ((x) << 8)
>>  while(x & empty)
>>    {
>>    attack |= x;
>>    x = forward(x);
>>    }
>>  return attack;
>>  }
>>
>>and then put together my rook attack function as
>>
>>for-all-rooks-do...
>>  attack = forwardattack(rooksquare, empty) | backwardattack(rooksquare, empty)
>>         | leftattack(rooksquare, empty) | rightattack(rooksquare, empty);
>>
>>this would be much cleaner than writing out the code every time for all
>>attackers.
>
>Yes, use small functions with __inline or __forceinline.
>The compiler (at least VC2005 beta) is able to schedule a rather independent
>instruction mix across the inlined calls.
>With my xmm-wrapper in AMD64 32-bit mode, this is actually fastest for rooks:
>80 cycles for two disjoint rook sets, with very nice assembly scheduling.
>(i guess in 64-bit mode, using GPRs three times, it's less than 30 cycles)
>
>void rookAttacks(sTarget* pTarget, const sSource* pSource)
>{
>	riAttacks<XMM>(pTarget, pSource);
>	leAttacks<XMM>(pTarget, pSource);
>	upAttacks<GPR>(pTarget, pSource);
>	dnAttacks<XMM>(pTarget, pSource);
>}
>
>template <class T> __forceinline
>void riAttacks(sTarget* pTarget, const sSource* pSource)
>{
>#if 1
>	T rook(pSource->rooks);
>	T occu(pSource->occup);
>	// subtraction trick: occ ^ (occ - 2*rook) = the east ray, blocker included
>	(occu ^ ((occu - rook) - rook)).store(&pTarget->ri);
>#else
>	T gr(&pSource->rooks);
>	T pr(&pSource->occup);
>	pr = (~pr).notA();
>	gr |= pr & (gr<<1);
>	pr &= pr<<1;
>	gr |= pr & (gr<<2);
>	pr &= pr<<2;
>	gr |= pr & (gr<<4);
>	(gr<<1).notA().store(&pTarget->ri);
>#endif
>}
>
>template <class T> __forceinline
>void leAttacks(sTarget* pTarget, const sSource* pSource)
>{
>	T gl(pSource->rooks);
>	T pl(pSource->occup);
>	pl = (~pl).notH();
>	gl |= pl & (gl>>1);
>	pl &= pl>>1;
>	gl |= pl & (gl>>2);
>	pl &= pl>>2;
>	gl |= pl & (gl>>4);
>	(gl>>1).notH().store(&pTarget->le);
>}
>
>template <class T> __forceinline
>void upAttacks(sTarget* pTarget, const sSource* pSource)
>{
>	T gu(pSource->rooks);
>	T pu(pSource->occup);
>	pu = ~pu;
>	gu |= pu & (gu<<8);
>	pu &= pu<<8;
>	gu |= pu & (gu<<16);
>	pu &= pu<<16;
>	gu |= pu & (gu<<32);
>	(gu<<8).store(&pTarget->up);
>}
>
>template <class T> __forceinline
>void dnAttacks(sTarget* pTarget, const sSource* pSource)
>{
>	T gd(pSource->rooks);
>	T pd(pSource->occup);
>	pd = ~pd;
>	gd |= pd & (gd>>8);
>	pd &= pd>>8;
>	gd |= pd & (gd>>16);
>	pd &= pd>>16;
>	gd |= pd & (gd>>32);
>	(gd>>8).store(&pTarget->dn);
>}
>
>
>>
>>question: would that be slower than writing it out? will the compiler inline
>>such functions anyway?
>
>Depends on the optimization settings, i guess, and on how often the functions
>are used. You may use explicit inline or __forceinline anyway.
>
>
>>
>>3) the ray-tracing method is probably slower, but one of the benefits is that
>>you can continue ray-tracing once you hit an occupied square, so you can get
>>pins or x-ray-attacks. i don't see how to get such attacks easily in rotated
>>bitboard or kogge-stone formalism. am i missing something?
>
>See other post.
>Happy cycle counting.
>
>Cheers,
>Gerd

hi gerd,

thanks for your answers. i have some more data on this now: i tested my
raytracer vs kogge-stone in a perft-like setting, in the following way:

tree(depth)
  {
  if depth == 0
     gen_attacks()
  else
     make all moves
       tree(depth-1)
  }

i did a tree(4) or tree(5) on 3 different positions: once in the starting
position, once in a typical midgame position, and once in a wide-open position
with few pawns and all sliders placed so that they have lots of squares to go.

in the start position, gen_attacks with raytracing was much faster than with
kogge-stone. in the "normal" midgame position it was still significantly faster.
in the wide-open position, kogge-stone was slightly faster.

i was surprised by this result, as i would have expected kogge-stone to do
fine. i think the answer must be that very often, my raytracing breaks out
right on the first step, while kogge-stone always has to fill all directions,
wasting lots of time e.g. on finding backward attacks of a rook on the first
rank, which don't exist. perhaps one could exclude slider attacks in obviously
bad directions from being generated at all, i.e. before generating backward
attacks of rooks, do an & with ranks 2-8. rooks have this tendency to live on
the edge, so it might speed things up a bit.
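
to make the idea concrete, here's a little sketch (my own code, just for
illustration - the mask name and this particular fill formulation are my
assumptions, not code from my engine): since a rook on the first rank
contributes nothing to the south fill anyway, mask first and skip the whole
fill when no rook qualifies:

```c
#include <stdint.h>

typedef uint64_t U64;

#define NOT_RANK_1 0xFFFFFFFFFFFFFF00ULL   /* ranks 2-8 */

/* standard occluded south fill: attacks downwards, first blocker included */
U64 south_attacks(U64 rooks, U64 empty)
{
    U64 gen = rooks, pro = empty;
    gen |= pro & (gen >> 8);
    pro &= pro >> 8;
    gen |= pro & (gen >> 16);
    pro &= pro >> 16;
    gen |= pro & (gen >> 32);
    return gen >> 8;   /* final shift drops the rooks themselves */
}

/* the pruned version: rooks on rank 1 can never attack south,
   so drop them first and skip the whole fill if nothing is left */
U64 south_attacks_pruned(U64 rooks, U64 empty)
{
    U64 candidates = rooks & NOT_RANK_1;
    return candidates ? south_attacks(candidates, empty) : 0;
}
```

both versions return the same set; the pruned one just avoids a handful of
ALU ops in the common rook-on-rank-1 case.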

i see that i must think a bit more in terms of kogge-stone or bitboard attacks
in general :-)
both russel's suggestion of a double-kogge-stone to find pinned pieces and your
suggestion of a reversed directed attack by opponent king with intersection are
nice.
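
if i understood the intersection idea right, for the vertical direction it
would look something like this (my own sketch with made-up names, north/south
only - the other directions work the same way):

```c
#include <stdint.h>

typedef uint64_t U64;

/* occluded fills, first blocker included in the attack set */
U64 north_fill(U64 gen, U64 empty)
{
    U64 pro = empty;
    gen |= pro & (gen << 8);  pro &= pro << 8;
    gen |= pro & (gen << 16); pro &= pro << 16;
    gen |= pro & (gen << 32);
    return gen << 8;
}

U64 south_fill(U64 gen, U64 empty)
{
    U64 pro = empty;
    gen |= pro & (gen >> 8);  pro &= pro >> 8;
    gen |= pro & (gen >> 16); pro &= pro >> 16;
    gen |= pro & (gen >> 32);
    return gen >> 8;
}

/* own pieces pinned on the file below an enemy rook/queen:
   fill south from the enemy sliders, north from the own king;
   both rays stop at their first blocker, so they only meet on a
   square holding the single piece in between - the pinned one */
U64 pinned_on_file(U64 enemyRooks, U64 ownKing, U64 ownPieces, U64 empty)
{
    return south_fill(enemyRooks, empty)
         & north_fill(ownKing, empty)
         & ownPieces;
}
```

with two pieces in between, the rays stop short of each other and the
intersection is empty; with none in between, it only contains empty squares -
so the & with ownPieces does the right thing in all cases.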

the main question for me is whether i should bother to rewrite my engine for
rotated BB, leave it as it is, or switch to your kogge-stone or dumbfill code
(thanks for sharing!), with the 64-bit future in mind. i'm afraid i still
don't know what to do :-)
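
to remind myself what the table-lookup side actually buys: for rank attacks
(which need no rotation at all) it is one shift, one and, and one lookup.
here's a sketch i put together - not actual code from anyone's engine, and
files/diagonals additionally need the rotated occupancy copies:

```c
#include <stdint.h>

typedef uint64_t U64;

/* rankAttack[f][occ6]: attacks of a rook on file f of some rank, given
   the 6-bit occupancy of the rank's six inner squares */
static uint8_t rankAttack[8][64];

void init_rank_attacks(void)
{
    for (int f = 0; f < 8; ++f)
        for (int occ6 = 0; occ6 < 64; ++occ6) {
            uint8_t occ = (uint8_t)(occ6 << 1);  /* inner bits -> bits 1..6 */
            uint8_t att = 0;
            for (int t = f + 1; t < 8; ++t) {    /* slide east */
                att |= (uint8_t)(1 << t);
                if (occ & (1 << t)) break;       /* stop at first blocker */
            }
            for (int t = f - 1; t >= 0; --t) {   /* slide west */
                att |= (uint8_t)(1 << t);
                if (occ & (1 << t)) break;
            }
            rankAttack[f][occ6] = att;
        }
}

/* rank attacks of a rook on square sq (0..63): shift, and, lookup */
U64 rook_rank_attacks(int sq, U64 occupied)
{
    int base = sq & 56;                              /* a-square of the rank */
    int occ6 = (int)((occupied >> (base + 1)) & 63); /* 6 inner squares */
    return (U64)rankAttack[sq & 7][occ6] << base;
}
```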

BTW, i also looked at a version of my raytracer which calls those small
diag-attack() and straight-attack() functions; it was significantly slower
than the version which writes everything out inline. i had my compiler set to
"inline functions: any suitable", but it wasn't inlining these small functions
- another surprise! with __forceinline it got clearly faster, but was still
slower than the written-out version.
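
for completeness, gcc has an equivalent of __forceinline, so one of those
small ray helpers could be wrapped portably like this (the macro spelling is
my assumption of the usual idiom, not tested against every compiler):

```c
#include <stdint.h>

typedef uint64_t U64;

/* __forceinline is MSVC; always_inline is the gcc/clang counterpart */
#if defined(_MSC_VER)
#define FORCE_INLINE static __forceinline
#else
#define FORCE_INLINE static inline __attribute__((always_inline))
#endif

FORCE_INLINE U64 forward(U64 b) { return b << 8; }

/* same ray loop as before: collects empty squares only,
   stopping just before the first occupied one */
FORCE_INLINE U64 forwardattack(U64 org, U64 empty)
{
    U64 attack = 0;
    U64 x = forward(org);
    while (x & empty) {
        attack |= x;
        x = forward(x);
    }
    return attack;
}
```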

cheers
  martin




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.