Author: Gerd Isenberg
Date: 14:27:45 09/27/04
On September 25, 2004 at 06:21:13, martin fierz wrote:
>aloha!
oloha!
a bit more specifically on your other questions.
<snip>
>
>here are my first 3 questions
>
>1) does anybody here have an idea how these 3 ways of generating attacks for
>sliders compare speedwise for both 32-bit and 64-bit processors? (reason: if
>kogge-stone will be comparable to rotated bitboards, i'd rather not bother to
>implement rotated bitboards, and if both are not much better than my current
>code, i'd also rather not bother with a rewrite).
Comparing the pure attack getters for one sliding piece
clearly favours rotated.
Instead of 2*2*14 or 2*2*16 (2 directions, 2*32-bit) alu-instructions for
Kogge-Stone, you have some rank and file fiddling to "shift, and" the rotated
occupied board to get the occupancy index, some address calculation and one
memory lookup. A drawback with 32-bit might be a function call for the
variable 64-bit shift - but on 32-bit architectures like x86-32, Kogge-Stone
is too expensive anyway due to stack traffic.
Considering the huge caches, rotated is fastest on 64-bit too:
the 64-bit shift is cheap there, so it boils down to a simple memory access.
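For illustration only - a minimal scalar sketch of that "shift, and, lookup" scheme for rank attacks. The table layout, the 6-bit inner-occupancy index and all names here are my assumptions (a1 = bit 0), not code from this thread:

```cpp
#include <cassert>
#include <cstdint>

// rankAttackTable[file][occ6]: attacks of a rook on 'file' within one rank,
// indexed by the 6-bit inner occupancy (the two edge squares never matter).
static uint8_t rankAttackTable[8][64];

void initRankAttacks() {
    for (int f = 0; f < 8; ++f)
        for (int occ6 = 0; occ6 < 64; ++occ6) {
            int occ = occ6 << 1;   // expand to the 8-bit rank occupancy
            uint8_t att = 0;
            for (int t = f + 1; t < 8; ++t)  // ray east, stop at first blocker
                { att |= 1 << t; if (occ & (1 << t)) break; }
            for (int t = f - 1; t >= 0; --t) // ray west, stop at first blocker
                { att |= 1 << t; if (occ & (1 << t)) break; }
            rankAttackTable[f][occ6] = att;
        }
}

// the attack getter itself: one "shift, and" to form the index, one lookup
uint64_t rookRankAttacks(uint64_t occ, int sq) {
    int rankBase = sq & ~7;                   // first square of sq's rank
    int occ6 = (occ >> (rankBase + 1)) & 63;  // extract the inner occupancy
    return uint64_t(rankAttackTable[sq & 7][occ6]) << rankBase;
}
```

Rotated bitboards do the same for files and diagonals, by keeping extra occupancy boards rotated so that the relevant line maps to consecutive bits.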
Incrementally updating three additional rotated bitboards, or even rotated
indices, is not that expensive. A pro for fill algorithms is the use of
bb & -bb instead of bitscanning - or, even better, processing all pieces of
one side simultaneously. That requires keeping disjoint direction-wise
attacks in any case.
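The bb & -bb serialization can be sketched like this (function name and interface are mine, assuming uint64_t boards):

```cpp
#include <cassert>
#include <cstdint>

// Serialize a piece set as single-bit boards using bb & -bb instead of a
// bitscan: each isolated one-piece board can feed a fill routine directly,
// no square index is ever needed.
int serialize(uint64_t bb, uint64_t* out, int maxOut) {
    int n = 0;
    while (bb && n < maxOut) {
        uint64_t single = bb & (0 - bb); // isolate the least significant one bit
        out[n++] = single;               // a one-piece bitboard
        bb ^= single;                    // clear that bit and continue
    }
    return n;
}
```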
Another point in favour of Kogge-Stone is the ability to process many
directions or different disjoint sets in parallel, see my quad-bitboard
Multi-Kogge-Stone approach.
While rotated lookups, even from the level-one or level-two data cache, have
a more sequential nature, pipelined architectures with huge register files
and several alus may process up to four or more pure register instructions
in parallel.
Even if it takes some once-allocated memory, keeping disjoint as well as
combined attacks around for a while has some advantages. E.g. detecting
pinned pieces or removing a checker becomes much easier and, unless sliding
pieces are involved, branchless.
The question is whether it is worth keeping direction-wise attacks even for
disjoint piece sets.
>
>2) in the code above, to make things more readable and compact i could write
>functions like this (same for kogge-stone):
>
>forwardattack(org, empty)
> {
> int64 attack;
> x = forward(org); // x traces a ray forward; forward(x) is a macro
> // forward(x) <--> (x = x<<8);
> while(x & empty)
> {
> attack |= x;
> x = forward(x);
> }
> return attack;
> }
>
>and then put together my rook attack function as
>
>for-all-rooks-do...
> attack =
>forwardattack(rooksquare)|backwardattack(rooksquare)|leftattack(rooksquare)|rightattack(rooksquare);
>
>this would be much much cleaner instead of writing out the code all the time for
>all attackers.
Yes, use small functions with __inline or __forceinline.
The compiler (at least VC2005 beta) is able to schedule a rather independent
instruction mix across the inlined functions.
With my xmm-wrapper in AMD64 32-bit mode, this is actually fastest for rooks:
80 cycles for two disjoint rook sets, with very nice assembly scheduling
(i guess in 64-bit mode it's less than 30 cycles, using GPRs three times):
void rookAttacks(sTarget* pTarget, const sSource* pSource)
{
   riAttacks<XMM>(pTarget, pSource);
   leAttacks<XMM>(pTarget, pSource);
   upAttacks<GPR>(pTarget, pSource);
   dnAttacks<XMM>(pTarget, pSource);
}

template <class T> __forceinline
void riAttacks(sTarget* pTarget, const sSource* pSource)
{
#if 1
   // east attacks via the o ^ (o - 2r) subtraction trick
   T rook(pSource->rooks);
   T occu(pSource->occup);
   (occu ^ ((occu - rook) - rook)).store(&pTarget->ri);
#else
   // east attacks via parallel-prefix occluded fill
   T gr(&pSource->rooks);
   T pr(&pSource->occup);
   pr = (~pr).notA();       // empty squares, a-file cleared to stop wraps
   gr |= pr & (gr<<1);
   pr &= pr<<1;
   gr |= pr & (gr<<2);
   pr &= pr<<2;
   gr |= pr & (gr<<4);
   (gr<<1).notA().store(&pTarget->ri);
#endif
}

template <class T> __forceinline
void leAttacks(sTarget* pTarget, const sSource* pSource)
{
   // west attacks via parallel-prefix occluded fill
   T gl(pSource->rooks);
   T pl(pSource->occup);
   pl = (~pl).notH();       // empty squares, h-file cleared to stop wraps
   gl |= pl & (gl>>1);
   pl &= pl>>1;
   gl |= pl & (gl>>2);
   pl &= pl>>2;
   gl |= pl & (gl>>4);
   (gl>>1).notH().store(&pTarget->le);
}

template <class T> __forceinline
void upAttacks(sTarget* pTarget, const sSource* pSource)
{
   // north attacks; vertical shifts cannot wrap, no file mask needed
   T gu(pSource->rooks);
   T pu(pSource->occup);
   pu = ~pu;                // empty squares propagate the fill
   gu |= pu & (gu<<8);
   pu &= pu<<8;
   gu |= pu & (gu<<16);
   pu &= pu<<16;
   gu |= pu & (gu<<32);
   (gu<<8).store(&pTarget->up);
}

template <class T> __forceinline
void dnAttacks(sTarget* pTarget, const sSource* pSource)
{
   // south attacks; mirror of upAttacks
   T gd(pSource->rooks);
   T pd(pSource->occup);
   pd = ~pd;
   gd |= pd & (gd>>8);
   pd &= pd>>8;
   gd |= pd & (gd>>16);
   pd &= pd>>16;
   gd |= pd & (gd>>32);
   (gd>>8).store(&pTarget->dn);
}
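For reference, the same up-fill written against plain uint64_t instead of the wrapper class - a sketch with my naming, assuming a1 = bit 0, a2 = bit 8:

```cpp
#include <cassert>
#include <cstdint>

// North attacks of a whole rook set at once via occluded fill:
// three rounds propagate the fill over runs of 1, 2 and 4 empty ranks,
// the final shift turns the filled set into attacks (blocker included).
uint64_t upAttacks64(uint64_t rooks, uint64_t occup) {
    uint64_t g = rooks;
    uint64_t p = ~occup;    // empty squares propagate the fill
    g |= p & (g << 8);
    p &= p << 8;
    g |= p & (g << 16);
    p &= p << 16;
    g |= p & (g << 32);
    return g << 8;          // shift once more: the attacked squares
}
```

Vertical shifts fall off the board's edge rather than wrapping, which is why this direction needs no file mask.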
>
>question: would that be slower than writing it out? will the compiler inline
>such functions anyway?
Depends on the optimization settings i guess, and how often the functions
are used. You may use explicit inline or __forceinline anyway.
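For what it's worth, the quoted forwardattack sketch compiles and runs once its variables are declared and zero-initialized - assuming the quoted int64 is an unsigned 64-bit type and a1 = bit 0. Note it collects only the empty squares of the ray; the blocker square itself is not included:

```cpp
#include <cassert>
#include <cstdint>

typedef uint64_t int64;  // assumption: unsigned 64-bit

// forward(x) <--> (x << 8), one step up the board
static inline int64 forward(int64 x) { return x << 8; }

int64 forwardattack(int64 org, int64 empty) {
    int64 attack = 0;        // must start empty
    int64 x = forward(org);  // x traces a ray forward
    while (x & empty) {      // stop on the first occupied square
        attack |= x;
        x = forward(x);
    }
    return attack;
}
```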
>
>3) the ray-tracing method is probably slower, but one of the benefits is that
>you can continue ray-tracing once you hit an occupied square, so you can get
>pins or x-ray-attacks. i don't see how to get such attacks easily in rotated
>bitboard or kogge-stone formalism. am i missing something?
See other post.
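In short, the usual occupancy-removal trick - a sketch with my own names, assuming uint64_t boards with a1 = bit 0: compute the attacks once, clear the first blockers from the occupancy, recompute, and take the difference:

```cpp
#include <cassert>
#include <cstdint>

// A minimal ray-traced rook attack getter for the sketch below.
uint64_t rookAttacks(uint64_t occ, int sq) {
    uint64_t att = 0;
    const int dir[4] = {8, -8, 1, -1};
    for (int d = 0; d < 4; ++d) {
        int s = sq;
        for (;;) {
            int file = s & 7;
            if (dir[d] ==  1 && file == 7) break;  // would wrap off the h-file
            if (dir[d] == -1 && file == 0) break;  // would wrap off the a-file
            s += dir[d];
            if (s < 0 || s > 63) break;            // fell off the board
            att |= 1ULL << s;
            if (occ & (1ULL << s)) break;          // stop at the first blocker
        }
    }
    return att;
}

// X-ray attacks: what the rook would attack *through* the first blockers.
// Pins drop out as a set operation: if the x-ray hits the enemy king, the
// blocker between them is pinned.
uint64_t xrayRookAttacks(uint64_t occ, uint64_t blockers, int sq) {
    uint64_t attacks = rookAttacks(occ, sq);
    blockers &= attacks;                       // only the first blockers matter
    return attacks ^ rookAttacks(occ ^ blockers, sq);
}
```

The same works with rotated lookups or Kogge-Stone as the underlying getter, since only the occupancy changes between the two calls.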
Happy cycle counting.
Cheers,
Gerd
>
>cheers
> martin
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.