Author: Gerd Isenberg
Date: 07:06:20 10/09/03
Some more about SSE2 instruction scheduling and out-of-order execution (OOE) on the P4:
I tried two routines in which all rook attacks are combined.
The first (rookAttacks1) runs the Kogge-Stone fills for left, up and down sequentially.
The second (rookAttacks2) uses the same instructions, but manually interlaced.
Both routines require all eight xmm registers on the P4.
All CDblBB constructors and operators are __forceinline.
In a dumb 10**9-iteration loop test on a 2.4 GHz P4, the interlaced rookAttacks2 took 38 seconds
and rookAttacks1 took 42 seconds.
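For readers who have not seen a Kogge-Stone fill, here is a minimal scalar sketch of the "up" direction on a single 64-bit board. It is not the CDblBB code from this post: the function name `upAttacks` and the mapping a1 = bit 0 (so one rank up is `<< 8`) are assumptions made for illustration.

```cpp
#include <cstdint>

// Scalar sketch of a Kogge-Stone "up" fill.
// Assumption: a1 = bit 0, so one rank up is a left shift by 8.
// rooks = generator set, empty = propagator set (empty squares).
std::uint64_t upAttacks(std::uint64_t rooks, std::uint64_t empty)
{
    std::uint64_t gen = rooks;
    std::uint64_t pro = empty;
    gen |= pro & (gen << 8);    // reach 1 rank further
    pro &= pro << 8;            // pro: this square and the one below are empty
    gen |= pro & (gen << 16);   // reach 2 more ranks
    pro &= pro << 16;           // pro: a run of 4 empty squares ends here
    gen |= pro & (gen << 32);   // reach 4 more ranks
    return gen << 8;            // attacked squares: one step beyond the fill,
                                // which includes the first blocker
}
```

With a rook on a1 and an otherwise empty board the result is the a-file from a2 to a8; with an extra piece on a4 the result stops at a4.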
Looking at the generated assembly of rookAttacks2, I found the following
instructions, listed with their P4 (AMD64) latencies, ignoring memory access and throughput:
count  instructions             latency P4 (AMD64)   total P4 (AMD64)
   45  movdqa                   6! (2)               270 (90)
   51  pand, por, pxor,         2                    102 (102)
       psrlq, psllq, psubb
   16  gp-instructions          0.5-8 (ret)          ~17 (~17)
_____________________________________________________________________
  112                                                389 (209)
So there are 112 instructions (468 bytes) with a sequential latency sum
of 389 cycles, or ~162 ns on a 2.4 GHz P4. The "best case" dumb loop test shows only
38 ns per call instead of 162 ns, which would be a factor of >4 gained from parallel
out-of-order execution. I'm not really sure about that.
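The arithmetic behind these numbers can be redone in a few lines. This is only a sketch of the estimate, using the instruction counts and latencies from the table; the function names are made up for illustration, and the ~17 cycles for the general-purpose instructions is the rough total given above.

```cpp
// Sum of latencies if every instruction waited for the previous one.
// movdqaLatency is 6 on the P4 and 2 on AMD64 according to the table.
double sequentialCycles(double movdqaLatency)
{
    const double movdqaCycles  = 45 * movdqaLatency; // 45 movdqa
    const double simdAluCycles = 51 * 2.0;           // pand, por, pxor, psrlq, psllq, psubb
    const double gpCycles      = 17.0;               // rough total for 16 gp-instructions
    return movdqaCycles + simdAluCycles + gpCycles;
}

// Cycles to nanoseconds at a given clock in GHz.
double cyclesToNs(double cycles, double clockGHz)
{
    return cycles / clockGHz;
}

// Ratio of the sequential estimate to the measured time per call.
double overlapFactor(double estimatedNs, double measuredNs)
{
    return estimatedNs / measuredNs;
}
```

For the P4, `sequentialCycles(6)` gives 389 cycles, `cyclesToNs(389, 2.4)` is about 162 ns, and against the measured 38 ns per call (38 seconds for 10**9 iterations) the factor is about 4.3. For AMD64, `sequentialCycles(2)` gives 209 cycles.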
Anyway, I'm a bit shocked by the huge movdqa latency on the P4; AMD64 needs only
two cycles. I'm very curious about the Opteron or Athlon64 timings of these routines.
Cheers,
Gerd
Here are the two routines:
struct sTarget
{
    CDblBB left;
    CDblBB right;
    CDblBB up;
    CDblBB down;
    CDblBB hor;
    CDblBB ver;
    CDblBB all;
};
void rookAttacks1(sTarget* pTarget, const sSource* pSource)
{
    CDblBB gl(&pSource->rooks);
    CDblBB pl(&pSource->occup);
    CDblBB gr = pl ^ (pl - gl - gl); // right attacks: o ^ (o - 2r)
    gr.store(&pTarget->right);
    pl = ~pl; // empty squares
    CDblBB gu(gl);
    CDblBB gd(gl);
    CDblBB pu(pl);
    CDblBB pd(pl);
    pl = pl.notH();
    // Kogge-Stone fill, left
    gl |= pl & (gl>>1);
    pl &= pl>>1;
    gl |= pl & (gl>>2);
    pl &= pl>>2;
    gl |= pl & (gl>>4);
    gl = (gl>>1).notH();
    gl.store(&pTarget->left);
    gr |= gl; // left | right
    gr.store(&pTarget->hor);
    // Kogge-Stone fill, up
    gu |= pu & (gu<<8);
    pu &= pu<<8;
    gu |= pu & (gu<<16);
    pu &= pu<<16;
    gu |= pu & (gu<<32);
    gu<<=8;
    gu.store(&pTarget->up);
    // Kogge-Stone fill, down
    gd |= pd & (gd>>8);
    pd &= pd>>8;
    gd |= pd & (gd>>16);
    pd &= pd>>16;
    gd |= pd & (gd>>32);
    gd>>=8;
    gd.store(&pTarget->down);
    gu |= gd; // up | down
    gu.store(&pTarget->ver);
    gr |= gu; // horizontal | vertical
    gr.store(&pTarget->all);
}
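The line `pl ^ (pl - gl - gl)` gets the right-hand attacks without any fill steps. Here is a minimal scalar sketch of that trick on a single rank. It assumes bit 0 is the a-file and "right" means toward the h-file; the name `rankRightAttacks` is made up for illustration. Since psubb appears in the instruction list above, the SSE2 version presumably does the same subtraction on all ranks at once, one byte per rank.

```cpp
#include <cstdint>

// One rank as 8 bits. Assumption: bit 0 = a-file, "right" = toward the h-file.
// rooks must be a subset of occ (the rooks themselves occupy squares).
std::uint8_t rankRightAttacks(std::uint8_t occ, std::uint8_t rooks)
{
    // Subtracting 2*rooks borrows through the empty squares to the right of
    // each rook and stops at the first occupied square. The xor with occ
    // keeps exactly the bits that the subtraction changed.
    return static_cast<std::uint8_t>(occ ^ (occ - 2 * rooks));
}
```

For a rook on the c-file with a blocker on the g-file (occ = 0x44, rooks = 0x04) this returns 0x78, the squares d to g including the blocker. Without a blocker (occ = 0x04) it returns 0xF8, the squares d to h.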
void rookAttacks2(sTarget* pTarget, const sSource* pSource)
{
    CDblBB gl(&pSource->rooks);
    CDblBB pl(&pSource->occup);
    CDblBB gr = pl ^ (pl - gl - gl); // right attacks: o ^ (o - 2r)
    pl = ~pl; // empty squares
    CDblBB gu(gl);
    CDblBB gd(gl);
    CDblBB pu(pl);
    CDblBB pd(pl);
    pl = pl.notH();
    // first fill step: up, down and left interlaced
    gu |= pu & (gu<<8);
    gd |= pd & (gd>>8);
    gl |= pl & (gl>>1);
    pu &= pu<<8;
    pd &= pd>>8;
    pl &= pl>>1;
    // second fill step
    gu |= pu & (gu<<16);
    gd |= pd & (gd>>16);
    gl |= pl & (gl>>2);
    pu &= pu<<16;
    pd &= pd>>16;
    pl &= pl>>2;
    // third fill step
    gu |= pu & (gu<<32);
    gl |= pl & (gl>>4);
    gd |= pd & (gd>>32);
    // final shift
    gu<<=8;
    gd>>=8;
    gl = (gl>>1).notH();
    gr.store(&pTarget->right);
    gu.store(&pTarget->up);
    gd.store(&pTarget->down);
    gl.store(&pTarget->left);
    gr |= gl; // left | right
    gu |= gd; // up | down
    gr.store(&pTarget->hor);
    gu.store(&pTarget->ver);
    gr |= gu; // horizontal | vertical
    gr.store(&pTarget->all);
}
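The interlacing in rookAttacks2 is legal because the up, down and left fills never read each other's registers, so their statements can be reordered freely. Here is a minimal scalar sketch of that idea with only the up and down fills on one 64-bit board. The names `UpDown`, `fillSequential` and `fillInterlaced` and the mapping a1 = bit 0 are assumptions for illustration. A compiler may well reorder scalar code like this by itself, so the sketch shows only that both orders give the same result, not any speed difference.

```cpp
#include <cstdint>

struct UpDown { std::uint64_t up; std::uint64_t down; };

// Sequential order: finish the up fill, then start the down fill.
UpDown fillSequential(std::uint64_t rooks, std::uint64_t empty)
{
    std::uint64_t gu = rooks, pu = empty;
    gu |= pu & (gu << 8);
    pu &= pu << 8;
    gu |= pu & (gu << 16);
    pu &= pu << 16;
    gu |= pu & (gu << 32);

    std::uint64_t gd = rooks, pd = empty;
    gd |= pd & (gd >> 8);
    pd &= pd >> 8;
    gd |= pd & (gd >> 16);
    pd &= pd >> 16;
    gd |= pd & (gd >> 32);

    return { gu << 8, gd >> 8 };
}

// Interlaced order: the two independent chains alternate step by step,
// so neighbouring statements do not depend on each other.
UpDown fillInterlaced(std::uint64_t rooks, std::uint64_t empty)
{
    std::uint64_t gu = rooks, pu = empty;
    std::uint64_t gd = rooks, pd = empty;
    gu |= pu & (gu << 8);
    gd |= pd & (gd >> 8);
    pu &= pu << 8;
    pd &= pd >> 8;
    gu |= pu & (gu << 16);
    gd |= pd & (gd >> 16);
    pu &= pu << 16;
    pd &= pd >> 16;
    gu |= pu & (gu << 32);
    gd |= pd & (gd >> 32);
    return { gu << 8, gd >> 8 };
}
```

For a rook on d4 (bit 27) on an otherwise empty board, both functions return d5 to d8 for up and d3 to d1 for down.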