Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Question to Bob: Crafty , Alpha and FindBit()

Author: Vincent Diepeveen

Date: 05:24:55 06/11/98

Go up one level in this thread



On June 10, 1998 at 19:46:00, Robert Hyatt wrote:

>On June 10, 1998 at 16:56:58, Eugene Nalimov wrote:
>
>
>>Sorry Bob,
>>
>>But if you'll look in the other Intel manual - AP-526,
>>"Optimization for Intel's 32-Bit Pcocessors" (available
>>for download from Intel web site), section 2.4.4 (Branch
>>Target Buffer for Pentimu Pro Processor), you can read
>>
>>"The penalty for mispredicted branches is at least 9
>>cycles (the length of the In-Order Issue Pipeline) of
>>lost instruction fetch, plus additional time spent
>>waiting for the mispredicted branch instruction to
>>become the oldest instruction in the machine and retire.
>>This penalty is non-deterministic, dependant upon
>>execution circumstances, but experimentation shows that
>>this is nominally a total of 10-15 cycles".
>>
>>Also, if you remember, when Crafty still used assembly
>>code, and I removed branches in several routines
>>(including, BTW, FirstOne), Crafty become 1.5% faster.
>>Code size decreased very slightly, there was still cost
>>of function call (not at execution time - it greatly
>>complicated optimizer/register allocation work, with
>>only 7 registers available), so I think that main
>>speedup come from removing of mispredicted branches.
>>
>>Eugene
>>
>
>no argument from me there, because your comments don't affect what I
>told vincent...  branches will not slow his program 400%, ever.  25%
>was the absolute upper limit I would predict assuming his branches are
>so evenly taken/not-taken that prediction fails every time.  9 clocks
>is less than a cache-miss...

more like 5 to 6 times.

Same for other programs.

I generate knight moves in 1 clock cycle.

Branch prediction+partial register stall are 22 clocks if i'm unlucky.

First time it gets into that routine it predicts wrong. next 7 times it
predicts right.
then 8th time it predicts wrong for knight (assuming it has 8 squares).
Every generation i have a partial register stall.

The same trouble i have in my mobility evaluation, and in all other
evaluations.
if i get from the l1 cache something out of board[64] then this is
damned fast,
but the compare doesn't get predicted correctly, as it's a complex
evaluation.
hard to predict.

So branch prediction is the main problem. above that for Pro200 partial
register stalls also the problem, but minor compared to branch
prediction

You count it.

I have done lot of effort to get datastructure within L2 cache and
now move generation all together fits within 25kilobytes.

That's way less than 256kb cache, or the coming caches.
No cache mishit, and cache gets after program start filled within a
millisecond.

>1.5% I remember.. but remember, that is nowhere near the 400% faster
>figure Vincent mentioned, "if branch prediction was better".  That won't
>ever happen...



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.