Computer Chess Club Archives


Subject: Re: 64 bits

Author: Eugene Nalimov

Date: 22:39:11 06/20/02


On June 20, 2002 at 17:59:13, Robert Hyatt wrote:

>On June 20, 2002 at 16:50:46, Eugene Nalimov wrote:
>
>>Ok, here is why I don't expect the speedup to be more than ~10%. I already
>>gave this example here a year or two ago, but I am posting it again.
>>
>>Let's assume that we have a superscalar CPU with a (moderate) pipeline length of
>>12. Each cycle it can issue 2 instructions. Each memory reference takes 2 cycles
>>(data is mainly in the L1 cache), a register operation takes 1 cycle, and a
>>hard-to-predict branch takes 6 cycles on average (in half of the cases we get it
>>right, so there is no penalty; in the other half there is a 12-cycle penalty for
>>the mispredicted branch). The model is oversimplified, but it will give you an idea.
>>
>>In the source we have something like
>>    if (p->pawns & some_bit_mask)
>>        ...
>>On the 64-bit CPU we'll have
>>    ld  R1=[R2+offsetof(pawns)]
>>    and R3=R1,some_bit_mask
>>    cmp R3,0
>>    bne ...
>>Total time is 10 cycles.
>
>Right..  But shouldn't we depend on the compiler to "lift" another similar
>set of instructions from below, and interlace them so that they schedule in
>pairs or triples as well?
>
>I.e., the Cray compiler certainly does this well...

That depends. Sometimes the compiler can do that; more often it cannot. A Fortran
compiler can do it more often than a C compiler (even for integer code) because
Fortran is a better language than C for alias analysis.
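
To illustrate the aliasing point, here is a minimal C sketch. It is not taken from Crafty, and the function names and_mask and and_mask_restrict are made up for the example. In the first version the compiler has to assume that a store to dst[i] might change *mask, so it generally cannot keep *mask in a register across iterations or move later loads above earlier stores. The C99 restrict qualifier is a promise from the programmer that the pointers do not overlap, which is roughly what Fortran assumes for dummy arguments by default.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain pointers: dst, src and mask may overlap as far as the compiler
   knows, so *mask is conceptually reloaded after every store to dst[i]. */
void and_mask(uint64_t *dst, const uint64_t *src,
              const uint64_t *mask, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] & *mask;
}

/* restrict-qualified pointers: the caller promises no overlap, so the
   compiler is free to load *mask once and to reorder loads and stores. */
void and_mask_restrict(uint64_t *restrict dst, const uint64_t *restrict src,
                       const uint64_t *restrict mask, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] & *mask;
}
```

Both functions compute the same result when the arguments really do not overlap; the difference is only in how much freedom the scheduler gets.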

But even if the compiler can move some instructions up, there are unused issue
slots on 32-bit CPUs, too:
    ld  R1=[R2+offsetof(pawns)]
    ld  R3=[R2+offsetof(pawns)+4]
// 2 slots here
    and R4=R1,low_4_bytes(some_bit_mask)
    and R5=R3,high_4_bytes(some_bit_mask)
    or  R6=R4,R5
// 1 slot here
    cmp R6,0
// 1 slot here
    bne ...

Even if you have to add an extra cycle, the 32-bit version will still take 12
cycles, compared to 10 cycles for the 64-bit version. And that is for the bitboard
operations. There is a lot of code in Crafty's hot paths that will work like a charm
on 32-bit CPUs: function calls, operations on integers, etc. Trust me, I
know :-)
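
The cycle counts can be checked with a rough sketch of the simplified model from the quoted example. Everything here is an assumption made for illustration: the names struct insn, total_cycles, seq64 and seq32 are invented, issue is strictly in order with 2 slots per cycle, and the branch is treated as a single instruction with the model's average cost of 6 cycles.

```c
#include <stddef.h>

enum { ISSUE_WIDTH = 2, NO_DEP = -1, MAX_INSNS = 16 };

struct insn {
    int latency;  /* cycles until the result is available */
    int dep[2];   /* indices of earlier producer instructions, or NO_DEP */
};

/* In-order dual issue: returns the cycle at which the last result is ready. */
int total_cycles(const struct insn *code, size_t n)
{
    int done[MAX_INSNS];      /* completion cycle of each instruction */
    int cycle = 0, used = 0;  /* current issue cycle, slots used in it */
    int total = 0;

    for (size_t i = 0; i < n && i < MAX_INSNS; i++) {
        int ready = 0;        /* cycle at which all operands are available */
        for (int d = 0; d < 2; d++) {
            int p = code[i].dep[d];
            if (p != NO_DEP && done[p] > ready)
                ready = done[p];
        }
        /* Start a new issue cycle if operands are late or slots are full. */
        if (ready > cycle || used == ISSUE_WIDTH) {
            cycle = (ready > cycle) ? ready : cycle + 1;
            used = 0;
        }
        used++;
        done[i] = cycle + code[i].latency;
        if (done[i] > total)
            total = done[i];
    }
    return total;
}

static const struct insn seq64[] = {
    {2, {NO_DEP, NO_DEP}},  /* ld  R1=[R2+offsetof(pawns)] */
    {1, {0, NO_DEP}},       /* and R3=R1,some_bit_mask */
    {1, {1, NO_DEP}},       /* cmp R3,0 */
    {6, {2, NO_DEP}},       /* bne ... */
};

static const struct insn seq32[] = {
    {2, {NO_DEP, NO_DEP}},  /* ld  R1=[R2+offsetof(pawns)] */
    {2, {NO_DEP, NO_DEP}},  /* ld  R3=[R2+offsetof(pawns)+4] */
    {1, {0, NO_DEP}},       /* and R4=R1,low_4_bytes(some_bit_mask) */
    {1, {1, NO_DEP}},       /* and R5=R3,high_4_bytes(some_bit_mask) */
    {1, {2, 3}},            /* or  R6=R4,R5 */
    {1, {4, NO_DEP}},       /* cmp R6,0 */
    {6, {5, NO_DEP}},       /* bne ... */
};
```

Under these assumptions total_cycles(seq64, 4) comes to 10 and total_cycles(seq32, 7) comes to 11, matching the counts in the quoted example; one extra cycle on the 32-bit side gives the 12 mentioned above.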

And each hash table probe (main memory access) costs several hundred CPU cycles
on both 32-bit and 64-bit systems.
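
To put the per-sequence numbers in perspective, the usual Amdahl's-law estimate can be sketched as below. The function name overall_speedup and any fractions plugged into it are illustrative assumptions, not measurements of Crafty.

```c
/* Amdahl's law: overall speedup when a fraction f of the run time
   (0 <= f <= 1) becomes s times faster and the rest is unchanged. */
double overall_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

For instance, overall_speedup(0.5, 2.0) is about 1.33. Time spent waiting on main memory for hash probes belongs to the unchanged (1 - f) part, so it pulls the overall figure down on any word size.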

Eugene

>>
>>On the 32-bit CPU we'll have
>>    ld  R1=[R2+offsetof(pawns)]
>>    ld  R3=[R2+offsetof(pawns)+4]
>>    and R4=R1,low_4_bytes(some_bit_mask)
>>    and R5=R3,high_4_bytes(some_bit_mask)
>>    or  R6=R4,R5
>>    cmp R6,0
>>    bne ...
>>Total time is 11 cycles.
>>
>>And there are surprisingly many places in the code where even a
>>bitboard-oriented program works with "native length" data: all operations with
>>alpha, beta, score, loop counters, etc.
>>
>>Eugene
>
>I don't disagree there to a point.  But in places like movegen, evaluate,
>make/unmake, the majority of the things being fiddled with are 64 bit
>values.  That is why I have never tried to imply that a 64 bit machine would
>be twice as fast, period.  But it should be twice as fast on the parts of the
>engine that are really beating on 64 bit values.  Such as the above ones...
>
>Even the Cray had both 32 and 64 bit registers so that you could be doing
>32 bit stuff (mainly address/index calculations) while the 64 bit registers
>were doing 64 bit stuff.
>
>>
>>On June 20, 2002 at 16:02:04, Robert Hyatt wrote:
>>
>>>On June 20, 2002 at 15:58:29, Tom Kerrigan wrote:
>>>
>>>>On June 20, 2002 at 15:25:36, Sune Fischer wrote:
>>>>
>>>>>So what you are saying is that you can't just count the number of operations and
>>>>>use that to predict the speed?
>>>>
>>>>Counting the operations is difficult. You can't just go through the source code
>>>>and count them because that doesn't tell you how often the code is run.
>>>>
>>>>>Now if Crafty is 50% 64 bit operations then we can expect a factor of two in
>>>>>speedup on that 50%, right?
>>>>
>>>>I think Crafty must be << 50% 64 bit operations. Think about all the data that
>>>>Crafty must operate on that isn't bitboards...
>>>>
>>>>-Tom
>>>
>>>
>>>Which data is that?  The top 80% in profiling depends on bitboards heavily:
>>>generating moves, evaluating positions, updating the bitmaps in make/unmake,
>>>detecting checks, evaluating Swap().
>>>
>>>I doubt it is << 50% (where << typically means "much less than").




Last modified: Thu, 15 Apr 21 08:11:13 -0700
