Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: 0x88 and move generator speed

Author: Dan Newman

Date: 01:16:21 02/03/01

Go up one level in this thread


On February 02, 2001 at 22:43:16, Vincent Diepeveen wrote:

>On February 02, 2001 at 02:57:58, Dan Newman wrote:
>
>>On February 01, 2001 at 21:24:13, Vincent Diepeveen wrote:
>>
>>>On February 01, 2001 at 07:12:07, Dan Newman wrote:
>>>
>>>>On February 01, 2001 at 03:36:08, David Rasmussen wrote:
>>>>
>>>>>On January 31, 2001 at 07:15:59, Dan Newman wrote:
>>>>>
>>>>>>On January 30, 2001 at 06:26:17, David Rasmussen wrote:
>>>>>>
>>>>>>It just means I don't use rotated bitboards.  I have a bitboard for each
>>>>>>different piece type (12 bitboards) + occupied square bitboards for each
>>>>>>color (2) + an occupied square bitboard (1).  This means make() and undo()
>>>>>>are a bit cheaper since I don't have to update rotated bitboards.  Also,
>>>>>>there are a few large data structures (arrays of bitboards) that aren't
>>>>>>needed as well--so fewer memory hits.
>>>>>>
>>>>>>I treat the non-sliding pieces more or less like any other bitboard program,
>>>>>>but for sliding pieces I occasionally resort to scanning the board.  I guess
>>>>>>it might be thought of as a sort of hybrid of bitboard and mailbox.
>>>>>>
>>>>>>I suspect (but don't have any hard data) that this is cheaper than doing
>>>>>>rotated bitboards, and it's also much simpler to implement.
>>>>>>
>>>>>>OTOH, I think Bas (Hamstra) may have switched to rotated bitboards and
>>>>>>found them to be faster...
>>>>>>
>>>>>>-Dan.
>>>>>
>>>>>OK, so essentially you don't have any smart way of calculating the attackboard
>>>>>of sliding pieces??
>>>>>
>>>>>I mean, the normal method of extracting the state of the file, rank or >diagonal,
>>>>>and using this to quickly calculate an attackboard, cannot be used in a
>>>>>non-rotated bitboard design.
>>>>>
>>>>>Isn't this just worse in all cases compared to rotated bitboards (except for >the
>>>>>simpler design maybe)?
>>>>
>>>>I calculate a "pseudo-attack" bitboard for sliders.  (That's a bitboard
>>>>that has a bit turned on for each enemy piece on which the slider is
>>>>bearing, ignoring any blocking pieces.)  Then I start extracting the
>>>>coordinates of the "victims".  For each such coordinate I test to see if
>>>>any pieces are blocking the attack using a blocking-mask bitboard which
>>>>has all the bits turned on for squares that lie between the attacker and
>>>>victim.  If no blockers are there, we have a real victim.
>>>>
>>>>I don't think this differs much in cost from rotated bitboards.  In
>>>>some cases it's probably faster (sparse board), in others it's probably
>>>>slower than rotated bitboards.  Even if rotated bitboards happen to be a
>>>>little faster generating captures, they will certainly slow down make()
>>>>and undo().
>>>>
>>>>I don't think that it's really easy to figure out which is better, either by
>>>>reason or by experiment.  You can't just count instructions executed--there
>>>>are memory/cache effects to be considered.  Also, the quality of the
>>>>implementations will vary, and that is likely to have an even bigger effect.
>>>>So I might add rotated bitboards and find it slows me down--only because of
>>>>poor design.
>>>>
>>>>-Dan.
>>>
>>>Could you post please your LastOne() function
>>>in assembly or if it is in C and also write down what
>>>processor you use?
>>>
>>
>>I'm currently (mostly) using a P3 clocked at 980 MHz.
>>
>>>Of course only if it's not the same as crafty, if it is
>>>the same i don't see how you can ignore the macro that
>>>eats most system time of the entire generation!
>>
>>I haven't attempted to be symmetrical, so I just have one function
>>called next_one() which uses BSF and is a member function of my
>>bitboard class.  This function not only extracts the bit index but
>>also clears it.
>>
>>//
>>// This function gives access to the BSF
>>// (bit scan forward) opcode and is used
>>// to extract bit indices from bitboards.
>>//
>>#ifdef _MSC_VER
>>#pragma warning( push)
>>#pragma warning( disable: 4035)
>>inline int bsf( uint32 bitpat)
>>{
>>    __asm bsf eax, dword ptr bitpat
>>}
>>#pragma warning( pop)
>>
>>
>>    //
>>    // Bit index extraction.  Member function of bitboard_t.
>>    //
>>    int next_one() {
>>        if( lower ) {
>>            int index = bsf( lower);
>>            lower ^= (1ul << index);
>>            return index;
>>        } else {  // Not testing upper since we won't run this on empty...
>>            int index = bsf( upper);
>>            upper ^= (1ul << index);
>>            return index + 32;
>>        }
>>    }
>>
>>The bsf() function has the only asm used in my program--just one line.
>>I've tried coding more of next_one() in assembly, but I've only managed
>>to slow the program down...
>>
>>"upper" and "lower" are just the two 32-bit halves of the bitboard.
>>
>>-Dan.
>
>Ah yes, this is not so bad except for one real slow branch which
>might get mispredicted!
>
>Perhaps start to rewrite that one a bit too till there is not
>a single 'if then else' left?
>
>Because on average you can easily use 15 extra instructions to
>get rid of it!
>
>Even more on the P4 btw, except those shift instructions,
>to my amazement they were slower on the P4, but that might change
>next release of P4...

I can't remember if I actually tried eliminating that branch or not
since it causes a little code bloat, but I did think about it.  If I
tried it, then it probably slowed me down a bit, else I would have it
in my code now...

The idea I had is that once the lower 32 bits is exhausted of one-bits,
then you really don't need to test it anymore.  All the code that uses
next_one() would be duplicated so that we could use a next_lower_one()
and a next_upper_one() function.

Instead of

    while( victims.exist() )
        int to = victims.next_one();
        ....
    }

I'd use

    while( victims.lower_exist() ) {
        int to = victims.next_lower_one();
        ....
    }
    while( victims.upper_exist() ) {
        int to = victims.next_upper_one();
        ....
    }

This would double the size of movegen(), so it might lose all of the
gain by increasing its cache footprint--not that it's really huge...

-Dan.



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.