Author: Vincent Diepeveen
Date: 12:01:05 02/04/01
On February 03, 2001 at 04:16:21, Dan Newman wrote:
>On February 02, 2001 at 22:43:16, Vincent Diepeveen wrote:
>
>>On February 02, 2001 at 02:57:58, Dan Newman wrote:
>>
>>>On February 01, 2001 at 21:24:13, Vincent Diepeveen wrote:
>>>
>>>>On February 01, 2001 at 07:12:07, Dan Newman wrote:
>>>>
>>>>>On February 01, 2001 at 03:36:08, David Rasmussen wrote:
>>>>>
>>>>>>On January 31, 2001 at 07:15:59, Dan Newman wrote:
>>>>>>
>>>>>>>On January 30, 2001 at 06:26:17, David Rasmussen wrote:
>>>>>>>
>>>>>>>It just means I don't use rotated bitboards. I have a bitboard for each
>>>>>>>different piece type (12 bitboards) + occupied square bitboards for each
>>>>>>>color (2) + an occupied square bitboard (1). This means make() and undo()
>>>>>>>are a bit cheaper since I don't have to update rotated bitboards. Also,
>>>>>>>there are a few large data structures (arrays of bitboards) that aren't
>>>>>>>needed as well--so fewer memory hits.
>>>>>>>
>>>>>>>I treat the non-sliding pieces more or less like any other bitboard program,
>>>>>>>but for sliding pieces I occasionally resort to scanning the board. I guess
>>>>>>>it might be thought of as a sort of hybrid of bitboard and mailbox.
>>>>>>>
>>>>>>>I suspect (but don't have any hard data) that this is cheaper than doing
>>>>>>>rotated bitboards, and it's also much simpler to implement.
>>>>>>>
>>>>>>>OTOH, I think Bas (Hamstra) may have switched to rotated bitboards and
>>>>>>>found them to be faster...
>>>>>>>
>>>>>>>-Dan.
>>>>>>
>>>>>>OK, so essentially you don't have any smart way of calculating the attackboard
>>>>>>of sliding pieces??
>>>>>>
>>>>>>I mean, the normal method of extracting the state of the file, rank or diagonal,
>>>>>>and using this to quickly calculate an attackboard, cannot be used in a
>>>>>>non-rotated bitboard design.
>>>>>>
>>>>>>Isn't this just worse in all cases compared to rotated bitboards (except for the
>>>>>>simpler design maybe)?
>>>>>
>>>>>I calculate a "pseudo-attack" bitboard for sliders. (That's a bitboard
>>>>>that has a bit turned on for each enemy piece on which the slider is
>>>>>bearing, ignoring any blocking pieces.) Then I start extracting the
>>>>>coordinates of the "victims". For each such coordinate I test to see if
>>>>>any pieces are blocking the attack using a blocking-mask bitboard which
>>>>>has all the bits turned on for squares that lie between the attacker and
>>>>>victim. If no blockers are there, we have a real victim.
>>>>>
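The scan Dan describes above can be sketched roughly like this (a sketch only, not his actual code: the square numbering a1=0..h8=63, the table name `between`, and the helper names are all assumptions, and the pseudo-attack set is simply passed in, since he doesn't say how he builds it):

```cpp
#include <cstdint>

typedef uint64_t bitboard;  // bit 0 = a1, bit 63 = h8 (assumed numbering)

// between[from][to]: bits set on the squares strictly between the two
// squares when they share a rank, file, or diagonal; 0 otherwise.
static bitboard between[64][64];

static void init_between() {
    for (int from = 0; from < 64; ++from)
        for (int to = 0; to < 64; ++to) {
            between[from][to] = 0;
            if (from == to) continue;
            int dr = to / 8 - from / 8, df = to % 8 - from % 8;
            // Skip square pairs not on a common rank, file, or diagonal.
            if (dr != 0 && df != 0 && dr != df && dr != -df) continue;
            int sr = (dr > 0) - (dr < 0), sf = (df > 0) - (df < 0);
            for (int r = from / 8 + sr, f = from % 8 + sf;
                 r != to / 8 || f != to % 8; r += sr, f += sf)
                between[from][to] |= bitboard(1) << (r * 8 + f);
        }
}

// Walk the pseudo-attack set (enemy pieces the slider bears on, blockers
// ignored) and keep a victim only when no occupied square lies between
// attacker and victim.
static int slider_captures(int from, bitboard pseudo, bitboard occupied,
                           int to_list[]) {
    int n = 0;
    while (pseudo) {
        int to = __builtin_ctzll(pseudo);  // bit scan forward (GCC builtin)
        pseudo &= pseudo - 1;              // clear the extracted bit
        if ((between[from][to] & occupied) == 0)
            to_list[n++] = to;             // nothing in the way: real victim
    }
    return n;
}
```

With a rook on a1, enemy pieces on a8 and h1, and an own pawn on a4, the scan rejects a8 (the pawn sits in the between-mask) and keeps h1.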
>>>>>I don't think this differs much in cost from rotated bitboards. In
>>>>>some cases it's probably faster (sparse board), in others it's probably
>>>>>slower than rotated bitboards. Even if rotated bitboards happen to be a
>>>>>little faster generating captures, they will certainly slow down make()
>>>>>and undo().
>>>>>
>>>>>I don't think that it's really easy to figure out which is better, either by
>>>>>reason or by experiment. You can't just count instructions executed--there
>>>>>are memory/cache effects to be considered. Also, the quality of the
>>>>>implementations will vary, and that is likely to have an even bigger effect.
>>>>>So I might add rotated bitboards and find it slows me down--only because of
>>>>>poor design.
>>>>>
>>>>>-Dan.
>>>>
>>>>Could you please post your LastOne() function,
>>>>in assembly or in C, and also write down what
>>>>processor you use?
>>>>
>>>
>>>I'm currently (mostly) using a P3 clocked at 980 MHz.
>>>
>>>>Of course only if it's not the same as Crafty's; if it is
>>>>the same, I don't see how you can ignore the macro that
>>>>eats most of the system time of the entire generation!
>>>
>>>I haven't attempted to be symmetrical, so I just have one function
>>>called next_one() which uses BSF and is a member function of my
>>>bitboard class. This function not only extracts the bit index but
>>>also clears it.
>>>
>>>//
>>>// This function gives access to the BSF
>>>// (bit scan forward) opcode and is used
>>>// to extract bit indices from bitboards.
>>>//
>>>#ifdef _MSC_VER
>>>#pragma warning( push)
>>>#pragma warning( disable: 4035)
>>>inline int bsf( uint32 bitpat)
>>>{
>>> __asm bsf eax, dword ptr bitpat
>>>}
>>>#pragma warning( pop)
>>>#endif // _MSC_VER
>>>
>>>
>>> //
>>> // Bit index extraction. Member function of bitboard_t.
>>> //
>>> int next_one() {
>>> if( lower ) {
>>> int index = bsf( lower);
>>> lower ^= (1ul << index);
>>> return index;
>>> } else { // Not testing upper since we won't run this on empty...
>>> int index = bsf( upper);
>>> upper ^= (1ul << index);
>>> return index + 32;
>>> }
>>> }
>>>
>>>The bsf() function has the only asm used in my program--just one line.
>>>I've tried coding more of next_one() in assembly, but I've only managed
>>>to slow the program down...
>>>
>>>"upper" and "lower" are just the two 32-bit halves of the bitboard.
>>>
>>>-Dan.
>>
>>Ah yes, this is not so bad except for one real slow branch which
>>might get mispredicted!
>>
>>Perhaps start to rewrite that one a bit too till there is not
>>a single 'if then else' left?
>>
>>Because on average you can easily use 15 extra instructions to
>>get rid of it!
>>
>>Even more on the P4, by the way. Except for the shift
>>instructions: to my amazement those were slower on the P4, but that
>>might change with the next release of the P4...
>
>I can't remember if I actually tried eliminating that branch or not
>since it causes a little code bloat, but I did think about it. If I
>tried it, then it probably slowed me down a bit, else I would have it
>in my code now...
>
>The idea I had is that once the lower 32 bits are exhausted of one-bits,
>then you really don't need to test it anymore. All the code that uses
>next_one() would be duplicated so that we could use a next_lower_one()
>and a next_upper_one() function.
>
>Instead of
>
> while( victims.exist() ) {
> int to = victims.next_one();
> ....
> }
>
>I'd use
>
> while( victims.lower_exist() ) {
> int to = victims.next_lower_one();
> ....
> }
> while( victims.upper_exist() ) {
> int to = victims.next_upper_one();
> ....
> }
>
>This would double the size of movegen(), so it might lose all of the
>gain by increasing its cache footprint--not that it's really huge...
>
>-Dan.
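A minimal sketch of those split extractors (lower_exist / next_lower_one are Dan's names from the post; the bodies are assumptions, again with a compiler builtin standing in for bsf()):

```cpp
#include <cstdint>

struct bitboard_t {
    uint32_t lower, upper;
    bool lower_exist() const { return lower != 0; }
    bool upper_exist() const { return upper != 0; }
    int next_lower_one() {            // no half-selection test needed:
        int i = __builtin_ctz(lower); // the caller's loop already
        lower &= lower - 1;           // guarantees which half we are in
        return i;
    }
    int next_upper_one() {
        int i = __builtin_ctz(upper);
        upper &= upper - 1;
        return i + 32;
    }
};
```

The two while-loops from the post would then drive these directly, at the cost of duplicating each loop body once, which is the code-bloat worry raised above.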
It depends of course on the number of extra instructions you need. If
you need more than 30 instructions, part of them complex, then it is
of course slower. Code cache size usually isn't the problem here: your
prog most likely fits within the L2 cache easily, so the processor can
stream plenty of code into the L1 code cache without problems
and without delaying you!

Things are of course different if you talk about a K6-2 or something
like an old Macintosh where the L2 cache runs at 66 MHz bus speed
(correct me if I'm wrong, isn't the G4 the only one fixing this
problem?)... Getting the entire program inside the L1 cache
is advisable then :)
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.