Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Reverse Bitboards

Author: Sander de Zoete

Date: 14:02:10 01/13/03

Go up one level in this thread


On January 13, 2003 at 16:50:46, Russell Reagan wrote:

>On January 13, 2003 at 16:16:11, Sander de Zoete wrote:
>
>>Does anyone use reverse bitboards? I'm writing a new chess program "Preatorian",
>>which uses reverse bitboards. Up till now it looks good, but is still 10% slower
>>on capture generation than my 0x88 program "Shark". And Shark is slow, it must
>>say.
>>
>>I add some code to see if i'm doing it correct? What could be improved?
>
>Perhaps you could describe breifly how reverse bitboards work compared to
>"regular" bitboards. I'm sure we could reverse engineer the information from
>your code, but a simple explaination would be quicker :)
>
>For example, a "regular" bitboard represents the answer to a yes or no question
>about each square on the board. The 'white pawns' bitboard answers the question,
>"Does this square contain a white pawn?"
>
>Could you give us a similar explaination of what reversed bitboards do?

The reverse bitboards do the same as rotated bitboards, but are based on the
following:
1. a bitboard manipulation trick x ^ (x - 2).
2. because of 1. no need for lookup tables.

I add the text of the website I accidently run into and captured my interest, so
much, that I start writing Preatorian.

The Hyperbola Project - New Technologies
    What follows here is merely a glimpse at the new advances in computer chess
that the Hyperbola Project will include. The revolutionary design will generate
enormous speed improvements, optimized specifically for the Pentium III
processor's SIMD and MMX enhamcements. The downside is that the initial releases
of Hyperbola will require a P-III system to operate. As the core engine is being
written in assembly code, it is very difficult to optimize for more than one
chip at a time.

    The past two years have yielded a serious look at the way computer chess
programs operate today. The most significant problem with PC-oriented (non
64-bit) programs is the lack of speed associated with doing 64-bit arithmetic in
a 32-bit environment.

    MMX technology introduced some 64-bit operations to the Pentium family of
microprocessors, which do increase performance noticeably, but not optimally.
When using MMX, many operations still have to be completed in the old 32-bit
system, which degrades the performance gain when the data has to be exchanged
back and forth between the regular registers and the MMX registers.

    The improvements shown below also address another significant bottleneck in
chess programming: memory speeds. Despite the fact that Hyperbola is designed
for systems with extremely fast memory (RDRAM), memory is still a bottleneck for
some calculations, which require a very tight dependency chain. Since a memory
read from the fastest (L1) cache has a latency of three clock cycles, the
processor remains stalled in a tight dependency chain such as this, where no
other calculations can be completed while the processor waits for the memory
read to complete.

Bitboard Design and Sliding Piece Attack Generation
    Generating the attacks for sliding pieces proved to be a significant
bottleneck in the early trials. For example, before generating legal moves for a
position, the program must determine whether the king is in check. To do this,
the program has to find the attacks of every type of piece from the king's
square and check if an appropriate enemy piece resides there. Since this cannot
be done while analyzing other moves, this is a huge slowdown under the current
technology, which uses four sets of bitboards (three rotated) and several memory
lookups to determine the correct attacks.

    Hyperbola's design is quite different in this respect from other programs.
Hyperbola introduces the concept of a reversed bitboard in place of three
rotated bitboards, and TOTALLY eliminates the lookup tables in favor of simple
calculations, all of which can be done in the 64-bit MMX registers.

    With the research still in progress on this topic, it is not clear as to
exactly how much performance can be enhanced by this new technique, but it is
clear that performance will be improved very significantly, especially with MMX
hardware. Processors that are completely 64-bit, such as the Intel Itanium, will
also be very much helped by this technique.

Now, how do reverse bitboards help generate sliding piece attacks?
    First, let's cover what has to be maintained by the program. There is the
normal set of bitboards, all generated in the normal direction. And there is a
single reverse bitboard, which roughly takes the place of the three rotated
bitboards in most programs today. The reverse bitboard contains the same
information as the forward bitboard which denotes all occupied squares, but it
is in reverse, that is, bit x in the forward bitboard corresponds to bit 63-x in
the reverse bitboard.

What benefit comes from having a reversed bitboard in place of rotated
bitboards?
Maintenance
    A single reversed bitboard is easier to maintain than one rotated bitboard,
let alone three. This is because a table lookup is required to convert a bit
from the forward format to most rotated formats. With the reverse bitboard,
simply subtract the bit number from 63. Considering that only one of these
bitboards has to be maintained instead of three, this issue is important.

The Calculations
    When taking account of memory latencies, calculating the piece attacks using
the forward and reverse bitboards can be done significantly faster due to total
independence on lookup tables and complex calculations. There are a few slight
snags with the diagonal calculations, but they are minor.

    The benefit of having the forward and reverse bitboards really shines in
light of a very important bit manipulation trick: x XOR (x - 2) contains bits
set to one, starting from the second bit, and stopping at the first bit of x
that is one, but including it.

    Now, if x is the bitboard of all occupied squares, and it is shifted to the
right so that the square for which attacks are being calculated becomes bit
zero, then XORing x with (x - 2) gives a mask of all squares attacked by the
piece on the desired square, once the result is shifted back to the left. It
does not include the piece that is making the attack, but it does include the
first occupied square. This is exactly the result the lookup tables return!

    The snags come from wraparound - that is, when the mask goes to the end of
the file, rank, or diagonal... and keeps going where it shouldn't. This can be
easily fixed in rank/file attacks by masking only the rank or file in question,
but it is more complicated on diagonals.

Where do the reverse bitboards come in?
     The regular bitboard is fine for calculating any attacks that go in the
positive direction from the starting square (by masking the rank, file, or
diagonal), but the bit-manipulating trick won't work in the opposite direction.
The solution is simple: calculate attacks in the opposite direction using a
reverse bitboard, so that on that bitboard they extend in the positive direction
and can be calculated as above.

    The real problem comes in trying to combine the forward and reverse
bitboards - there is just no simple way to un-reverse the bitboard. The
technique to do the switch is not terribly cumbersome, but it certainly cannot
be used in cases where only one set of attacks is being generated... the delay
far outweighs the gain.

    However, in Hyperbola, this delay can be managed for an overall gain. Though
it is unclear whether these types of calculations will even be used in
generating legal moves, due to an incremental move generator, the two bitboards
can be scanned separately, converting the returns from the bitscans from reverse
to forward (by subtracting from 63) instead of converting the entire bitboard.

    A more pertinent issue arises with determining whether the king is in check,
which is done before generating any legal moves. Possible ways to go about this
include converting the reverse bitboard, which is probably not optimal,
maintaining reverse bitboards for the sliding pieces of both colors, which is
possible, and testing the individual squares on the bitboards both in forward
and in reverse, which is nearly out of the question. In the newer chip
architectures any unpredictable branches severely degrade performance, and the
technique to avoid that is also slow because it can only be used on 32-bit
registers.

    The most viable option is to maintain four reverse bitboards, one for the
bishops/queens and rooks/queens of each color. Updating them does not degrade
performance because only one of the four must be updated for any move, or none
at all, in addition to the regular updates.

    With the extra reverse bitboards, it is now very easy to determine if the
king is in check without any memory tables. First, calculate the attacks from
the king's square for both types of sliding pieces, and simply AND them with the
appropriate bitboards of enemy pieces. After that, OR all the results of the
comparisons together, and take a single branch.

    A possible shortcut exists here still. If the king is not in check, which is
almost always the case, and all of the possible sliding piece lines are closed
or empty, which is usually true, then the king cannot be checked by a sliding
piece. The merit of this idea is questionable, but a more concrete idea is much
more helpful. If the previously used sliding piece attack bitboards are saved
and it is determined that none of the squares on those bitboards have changed in
the bitboard of occupied pieces, then the bitboards can be used again and the
calculations skipped. If this shortcut is used, then it might be more economical
to reverse the bitboard(s) after calculating them and ignoring the additional
reverse bitboards. That depends on the hit rate.

How can the reverse bitboards be converted back to normal?
    The fastest way to do this isn't a huge deal, but it cannot be run in
parallel with other operations and does degrade performance if it is run too
often. Suppose we have the lower 16 bits of a bitboard, x: FEDCBA9876543210.

Load the constant, k = 5555555555555555h

x = [(x shl 1) and k] or [(x and k) shr 1] EFCDAB8967452301

Load the constant, k = 3333333333333333h

x = [(x shl 2) and k] or [(x and k) shr 2] CDEF89AB45670123

Load the constant, k = 0F0F0F0F0F0F0F0Fh

x = [(x shl 4) and k] or [(x and k) shr 4] 89ABCDEF01234567

Now, due to the availability of a MMX shift instruction that operates on 16-bit
words, the bytes can now be switched more easily

x = (x shlw 8) or (x shrw 8) 0123456789ABCDEF

And finally, the pshufw on the P-III reorganizes the words in a single
instruction

. x = x pshufw 00011011b

    Since all of this arithmetic can be done in the MMX registers, it runs very
much in parallel and can complete in a reasonable time frame on the P-III. If
this algorithm were to be redesigned for P-II or PMMX architectures (without the
pshufw) it would be a little slower since two more steps need to be completed.
Trying to do this without 64-bit support would be extremely slow due to the
extended-precision operands, despite that most of it could be done in parallel.
(This algorithm is also in parallel, so it would probably take over twice as
long.)

Hyperbola - Coming 2001
    Well, the new use of bitboards to accomodate simple 64-bit operations
promises very high-speed calculations for Hyperbola, especially considering the
decreased reliance on memory tables. The large tables just for sliding piece
calculations under current programs can't even fit into the Level 2 cache of the
Coppermine P-III, although most recently-used addresses would remain cached.
However, even an access to the L2 cache degrades performance: the Coppermine,
with the full-speed ATC, still has approximately a 9-clock latency. Since the
16K L1 cache can only hold 512 cache lines, it is probable that most of the
recently used table values will get bumped to the L2 cache, or even to memory,
which has a 12-clock latency even with PC800 RDRAM.
Hyperbola is a long way away from first release or even alpha testing, but the
technological advancements, such as the concept of reverse bitboards explained
in detail above, promise to give this new engine a significant edge. It is
possible that, even with a relatively thorough evaluation function, the nps
(Nodes Per Second) rate could exceed 10,000,000.

Sander



This page took 0.01 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.