Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Why bitboards at all?

Author: Ed Schröder

Date: 00:14:56 06/22/00

Go up one level in this thread


On June 22, 2000 at 01:53:30, Eugene Nalimov wrote:

>On June 22, 2000 at 01:06:43, Ed Schröder wrote:
>
>>On June 21, 2000 at 21:51:01, Robert Hyatt wrote:
>>
>>>On June 21, 2000 at 19:14:24, Tom Kerrigan wrote:
>>>
>>>>On June 21, 2000 at 17:16:05, Robert Hyatt wrote:
>>>>
>>>>>On June 21, 2000 at 14:52:34, Tom Kerrigan wrote:
>>>>>
>>>>>>On June 21, 2000 at 14:38:23, Christophe Theron wrote:
>>>>>>
>>>>>>>On June 21, 2000 at 13:33:14, Tom Kerrigan wrote:
>>>>>>>
>>>>>>>>On June 20, 2000 at 21:39:01, Robert Hyatt wrote:
>>>>>>>>
>>>>>>>>>On June 20, 2000 at 15:52:53, Tom Kerrigan wrote:
>>>>>>>>>
>>>>>>>>>>On June 20, 2000 at 15:03:48, Robert Hyatt wrote:
>>>>>>>>>>
>>>>>>>>>>>On June 20, 2000 at 14:07:39, Tom Kerrigan wrote:
>>>>>>>>>>>
>>>>>>>>>>>>On June 19, 2000 at 21:32:56, Robert Hyatt wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>On June 19, 2000 at 20:50:11, John Coffey wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>On June 19, 2000 at 19:48:36, Larry Griffiths wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>I have found bitboards to be an even trade-off on my Pentium system.  I have to
>>>>>>>>>>>>>>>update about 6 bitboards when a piece moves and this generates a lot of
>>>>>>>>>>>>>>>instructions.  I get it back in my IsKingInCheck code so it evens out.  I like
>>>>>>>>>>>>>>>to have fast move generation code, but most of my gains have been through
>>>>>>>>>>>>>>>alpha-beta, hash-table, killer-move and movelist ordering etc.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Larry.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Maybe I am too much of a novice, but I don't see yet why I should convert over
>>>>>>>>>>>>>>to bitboards.  Is move generation faster?  If so, why?  My program scans the
>>>>>>>>>>>>>>board and uses simple loops to generate moves.  Do you not have to do loops
>>>>>>>>>>>>>>with bitboards?
>>>>>>>>>>>>>
>>>>>>>>>>>>>Not to generate moves, No. You generate all the sliding piece moves with two
>>>>>>>>>>>>>table lookups...
>>>>>>>>>>>>
>>>>>>>>>>>>Hmmm. I do table lookups all over my program, and none of them seem to be
>>>>>>>>>>>>generating any moves...
>>>>>>>>>>>>The fact is that you DO need to loop to generate moves in a bitboard program.
>>>>>>>>>>>>Maybe it's not the same loop, but it's still a loop.
>>>>>>>>>>>>
>>>>>>>>>>>>-Tom
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>Who says so?  Ask the Dark Thought guys.
>>>>>>>>>>>Or Slate/Atkin.  You only need to loop if you want to take the attack bitmap
>>>>>>>>>>>and turn it into a list of moves.  This is not the way _all_ programs operate
>>>>>>>>>>>(chess 4.x, Dark Thought, others, any of which generate a few moves at a time,
>>>>>>>>>>>then take one and search it, without enumerating the other moves.)
>>>>>>>>>>>
>>>>>>>>>>>So loops are something you do (with bitmaps) if you want to, not because you
>>>>>>>>>>>have to.
>>>>>>>>>>>
>>>>>>>>>>>As far as your table lookups not generating any moves, that is a programming
>>>>>>>>>>>issue.  Mine do.  :)
>>>>>>>>>>
>>>>>>>>>>Maybe your makemove() function can take bitboards as input (i.e., here is a set
>>>>>>>>>>of squares that my pieces can move to) but mine sure can't.
>>>>>>>>>>-Tom
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>You are missing the point.  A move generator _can_ emit a single move, which
>>>>>>>>>can be fed into MakeMove().  Read "Chess Skill in Man and Machine", the chess
>>>>>>>>>4.x section.  They explain this pretty well.  It takes zero loops to emit a
>>>>>>>>>single chess move.  You pick the source square.  You do two table lookups for
>>>>>>>>>bishops (say) and you have all the target squares it can move to.  A single
>>>>>>>>>FirstOne() and you have a <to> square, which is all you need to make the move,
>>>>>>>>>and recursively call Search().
>>>>>>>>
>>>>>>>>So you end up having to call gen() a mess of times. I don't see how that isn't a
>>>>>>>>loop.
>>>>>>>>-Tom
>>>>>>>
>>>>>>>
>>>>>>>As I understand he says that in order to generate one move he doesn't have to
>>>>>>>loop. That's what James explains in another post.
>>>>>>>
>>>>>>>With 0x88 or 16x you have to loop thru empty squares, he says with bitboards you
>>>>>>>don't have to. For each rank, file or diagonal in any configuration (by
>>>>>>>configuration I mean set of empty squares in this rank/file/diagonal), you can
>>>>>>>have precomputed arrays that instantly give you the set of squares (a bitboard)
>>>>>>>a sliding piece can move to.
>>>>>>>
>>>>>>>Not that I support his point of view about bitboards. I prefer to "loop thru
>>>>>>>empty squares" in my L1 cache rather than clobbering the same cache with
>>>>>>>bitboards. And anyway, the time required to extract the rank/file/diagonal from
>>>>>>>the "occupied" bitboard and the time required to process the resulting set of
>>>>>>>"can move to" squares is not required in 0x88 or 16x. And for non-sliding pieces
>>>>>>>(which represent in average half of the pieces present on the board), the method
>>>>>>>does not apply.
>>>>>>
>>>>>>Exactly, it's necessary to process the resulting bitboards. Maybe you can do
>>>>>>some simple operations to get an interesting set of bits, but at some point, you
>>>>>>have to turn those bits into something useful. There has to be a loop somewhere
>>>>>>which extracts the bits and does appropriate things to them. If you're lucky,
>>>>>>your processor has BSF/BSR (or an equivalent) and this loop is relatively fast.
>>>>>>But if you don't have these instructions, I bet the pretty bit patterns aren't
>>>>>>helping you much. Personally, I'm a little sick of people saying, "oh, one AND
>>>>>>operation and I'm done!" and totally ignoring everything else that has to be
>>>>>>done.
>>>>>>
>>>>>>-Tom
>>>>>
>>>>>You are not nearly so sick of hearing that as I am sick of hearing people talk
>>>>>about what you can and can't do with bitboards _without_ ever having tried them.
>>>>>
>>>>>Again, there is _no_ need for a loop.  I can generate a single move (capture)
>>>>>with no loop of any kind.  Anybody can generate a non-capture (single move)
>>>>>without a loop, of course.  But captures are way more common to want, since
>>>>>they are usually tried first.
>>>>
>>>>Fine, let's review something you said earlier:
>>>>>Not to generate moves, No. You generate all the sliding piece moves with two
>>>>>table lookups...
>>>>
>>>>So how about you tell me how you're going to generate multiple moves ("all the
>>>>sliding piece moves") in some sort of machine-usable form without doing a loop?
>>>>Remember, a loop around the move generator is still a loop. Nobody's asking
>>>>whether or not you can generate a single stupid capture without a loop, and
>>>>there's no practical value in that anyway, unless you can be sure that the
>>>>capture is generated in the correct order.
>>>
>>>
>>>Easy, again from Chess Skill in man and machine.  I produce a 64 bit value
>>>for all the bishop moves, by doing two table lookups.  I already know the
>>><from> square to produce these moves.  I use a FirstOne() call to find one
>>>of the destination squares (<to> square).  I clear this bit, save the 64 bit
>>>vector, and make this move.  I recursively call search.  When it returns, I
>>>regrab the 64 bit vector, FirstOne() to find the next destination, make this
>>>move and again call Search() recursively.  The only loop I have is the same
>>>loop everyone has to select the next move.  I have _zero_ loops to _generate_
>>>the moves.
>>>
>>>
>>>
>>>>
>>>>>I've done 0x88, 8X8, 16x16, 10x12, and probably others.  I don't think that
>>>>>move generation is the separating point for bitboards vs the others, except
>>>>>for the fact that I can generate captures far easier.  Bitboards help in other
>>>>>places as well.  And on 64 bit architectures, they make a lot of sense.
>>>>>
>>>>>You ought to do what I did 5 years ago.  Say "I am going to try this for a
>>>>>couple of years, to see if this is worthwhile."   It takes a lot of time and
>>>>
>>>>Remember, I did use bitboards for a while, I know many of the issues involved.
>>>
>>>
>>>I have used them for 5 years.  I have learned far more.  And I am still finding
>>>new ways to do things every few months. The learning doesn't stop after using
>>>them "for a while".
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>>experience and false starts to do bitboards.  But they _can_ work quite well.
>>>>>I can point to several programs that prove this, from Kaissa and chess 4.x in
>>>>>the 1970's, thru Crafty and several others in 2000.
>>>>>
>>>>>But they _do_ take time to learn, just like a programming language does.
>>>>
>>>>You have your way, I have my way. In case you didn't notice, I'm not saying one
>>>>way is better or worse. (Not in this thread, anyway.) So I don't see why you're
>>>>being so violently pro-bitboard. All I'm saying is that you should not jump in
>>>>and say that you can solve the world's problems with a single table lookup,
>>>>because that's simply not accurate.
>>>>
>>>>-Tom
>>>
>>>
>>>I'm not violently pro-bitboard at all.  I simply corrected some _wrong_
>>>information that was being posted here.  I've said hundreds of times that
>>>until we talk about 64 bit cpus, bitboards probably do no better than break
>>>even with other good approaches. I have also said, hundreds of times, that
>>>move generation is _not_ the most important thing done in a chess board. At
>>>least for my code, it is not in the top 5 when you profile things.
>>>
>>>As far as a lookup goes, I can generate _all_ sliding piece moves for a bishop,
>>>with two 64 bit memory loads.  I can generate all captures for a bishop just as
>>>easily, without having to traverse the empty squares.  I didn't say any more
>>>or any less than that.  Since in a chess engine, generating captures is a very
>>>common thing to do, bitboards are good there.  They are good in other places.
>>>They also have their problems.  But memory bandwidth is not particularly one of
>>>them.  And on machines like the EV6-based 64 bit architectures, I think bitmaps
>>>might have a real advantage due to the inherent data density they have.
>>
>>Having never had the pleasure of dealing with BB's I can understand the
>>2 x AND to detect the capture. But how do you get the square from the
>>64 bit integer since you can't address a 64 bit table to get the square
>>immediately so you have to write a piece of clever code for it, right?
>>Looks (very) time consuming to me, or?
>>
>>Ed
>
>Here is relevant part of Crafty (FirstOne(), the LastOne() is virtually the
>same):
>
>(1) Plain C
>
>int FirstOne(BITBOARD arg1) {
>  union doub {
>    unsigned short i[4];
>    BITBOARD d;
>  };
>  register union doub x;
>  x.d=arg1;
>#if defined(LITTLE_ENDIAN_ARCH)
>  if (x.i[3])
>    return (first_ones[x.i[3]]);
>  if (x.i[2])
>    return (first_ones[x.i[2]]+16);
>  if (x.i[1])
>    return (first_ones[x.i[1]]+32);
>  if (x.i[0])
>    return (first_ones[x.i[0]]+48);
>#else
>  if (x.i[0])
>    return (first_ones[x.i[0]]);
>  if (x.i[1])
>    return (first_ones[x.i[1]]+16);
>  if (x.i[2])
>    return (first_ones[x.i[2]]+32);
>  if (x.i[3])
>    return (first_ones[x.i[3]]+48);
>#endif
>  return(64);
>}
>
>(2) x86 assembly (MSVC)
>
>__forceinline int FirstOne(BITBOARD a) {
>  __asm {
>        bsr     edx, dword ptr a+4
>        mov     eax, 31
>        jnz     l1
>        bsr     edx, dword ptr a
>        mov     eax, 63
>        jnz     l1
>        mov     edx, -1
>  l1:   sub     eax, edx
>  }
>}
>
>Portable C version is not terribly efficient (it's trashing the cache), but good
>enough. Assembly version is very fast on P6/PII/PIII, and you can speed it up if
>you'll never pass zero as an argument (worst case would be ~15 CPU ticks,
>average ~10). It's very easy to write portable C version that would be very
>efficient on IA-64, something like this:
>
>int FirstOne (__int64 arg) {
>    __int64 result = 0;
>
>    if (arg > 0xFFFFFFFF) {
>        arg >>= 32;
>        result = 32;
>    }
>    if (arg > 0xFFFF) {
>        arg >>= 16;
>        result += 16;
>    }
>    if (arg > 0xFF) {
>        arg >>= 8;
>        result += 8;
>    }
>    return first_one_8bit[arg];
>}
>
>There would be no branches after compiler would optimize the function, and
>execution time would always be 9 CPU ticks (assuming that 256-bytes
>first_one_8bit[] is in the L1 cache).
>
>It's easy to write LastOne() in the same manner.
>
>Eugene

Thanks Eugene. I checked my documentation concerning the clock cycles
of BSF/BSR. These instruction are horribly slow (80-100 cycles) and
that's why I always have avoided them. Now I read this:

   BSF and BSR are the poorest optimized instructions on the PPlain
   and PMMX, taking approximately 11 + 2*n clock cycles, where n is
   the number of zeros skipped.

and:

   ... on the PPro, PII and PIII, where the bit scan instructions take
   only 1 or 2 clocks.

1 or 2 clocks. That is a giant improvement. Now I wonder if it is true.

Ed



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.