Author: Eugene Nalimov
Date: 00:43:53 06/22/00
Go up one level in this thread
On June 22, 2000 at 03:14:56, Ed Schröder wrote:
>On June 22, 2000 at 01:53:30, Eugene Nalimov wrote:
>
>>On June 22, 2000 at 01:06:43, Ed Schröder wrote:
>>
>>>On June 21, 2000 at 21:51:01, Robert Hyatt wrote:
>>>
>>>>On June 21, 2000 at 19:14:24, Tom Kerrigan wrote:
>>>>
>>>>>On June 21, 2000 at 17:16:05, Robert Hyatt wrote:
>>>>>
>>>>>>On June 21, 2000 at 14:52:34, Tom Kerrigan wrote:
>>>>>>
>>>>>>>On June 21, 2000 at 14:38:23, Christophe Theron wrote:
>>>>>>>
>>>>>>>>On June 21, 2000 at 13:33:14, Tom Kerrigan wrote:
>>>>>>>>
>>>>>>>>>On June 20, 2000 at 21:39:01, Robert Hyatt wrote:
>>>>>>>>>
>>>>>>>>>>On June 20, 2000 at 15:52:53, Tom Kerrigan wrote:
>>>>>>>>>>
>>>>>>>>>>>On June 20, 2000 at 15:03:48, Robert Hyatt wrote:
>>>>>>>>>>>
>>>>>>>>>>>>On June 20, 2000 at 14:07:39, Tom Kerrigan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>On June 19, 2000 at 21:32:56, Robert Hyatt wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>On June 19, 2000 at 20:50:11, John Coffey wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>On June 19, 2000 at 19:48:36, Larry Griffiths wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>I have found bitboards to be an even trade-off on my Pentium system. I have to
>>>>>>>>>>>>>>>>update about 6 bitboards when a piece moves and this generates a lot of
>>>>>>>>>>>>>>>>instructions. I get it back in my IsKingInCheck code so it evens out. I like
>>>>>>>>>>>>>>>>to have fast move generation code, but most of my gains have been through
>>>>>>>>>>>>>>>>alpha-beta, hash-table, killer-move and movelist ordering etc.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>Larry.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Maybe I am too much of a novice, but I don't see yet why I should convert over
>>>>>>>>>>>>>>>to bitboards. Is move generation faster? If so, why? My program scans the
>>>>>>>>>>>>>>>board and uses simple loops to generate moves. Do you not have to do loops
>>>>>>>>>>>>>>>with bitboards?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>Not to generate moves, No. You generate all the sliding piece moves with two
>>>>>>>>>>>>>>table lookups...
>>>>>>>>>>>>>
>>>>>>>>>>>>>Hmmm. I do table lookups all over my program, and none of them seem to be
>>>>>>>>>>>>>generating any moves...
>>>>>>>>>>>>>The fact is that you DO need to loop to generate moves in a bitboard program.
>>>>>>>>>>>>>Maybe it's not the same loop, but it's still a loop.
>>>>>>>>>>>>>
>>>>>>>>>>>>>-Tom
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>Who says so? Ask the Dark Thought guys.
>>>>>>>>>>>>Or Slate/Atkin. You only need to loop if you want to take the attack bitmap
>>>>>>>>>>>>and turn it into a list of moves. This is not the way _all_ programs operate
>>>>>>>>>>>>(chess 4.x, Dark Thought, others, any of which generate a few moves at a time,
>>>>>>>>>>>>then take one and search it, without enumerating the other moves.)
>>>>>>>>>>>>
>>>>>>>>>>>>So loops are something you do (with bitmaps) if you want to, not because you
>>>>>>>>>>>>have to.
>>>>>>>>>>>>
>>>>>>>>>>>>As far as your table lookups not generating any moves, that is a programming
>>>>>>>>>>>>issue. Mine do. :)
>>>>>>>>>>>
>>>>>>>>>>>Maybe your makemove() function can take bitboards as input (i.e., here is a set
>>>>>>>>>>>of squares that my pieces can move to) but mine sure can't.
>>>>>>>>>>>-Tom
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>You are missing the point. A move generator _can_ emit a single move, which
>>>>>>>>>>can be fed into MakeMove(). Read "Chess Skill in Man and Machine", the chess
>>>>>>>>>>4.x section. They explain this pretty well. It takes zero loops to emit a
>>>>>>>>>>single chess move. You pick the source square. You do two table lookups for
>>>>>>>>>>bishops (say) and you have all the target squares it can move to. A single
>>>>>>>>>>FirstOne() and you have a <to> square, which is all you need to make the move,
>>>>>>>>>>and recursively call Search().
>>>>>>>>>
>>>>>>>>>So you end up having to call gen() a mess of times. I don't see how that isn't a
>>>>>>>>>loop.
>>>>>>>>>-Tom
>>>>>>>>
>>>>>>>>
>>>>>>>>As I understand he says that in order to generate one move he doesn't have to
>>>>>>>>loop. That's what James explains in another post.
>>>>>>>>
>>>>>>>>With 0x88 or 16x you have to loop thru empty squares, he says with bitboards you
>>>>>>>>don't have to. For each rank, file or diagonal in any configuration (by
>>>>>>>>configuration I mean set of empty squares in this rank/file/diagonal), you can
>>>>>>>>have precomputed arrays that instantly give you the set of squares (a bitboard)
>>>>>>>>a sliding piece can move to.
>>>>>>>>
>>>>>>>>Not that I support his point of view about bitboards. I prefer to "loop thru
>>>>>>>>empty squares" in my L1 cache rather than clobbering the same cache with
>>>>>>>>bitboards. And anyway, the time required to extract the rank/file/diagonal from
>>>>>>>>the "occupied" bitboard and the time required to process the resulting set of
>>>>>>>>"can move to" squares is not required in 0x88 or 16x. And for non-sliding pieces
>>>>>>>>(which represent in average half of the pieces present on the board), the method
>>>>>>>>does not apply.
>>>>>>>
>>>>>>>Exactly, it's necessary to process the resulting bitboards. Maybe you can do
>>>>>>>some simple operations to get an interesting set of bits, but at some point, you
>>>>>>>have to turn those bits into something useful. There has to be a loop somewhere
>>>>>>>which extracts the bits and does appropriate things to them. If you're lucky,
>>>>>>>your processor has BSF/BSR (or an equivalent) and this loop is relatively fast.
>>>>>>>But if you don't have these instructions, I bet the pretty bit patterns aren't
>>>>>>>helping you much. Personally, I'm a little sick of people saying, "oh, one AND
>>>>>>>operation and I'm done!" and totally ignoring everything else that has to be
>>>>>>>done.
>>>>>>>
>>>>>>>-Tom
>>>>>>
>>>>>>You are not nearly so sick of hearing that as I am sick of hearing people talk
>>>>>>about what you can and can't do with bitboards _without_ ever having tried them.
>>>>>>
>>>>>>Again, there is _no_ need for a loop. I can generate a single move (capture)
>>>>>>with no loop of any kind. Anybody can generate a non-capture (single move)
>>>>>>without a loop, of course. But captures are way more common to want, since
>>>>>>they are usually tried first.
>>>>>
>>>>>Fine, let's review something you said earlier:
>>>>>>Not to generate moves, No. You generate all the sliding piece moves with two
>>>>>>table lookups...
>>>>>
>>>>>So how about you tell me how you're going to generate multiple moves ("all the
>>>>>sliding piece moves") in some sort of machine-usable form without doing a loop?
>>>>>Remember, a loop around the move generator is still a loop. Nobody's asking
>>>>>whether or not you can generate a single stupid capture without a loop, and
>>>>>there's no practical value in that anyway, unless you can be sure that the
>>>>>capture is generated in the correct order.
>>>>
>>>>
>>>>Easy, again from Chess Skill in man and machine. I produce a 64 bit value
>>>>for all the bishop moves, by doing two table lookups. I already know the
>>>><from> square to produce these moves. I use a FirstOne() call to find one
>>>>of the destination squares (<to> square). I clear this bit, save the 64 bit
>>>>vector, and make this move. I recursively call search. When it returns, I
>>>>regrab the 64 bit vector, FirstOne() to find the next destination, make this
>>>>move and again call Search() recursively. The only loop I have is the same
>>>>loop everyone has to select the next move. I have _zero_ loops to _generate_
>>>>the moves.
>>>>
>>>>
>>>>
>>>>>
>>>>>>I've done 0x88, 8X8, 16x16, 10x12, and probably others. I don't think that
>>>>>>move generation is the separating point for bitboards vs the others, except
>>>>>>for the fact that I can generate captures far easier. Bitboards help in other
>>>>>>places as well. And on 64 bit architectures, they make a lot of sense.
>>>>>>
>>>>>>You ought to do what I did 5 years ago. Say "I am going to try this for a
>>>>>>couple of years, to see if this is worthwhile." It takes a lot of time and
>>>>>
>>>>>Remember, I did use bitboards for a while, I know many of the issues involved.
>>>>
>>>>
>>>>I have used them for 5 years. I have learned far more. And I am still finding
>>>>new ways to do things every few months. The learning doesn't stop after using
>>>>them "for a while".
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>>experience and false starts to do bitboards. But they _can_ work quite well.
>>>>>>I can point to several programs that prove this, from Kaissa and chess 4.x in
>>>>>>the 1970's, thru Crafty and several others in 2000.
>>>>>>
>>>>>>But they _do_ take time to learn, just like a programming language does.
>>>>>
>>>>>You have your way, I have my way. In case you didn't notice, I'm not saying one
>>>>>way is better or worse. (Not in this thread, anyway.) So I don't see why you're
>>>>>being so violently pro-bitboard. All I'm saying is that you should not jump in
>>>>>and say that you can solve the world's problems with a single table lookup,
>>>>>because that's simply not accurate.
>>>>>
>>>>>-Tom
>>>>
>>>>
>>>>I'm not violently pro-bitboard at all. I simply corrected some _wrong_
>>>>information that was being posted here. I've said hundreds of times that
>>>>until we talk about 64 bit cpus, bitboards probably do no better than break
>>>>even with other good approaches. I have also said, hundreds of times, that
>>>>move generation is _not_ the most important thing done in a chess board. At
>>>>least for my code, it is not in the top 5 when you profile things.
>>>>
>>>>As far as a lookup goes, I can generate _all_ sliding piece moves for a bishop,
>>>>with two 64 bit memory loads. I can generate all captures for a bishop just as
>>>>easily, without having to traverse the empty squares. I didn't say any more
>>>>or any less than that. Since in a chess engine, generating captures is a very
>>>>common thing to do, bitboards are good there. They are good in other places.
>>>>They also have their problems. But memory bandwidth is not particularly one of
>>>>them. And on machines like the EV6-based 64 bit architectures, I think bitmaps
>>>>might have a real advantage due to the inherent data density they have.
>>>
>>>Having never had the pleasure of dealing with BB's I can understand the
>>>2 x AND to detect the capture. But how do you get the square from the
>>>64 bit integer since you can't address a 64 bit table to get the square
>>>immediately so you have to write a piece of clever code for it, right?
>>>Looks (very) time consuming to me, or?
>>>
>>>Ed
>>
>>Here is relevant part of Crafty (FirstOne(), the LastOne() is virtually the
>>same):
>>
>>(1) Plain C
>>
>>int FirstOne(BITBOARD arg1) {
>> union doub {
>> unsigned short i[4];
>> BITBOARD d;
>> };
>> register union doub x;
>> x.d=arg1;
>>#if defined(LITTLE_ENDIAN_ARCH)
>> if (x.i[3])
>> return (first_ones[x.i[3]]);
>> if (x.i[2])
>> return (first_ones[x.i[2]]+16);
>> if (x.i[1])
>> return (first_ones[x.i[1]]+32);
>> if (x.i[0])
>> return (first_ones[x.i[0]]+48);
>>#else
>> if (x.i[0])
>> return (first_ones[x.i[0]]);
>> if (x.i[1])
>> return (first_ones[x.i[1]]+16);
>> if (x.i[2])
>> return (first_ones[x.i[2]]+32);
>> if (x.i[3])
>> return (first_ones[x.i[3]]+48);
>>#endif
>> return(64);
>>}
>>
>>(2) x86 assembly (MSVC)
>>
>>__forceinline int FirstOne(BITBOARD a) {
>> __asm {
>> bsr edx, dword ptr a+4
>> mov eax, 31
>> jnz l1
>> bsr edx, dword ptr a
>> mov eax, 63
>> jnz l1
>> mov edx, -1
>> l1: sub eax, edx
>> }
>>}
>>
>>Portable C version is not terribly efficient (it's trashing the cache), but good
>>enough. Assembly version is very fast on P6/PII/PIII, and you can speed it up if
>>you'll never pass zero as an argument (worst case would be ~15 CPU ticks,
>>average ~10). It's very easy to write portable C version that would be very
>>efficient on IA-64, something like this:
>>
>>int FirstOne (__int64 arg) {
>> __int64 result = 0;
>>
>> if (arg > 0xFFFFFFFF) {
>> arg >>= 32;
>> result = 32;
>> }
>> if (arg > 0xFFFF) {
>> arg >>= 16;
>> result += 16;
>> }
>> if (arg > 0xFF) {
>> arg >>= 8;
>> result += 8;
>> }
>> return first_one_8bit[arg];
>>}
>>
>>There would be no branches after compiler would optimize the function, and
>>execution time would always be 9 CPU ticks (assuming that 256-bytes
>>first_one_8bit[] is in the L1 cache).
>>
>>It's easy to write LastOne() in the same manner.
>>
>>Eugene
>
>Thanks Eugene. I checked my documentation concerning the clock cycles
>of BSF/BSR. These instruction are horribly slow (80-100 cycles) and
>that's why I always have avoided them. Now I read this:
>
> BSF and BSR are the poorest optimized instructions on the PPlain
> and PMMX, taking approximately 11 + 2*n clock cycles, where n is
> the number of zeros skipped.
>
>and:
>
> ... on the PPro, PII and PIII, where the bit scan instructions take
> only 1 or 2 clocks.
>
>1 or 2 clocks. That is a giant improvement. Now I wonder if it is true.
>
>Ed
That's true, at least on latest Intel processors (don't know about AMD).
According to Intel's documentation they are translated to 3 micro-ops.
Actually, on P6-compatible CPU (or any RISC CPU with conditional move) you can
achieve similar effect using 'conditional move' instruction. My latest C example
can be converted to something like
push edi
xor eax, eax
cmp eax, dword ptr [esp+8]
sbb eax, eax
mov ecx, eax
neg eax
and ecx, 32
mov eax, dword ptr [esp+8+4*eax]
mov edx, eax
lea edi, [ecx+16]
shr edx, 16
test eax, 0FFFF0000H
cmovne eax, edx
cmovne ecx, edi
mov edx, eax
lea edi, [ecx+8]
shr edx, 8
test eax, 0FF00H
cmovne eax, edx
cmovne ecx, edi
movzx eax, byte ptr first_one_8bit[eax]
pop edi
add eax, ecx
ret
(code is directly from my head and is not tested).
Eugene
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.