Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: Bitboards !! :)

Author: Tony Werten
Date: 01:12:04 07/02/04
On July 01, 2004 at 16:56:50, Gerd Isenberg wrote:

>On July 01, 2004 at 15:49:21, Robert Hyatt wrote:
>
>>On July 01, 2004 at 15:02:34, Gerd Isenberg wrote:
>>
>>>On July 01, 2004 at 11:13:14, Robert Hyatt wrote:
>>>
>>>>On July 01, 2004 at 02:50:35, Tony Werten wrote:
>>>>
>>>>>Hi all,
>>>>>
>>>>>although I like the principle of bitboards, it really bothers me that I can't
>>>>>seem to find a decent/fast way to evaluate weighted safe squares.
>>>>>
>>>>>Suppose I want to (simple) evaluate a rook, I generate a bitboard with all
>>>>>reachable squares and mask off the squares attacked by lower pieces (that's no
>>>>>problem).
>>>>>
>>>>>(This doesn't exacly generate safe squares, only the ones that aren't attacked
>>>>>at all by opponents pieces are, for the remaining squares one would need a SEE,
>>>>>but that's not the point )
>>>>>
>>>>>Now I can use this bitboard ( say rook on e4 ), mask the rank state, and look in
>>>>>a precomputed table how this rankstate scores on an e rank. No problem.
>>>>>
>>>>>But how to do the files ? If I use the rotated board, I need to have the
>>>>>opponents attackboard in this rotated board as well, wich would be very costly
>>>>>to compute (ie also for the bishops,queens ) and very complicated.
>>>>>
>>>>>Any ideas ? Am I missing something ?
>>>>>
>>>>>BTW, doing a popcount isn't a solution, since it violates the elegance of
>>>>>bitboards ( and is slow ?)
>>>>>
>>>>>Tony
>>>>
>>>>
>>>>On the Cray there is an elegant solution, but not on X86 so far...
>>>>
>>>>You can create a 64-word vector of "weights".  How you compute these is up to
>>>>you.  In Cray Blitz I did this as I did the evaluation, figuring out which
>>>>squares were weak, unimportant, strong, useful, painful for opponent, etc.
>>>>After the normal eval, I had a vector of values, one per square for all squares
>>>>on the board.  Now I computed the "attack bitmap" for a piece, and stuck that in
>>>>the vector mask register.  Now when I sum up the square value vector, it only
>>>>sums the values with a corresponding bit mask of 1, meaning this piece attacks
>>>>that square safely.
>>>
>>>Wow great, a scalar product 64word*64bit.
>>>Was it implemented in hardware or a kind of micro-program?
>>
>>Took a couple of instructions.  "vector mask" selects the words you want, you
>>pipe them into a "reduction" operation successively to collapse N words to 1
>>final sum.  This "chains" so it takes essentially no extra time to do, which is
>>cute.. :)  But no similar facility on non-cray cpus to date...
>>
>>
>>
>>
>>>
>>>>
>>>>I obviously don't do that at present, since X86 has no such direct capability
>>>>and the software approach is expensive...
>>>
>>>Thinking about some oppropriate SSE2-instructions for that scalar product, eg.
>>>64 bytes * 64 bit. Four 128-bit (16Byte) xmmm registers where each byte is
>>>associated with one bit of the other operand.
>>>
>>>One subtask, may be the most expensive, is to expand each bit to one byte, so
>>>that 1 becomes 0xff. From 64-bit word to four times 128-bit words.
>>>
>>
>>
>>I hate corresponding with you.  I end up with a _headache_ every last time you
>>start that stuff.  :)
>
>One way to expand the bit to bytes is with 16-bit word lookups getting 128 bits.
>But i guess there is a smarter, less memory expensive approach with pure
>register calculation. It's simply a "sign extension" from one to eight bits each
>;-)

That part doesn't have to be too slow. Read the bitboard with bytes at a time.
Write them (from a lookup table ) as expanded 64 bits. Now you have an byte
array[64] of "booleans".
Only 8 lookups and 8 writes needed.

I was hoping there would be a way to approach a bboard as an array[64] of
booleans and expend them to bytebooleans somehow.

>
>Than you have two 64 byte vectors, one with the weights, the other with binary
>masks 0 and -1 for each initial bit.
>
>Inside a byte loop:
>
>  for (i=0, sum = 0; i < 64; i++)
>     sum += weight[i] & mask[i]; // either null or weight[i]

This is the expensive part I think.

>
>With four xmm registers, assume the expanded bitboard in xmm0..3.
>
>  mov  rax, [weight] ; load pointer of the aligned weight vector
>  pand xmm0, xmm ptr [rax + 0]
>  pand xmm1, xmm ptr [rax + 16]
>  pand xmm2, xmm ptr [rax + 32]
>  pand xmm3, xmm ptr [rax + 48]
>
>Then four times psadbw (Packed Sum of Absolute Differences of Bytes
>Into a Word) with zero:
>
>  pxor   xmm4, xmm4 ; zero, may be scheduled a bit earlier
>  psadbw xmm0, xmm4
>  psadbw xmm1, xmm4
>  psadbw xmm2, xmm4
>  psadbw xmm3, xmm4
>
>  paddd  xmm0, xmm1
>  paddd  xmm2, xmm3
>  paddd  xmm2, xmm0 ; two final sums in each 64-bit word
>
>  PUNPCKHQDQ xmm0, xmm2
>  paddd  xmm0, xmm2 ; final sums
>
>  movq   rax, xmm0  ; should be avoided, better pass through memory
>
>msc for AMD64 has appropriate intrinsics.
>
>I am curious how fast that will be after WCCC on my new AMD64 Shuttle box.
>That works still (with xmm0..xmm7) in 32-bit mode with my current 32-bit
>development tools.

I thought something like this is possible in mmx/xmm, just didn't know how.
Isn't there a way to mulptiply 8 bytes with 8 bytes, assuming every mulptiply
doesn't exceed 255 ? (wich is safe because 1 byte array contains only 0 or 1)

Ah, suddenly got your point. Don't expand to 0 and 1 but to 0 and 0xFF so you
can simply and every factor in. Maybe if the factor is smaller (0..15), 2 can
fit in 1 byte. So only 2 xmm registers or 4 mmx/64 bits registers are used. With
4 general purpose registers, the slowdown might be minimal.

If the factor is even smaller (0..3) the number of registers can be halfed again
so the only cost will be the expanding and collecting/adding of the bonusses.

I'm interested in your speed experiments on this all. Keep me (us) informed and
good luck in Israel.

Tony
Re: Bitboards !! :) Gerd Isenberg 22:39:33 07/04/04
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.