Computer Chess Club Archives


Subject: Re: Bitboards !! :)

Author: Gerd Isenberg

Date: 13:56:50 07/01/04


On July 01, 2004 at 15:49:21, Robert Hyatt wrote:

>On July 01, 2004 at 15:02:34, Gerd Isenberg wrote:
>
>>On July 01, 2004 at 11:13:14, Robert Hyatt wrote:
>>
>>>On July 01, 2004 at 02:50:35, Tony Werten wrote:
>>>
>>>>Hi all,
>>>>
>>>>although I like the principle of bitboards, it really bothers me that I can't
>>>>seem to find a decent/fast way to evaluate weighted safe squares.
>>>>
>>>>Suppose I want to (simple) evaluate a rook, I generate a bitboard with all
>>>>reachable squares and mask off the squares attacked by lower pieces (that's no
>>>>problem).
>>>>
>>>>(This doesn't exactly generate safe squares, only the ones that aren't attacked
>>>>at all by the opponent's pieces are, for the remaining squares one would need a SEE,
>>>>but that's not the point )
>>>>
>>>>Now I can use this bitboard ( say rook on e4 ), mask the rank state, and look in
>>>>a precomputed table how this rankstate scores on an e rank. No problem.
>>>>
>>>>But how to do the files ? If I use the rotated board, I need to have the
>>>>opponent's attackboard in this rotated board as well, which would be very costly
>>>>to compute (ie also for the bishops,queens ) and very complicated.
>>>>
>>>>Any ideas ? Am I missing something ?
>>>>
>>>>BTW, doing a popcount isn't a solution, since it violates the elegance of
>>>>bitboards ( and is slow ?)
>>>>
>>>>Tony
>>>
>>>
>>>On the Cray there is an elegant solution, but not on X86 so far...
>>>
>>>You can create a 64-word vector of "weights".  How you compute these is up to
>>>you.  In Cray Blitz I did this as I did the evaluation, figuring out which
>>>squares were weak, unimportant, strong, useful, painful for opponent, etc.
>>>After the normal eval, I had a vector of values, one per square for all squares
>>>on the board.  Now I computed the "attack bitmap" for a piece, and stuck that in
>>>the vector mask register.  Now when I sum up the square value vector, it only
>>>sums the values with a corresponding bit mask of 1, meaning this piece attacks
>>>that square safely.
>>
>>Wow great, a scalar product 64word*64bit.
>>Was it implemented in hardware or a kind of micro-program?
>
>Took a couple of instructions.  "vector mask" selects the words you want, you
>pipe them into a "reduction" operation successively to collapse N words to 1
>final sum.  This "chains" so it takes essentially no extra time to do, which is
>cute.. :)  But no similar facility on non-cray cpus to date...
>
>
>
>
>>
>>>
>>>I obviously don't do that at present, since X86 has no such direct capability
>>>and the software approach is expensive...
>>
>>Thinking about some appropriate SSE2-instructions for that scalar product, eg.
>>64 bytes * 64 bit. Four 128-bit (16 byte) xmm registers where each byte is
>>associated with one bit of the other operand.
>>
>>One subtask, may be the most expensive, is to expand each bit to one byte, so
>>that 1 becomes 0xff. From 64-bit word to four times 128-bit words.
>>
>
>
>I hate corresponding with you.  I end up with a _headache_ every last time you
>start that stuff.  :)

One way to expand the bits to bytes is a table lookup: each 16-bit word of the
bitboard indexes a table whose entries are 128 bits wide (65536 entries of 16
bytes, i.e. 1 MB). But I guess there is a smarter, less memory-hungry approach
with pure register calculation. It's simply a "sign extension" of each bit from
one to eight bits ;-)
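
As a rough illustration of the "pure register calculation" idea, here is a portable C sketch that expands eight bits into eight bytes of 0x00/0xFF inside one 64-bit register. It is not taken from the post: the function names expand_bits_to_bytes and expand_bitboard are hypothetical, and this SWAR (multiply and mask) trick is only one possible way to do the expansion, not necessarily the one intended above.

```c
#include <stdint.h>

/* Expand the 8 bits of b into 8 bytes: byte i of the result is 0xFF
   if bit i of b is set, otherwise 0x00. (Hypothetical helper.) */
uint64_t expand_bits_to_bytes(uint8_t b)
{
    uint64_t x = b * 0x0101010101010101ULL; /* copy b into all 8 bytes     */
    x &= 0x8040201008040201ULL;             /* byte i keeps only bit i     */
    x += 0x7F7F7F7F7F7F7F7FULL;             /* nonzero byte -> top bit set */
    x &= 0x8080808080808080ULL;             /* isolate the top bits        */
    return (x >> 7) * 0xFFULL;              /* 0x01 -> 0xFF in each byte   */
}

/* Expand a 64-bit bitboard into 64 mask bytes, one rank (8 bits) at a time.
   mask[i] corresponds to bit i of bb. (Hypothetical helper.) */
void expand_bitboard(uint64_t bb, uint8_t mask[64])
{
    for (int r = 0; r < 8; r++) {
        uint64_t x = expand_bits_to_bytes((uint8_t)(bb >> (8 * r)));
        for (int i = 0; i < 8; i++)
            mask[8 * r + i] = (uint8_t)(x >> (8 * i));
    }
}
```

The bytes are extracted by shifting rather than by a memory copy, so the sketch does not depend on byte order.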

Then you have two 64-byte vectors, one with the weights, the other with binary
masks 0 and -1 for each initial bit.

Inside a byte loop:

  for (i = 0, sum = 0; i < 64; i++)
     sum += weight[i] & mask[i]; // either 0 or weight[i]
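
For checking a faster version, a scalar reference can be handy. The following minimal sketch assumes unsigned byte weights and builds each mask byte on the fly from the bitboard; the name masked_weight_sum is made up for illustration.

```c
#include <stdint.h>

/* Sum weight[i] over every square i whose bit is set in bb.
   Scalar reference version; masked_weight_sum is a hypothetical name. */
unsigned masked_weight_sum(const uint8_t weight[64], uint64_t bb)
{
    unsigned sum = 0;
    for (int i = 0; i < 64; i++) {
        /* 0 - bit gives 0 or -1, which truncates to 0x00 or 0xFF */
        uint8_t mask = (uint8_t)-(int)((bb >> i) & 1);
        sum += weight[i] & mask; /* either 0 or weight[i] */
    }
    return sum;
}
```

With byte weights the result is at most 64 * 255 = 16320, so it fits easily into 16 bits.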

With four xmm registers, assuming the expanded bitboard is in xmm0..xmm3:

  mov  rax, [weight] ; load pointer of the aligned weight vector
  pand xmm0, xmmword ptr [rax + 0]
  pand xmm1, xmmword ptr [rax + 16]
  pand xmm2, xmmword ptr [rax + 32]
  pand xmm3, xmmword ptr [rax + 48]

Then four times psadbw (Packed Sum of Absolute Differences of Bytes
Into a Word) with zero, which adds up the eight unsigned bytes of each
64-bit half:

  pxor   xmm4, xmm4 ; zero, may be scheduled a bit earlier
  psadbw xmm0, xmm4
  psadbw xmm1, xmm4
  psadbw xmm2, xmm4
  psadbw xmm3, xmm4

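  paddd  xmm0, xmm1
  paddd  xmm2, xmm3
  paddd  xmm2, xmm0 ; one partial sum in each 64-bit half

  movdqa     xmm0, xmm2 ; copy, so both halves of the total are available
  punpckhqdq xmm0, xmm0 ; high half down into the low half
  paddd      xmm0, xmm2 ; final sum in the low 64 bits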

  movq   rax, xmm0  ; should be avoided, better pass through memory

The Microsoft C compiler (MSC) for AMD64 has the appropriate intrinsics.
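
A minimal sketch of the same sequence with the SSE2 intrinsics from <emmintrin.h> might look as follows. The function name masked_sum_sse2 is hypothetical, the mask bytes are assumed to be already expanded to 0x00/0xFF, and unaligned loads are used here so that the caller does not have to align the arrays (the assembly above assumes an aligned weight vector). Since psadbw treats its operands as unsigned bytes, the weights must be in the range 0..255.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Masked sum of 64 unsigned byte weights.
   Each mask byte must be 0x00 or 0xFF. (Hypothetical helper.) */
unsigned masked_sum_sse2(const uint8_t weight[64], const uint8_t mask[64])
{
    const __m128i zero = _mm_setzero_si128();
    __m128i total = zero;

    for (int i = 0; i < 64; i += 16) {
        __m128i w = _mm_loadu_si128((const __m128i *)(weight + i));
        __m128i m = _mm_loadu_si128((const __m128i *)(mask + i));
        /* pand, then psadbw against zero: sums 8 bytes per 64-bit half */
        __m128i s = _mm_sad_epu8(_mm_and_si128(w, m), zero);
        total = _mm_add_epi32(total, s); /* paddd */
    }

    /* total holds one partial sum in each 64-bit half: fold high onto low */
    __m128i high = _mm_unpackhi_epi64(total, total); /* punpckhqdq */
    total = _mm_add_epi32(total, high);
    return (unsigned)_mm_cvtsi128_si32(total);       /* movd */
}
```

The loop is written for readability; a compiler will typically unroll it into the four pand/psadbw pairs shown above.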

I am curious how fast that will be on my new AMD64 Shuttle box after the WCCC.
It still works in 32-bit mode (using xmm0..xmm7) with my current 32-bit
development tools.



Last modified: Thu, 15 Apr 21 08:11:13 -0700
