Author: Frank E. Oldham
Date: 15:22:37 07/01/04
On July 01, 2004 at 18:12:15, Robert Hyatt wrote:

>On July 01, 2004 at 16:56:50, Gerd Isenberg wrote:
>
>>On July 01, 2004 at 15:49:21, Robert Hyatt wrote:
>>
>>>On July 01, 2004 at 15:02:34, Gerd Isenberg wrote:
>>>
>>>>On July 01, 2004 at 11:13:14, Robert Hyatt wrote:
>>>>
>>>>>On July 01, 2004 at 02:50:35, Tony Werten wrote:
>>>>>
>>>>>>Hi all,
>>>>>>
>>>>>>although I like the principle of bitboards, it really bothers me that I can't
>>>>>>seem to find a decent/fast way to evaluate weighted safe squares.
>>>>>>
>>>>>>Suppose I want to (simple) evaluate a rook, I generate a bitboard with all
>>>>>>reachable squares and mask off the squares attacked by lower pieces (that's no
>>>>>>problem).
>>>>>>
>>>>>>(This doesn't exactly generate safe squares, only the ones that aren't attacked
>>>>>>at all by the opponent's pieces are, for the remaining squares one would need a
>>>>>>SEE, but that's not the point )
>>>>>>
>>>>>>Now I can use this bitboard ( say rook on e4 ), mask the rank state, and look in
>>>>>>a precomputed table how this rankstate scores on an e rank. No problem.
>>>>>>
>>>>>>But how to do the files ? If I use the rotated board, I need to have the
>>>>>>opponent's attackboard in this rotated board as well, which would be very costly
>>>>>>to compute (i.e. also for the bishops, queens ) and very complicated.
>>>>>>
>>>>>>Any ideas ? Am I missing something ?
>>>>>>
>>>>>>BTW, doing a popcount isn't a solution, since it violates the elegance of
>>>>>>bitboards ( and is slow ?)
>>>>>>
>>>>>>Tony
>>>>>
>>>>>
>>>>>On the Cray there is an elegant solution, but not on X86 so far...
>>>>>
>>>>>You can create a 64-word vector of "weights". How you compute these is up to
>>>>>you. In Cray Blitz I did this as I did the evaluation, figuring out which
>>>>>squares were weak, unimportant, strong, useful, painful for opponent, etc.
>>>>>After the normal eval, I had a vector of values, one per square for all squares
>>>>>on the board. Now I computed the "attack bitmap" for a piece, and stuck that in
>>>>>the vector mask register. Now when I sum up the square value vector, it only
>>>>>sums the values with a corresponding bit mask of 1, meaning this piece attacks
>>>>>that square safely.
>>>>
>>>>Wow great, a scalar product 64word*64bit.
>>>>Was it implemented in hardware or a kind of micro-program?
>>>
>>>Took a couple of instructions. "vector mask" selects the words you want, you
>>>pipe them into a "reduction" operation successively to collapse N words to 1
>>>final sum. This "chains" so it takes essentially no extra time to do, which is
>>>cute.. :) But no similar facility on non-cray cpus to date...
>>>
>>>>>
>>>>>I obviously don't do that at present, since X86 has no such direct capability
>>>>>and the software approach is expensive...
>>>>
>>>>Thinking about some appropriate SSE2-instructions for that scalar product, eg.
>>>>64 bytes * 64 bit. Four 128-bit (16Byte) xmm registers where each byte is
>>>>associated with one bit of the other operand.
>>>>
>>>>One subtask, maybe the most expensive, is to expand each bit to one byte, so
>>>>that 1 becomes 0xff. From 64-bit word to four times 128-bit words.
>>>>
>>>
>>>
>>>I hate corresponding with you. I end up with a _headache_ every last time you
>>>start that stuff. :)
>>
>>One way to expand the bit to bytes is with 16-bit word lookups getting 128 bits.
>>But I guess there is a smarter, less memory expensive approach with pure
>>register calculation. It's simply a "sign extension" from one to eight bits each
>>;-)
>>
>>Then you have two 64 byte vectors, one with the weights, the other with binary
>>masks 0 and -1 for each initial bit.
>>
>>Inside a byte loop:
>>
>>  for (i=0, sum = 0; i < 64; i++)
>>     sum += weight[i] & mask[i]; // either null or weight[i]
>>
>>With four xmm registers, assume the expanded bitboard in xmm0..3.
>>
>>  mov    rax, [weight]    ; load pointer of the aligned weight vector
>>  pand   xmm0, xmmword ptr [rax + 0]
>>  pand   xmm1, xmmword ptr [rax + 16]
>>  pand   xmm2, xmmword ptr [rax + 32]
>>  pand   xmm3, xmmword ptr [rax + 48]
>>
>>Then four times psadbw (Packed Sum of Absolute Differences of Bytes
>>Into a Word) with zero:
>>
>>  pxor   xmm4, xmm4       ; zero, may be scheduled a bit earlier
>>  psadbw xmm0, xmm4
>>  psadbw xmm1, xmm4
>>  psadbw xmm2, xmm4
>>  psadbw xmm3, xmm4
>>
>>  paddd  xmm0, xmm1
>>  paddd  xmm2, xmm3
>>  paddd  xmm2, xmm0       ; two final sums in each 64-bit word
>>
>>  pshufd xmm0, xmm2, 0EEh ; copy high qword of xmm2 into low qword of xmm0
>>                          ; (punpckhqdq xmm0, xmm2 would put the old high
>>                          ; qword of xmm0 there instead)
>>  paddd  xmm0, xmm2       ; final sum in the low dword
>>
>>  movq   rax, xmm0        ; should be avoided, better pass through memory
>>
>>msc for AMD64 has appropriate intrinsics.
>>
>>I am curious how fast that will be after WCCC on my new AMD64 Shuttle box.
>>That works still (with xmm0..xmm7) in 32-bit mode with my current 32-bit
>>development tools.
>
>
>I am curious why your head doesn't explode.
>
>:)

I want to see the AltiVec perm unit code too; that'll probably only make our
eyes cross :-)

Frank
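
For anyone who wants to play with the SSE2 idea quoted above without writing assembly, here is a rough sketch in C with intrinsics. It is only my reading of the scheme, not code from anyone in this thread. The names masked_sum, masked_sum_scalar, expand16 and masked_sum_selfcheck are made up for illustration, and I assume unsigned byte weights with bit i of the bitboard matching weight[i]. The bit-to-byte expansion uses unpack plus compare, which is one possible pure-register approach and not necessarily the one Gerd has in mind. On a compiler without SSE2 it falls back to the plain loop.

```c
#include <assert.h>
#include <stdint.h>

/* Reference version: add weight[i] for every set bit i of bb. */
unsigned masked_sum_scalar(uint64_t bb, const uint8_t weight[64])
{
    unsigned sum = 0;
    for (int i = 0; i < 64; i++)
        if ((bb >> i) & 1)
            sum += weight[i];
    return sum;
}

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>

/* Expand 16 bits to 16 bytes: byte j is 0xFF if bit j is set, else 0. */
static __m128i expand16(unsigned bits)
{
    /* Byte j of this constant holds 1 << (j % 8). */
    const __m128i bitsel = _mm_set1_epi64x((long long)0x8040201008040201ULL);
    __m128i x = _mm_cvtsi32_si128((int)(bits & 0xFFFFu));
    x = _mm_unpacklo_epi8(x, x);   /* b0 b0 b1 b1 ...           */
    x = _mm_unpacklo_epi16(x, x);  /* b0 x4, b1 x4 ...          */
    x = _mm_unpacklo_epi32(x, x);  /* b0 x8 in low half, b1 x8  */
    return _mm_cmpeq_epi8(_mm_and_si128(x, bitsel), bitsel);
}

/* SSE2 version: mask the weights, then sum bytes with psadbw against zero. */
unsigned masked_sum(uint64_t bb, const uint8_t weight[64])
{
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;
    for (int k = 0; k < 4; k++) {
        __m128i mask = expand16((unsigned)(bb >> (16 * k)));
        __m128i w = _mm_loadu_si128((const __m128i *)(weight + 16 * k));
        /* psadbw leaves one partial sum in each 64-bit half. */
        acc = _mm_add_epi32(acc, _mm_sad_epu8(_mm_and_si128(w, mask), zero));
    }
    /* Fold the high half onto the low half and read the low dword. */
    acc = _mm_add_epi32(acc, _mm_unpackhi_epi64(acc, acc));
    return (unsigned)_mm_cvtsi128_si32(acc);
}
#else
/* No SSE2 available: fall back to the reference loop. */
unsigned masked_sum(uint64_t bb, const uint8_t weight[64])
{
    return masked_sum_scalar(bb, weight);
}
#endif

/* Compare both versions on a stream of pseudo-random bitboards. */
void masked_sum_selfcheck(void)
{
    uint8_t w[64];
    for (int i = 0; i < 64; i++)
        w[i] = (uint8_t)(3 * i + 1);
    uint64_t bb = 0x0123456789ABCDEFULL;
    for (int n = 0; n < 1000; n++) {
        assert(masked_sum(bb, w) == masked_sum_scalar(bb, w));
        bb = bb * 6364136223846793005ULL + 1442695040888963407ULL;
    }
}
```

Since psadbw treats bytes as unsigned, negative square weights would need a bias added beforehand and subtracted afterwards (bias times the number of set bits), which brings the popcount back in. Whether this beats a table lookup per rank is something I have not measured.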
Last modified: Thu, 15 Apr 21 08:11:13 -0700