Author: Gerd Isenberg
Date: 10:21:23 07/19/04
On July 19, 2004 at 10:59:05, Anthony Cozzie wrote:

>On July 18, 2004 at 15:33:33, Gerd Isenberg wrote:
>
>>
>>>I am guessing something like 50 cycles? Really not that bad . . . probably
>>>close to the speed of a scan over attack tables.
>>>
>>>anthony
>>
>>14.45ns on a 2.2GHz Athlon64, ~32 cycles now.
>>
>>Some minor changes, byte vector values (weights) 0..63, therefore only one
>>psadbw, no movd but two pextrw, final add with gp. Computed bit masks in two
>>xmm-registers (0x02:0x01). Some better instruction scheduling.
>>
>>Gerd
>
>If you would ship me the new code I would be much obliged (acozzie@verizon.net).
> I am concentrating on parallel code right now, but once that is done I am going
>to do some serious work on my eval. I want to prove Vincent wrong that a good
>eval cannot be done with bitboards :)
>
>32 cycles is _really_ good. I think that on average rotated bitboard attack
>generation is 20 cycles, so that is 50 cycles / piece / mobility = 500 cycles
>(~250 ns on my computer) for all pieces, which is really not bad. In fact, 32
>cycles is not that much slower than popcount!
>
>anthony

In contradiction to what I said before, the pandn with constant data seems faster.
To precompute the constants one may try

  pxor    xmm4, xmm4   ; 0
  pcmpeqb xmm5, xmm5   ; -1
  psubb   xmm4, xmm5   ; 1
  pslldq  xmm5, 8      ; -1:0
  psubb   xmm4, xmm5   ; 0x02:0x01
  pslld   xmm4, 2      ; 0x08:0x04
  pslld   xmm4, 2      ; 0x20:0x10
  pslld   xmm4, 2      ; 0x80:0x40

Here is the routine, 24 SSE2 instructions, one direct path, 23 double:

  12 * 4 cycles = 48
  12 * 2 cycles = 24

Cheers,
Gerd

typedef unsigned __int64 BitBoard;
typedef unsigned char WType;

#define CACHE_LINE  64
#define CACHE_ALIGN __declspec(align(CACHE_LINE))

/*
  bit[64]*byte[64] dotProduct for AMD64 in 32-bit mode
  (C) CCC 2004
  implemented by Gerd Isenberg
  initiated by a post from Tony Werten -
  and a reply from Bob Hyatt about Cray's vector instructions

  parameters:
    bb, a 64-bit word as a boolean vector of bits
    weights63, a vector of 64 byte values (weights) in the range of 0..63

  Note! the weight array should be rotated by 90 degrees
        if both vectors are considered as two-dimensional 8*8 arrays;
        the rotation takes place by exchanging row and column.

  For all eight rows,columns:
               _
  dotProduct = >  (bb[row][col] ? weights63[col][row] : 0)
               -
  return: dot product
*/
int dotProduct(BitBoard bb, WType* weights63)
{
    static const BitBoard CACHE_ALIGN MaskConsts[8] =
    {
        0x0101010101010101, 0x0202020202020202,
        0x0404040404040404, 0x0808080808080808,
        0x1010101010101010, 0x2020202020202020,
        0x4040404040404040, 0x8080808080808080,
    };
    __asm
    {
        movq       xmm0, [bb]       ; get the bb in lower
        lea        eax, [MaskConsts]
        punpcklqdq xmm0, xmm0       ; .. and upper half
        pxor       xmm7, xmm7       ; zero
        movdqa     xmm1, xmm0
        movdqa     xmm2, xmm0
        movdqa     xmm3, xmm0
        pandn      xmm0, [eax]      ; mask the consecutive bits
        pandn      xmm1, [eax+16]
        pcmpeqb    xmm0, xmm7       ; build the -1|0 byte masks
        pcmpeqb    xmm1, xmm7
        pandn      xmm2, [eax+32]
        pandn      xmm3, [eax+48]
        mov        eax, [weights63]
        pcmpeqb    xmm2, xmm7
        pcmpeqb    xmm3, xmm7
        pand       xmm0, [eax]      ; and them with the weights
        pand       xmm1, [eax+16]
        pand       xmm2, [eax+32]
        pand       xmm3, [eax+48]
        paddb      xmm0, xmm1       ; add all 64 bytes, take care of overflow
        paddb      xmm0, xmm2       ; therefore max weight of 63 is safe
        paddb      xmm0, xmm3
        psadbw     xmm0, xmm7       ; horizontal add 2 * 8 byte
        pextrw     edx, xmm0, 4     ; extract both intermediate sums to gp
        pextrw     eax, xmm0, 0
        add        eax, edx         ; final add in gp, result is left in eax
    }
}
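For checking the SSE2 routine against something simple, a plain C reference may help. This is only a sketch and not part of the posted code: the names dot_product_ref, transpose_weights and lane_mask are made up for illustration, and the index mapping (bit 8*row+col of bb selects weights63[8*col+row]) is my reading of the comment block and of the order in which the mask constants are applied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One byte lane of the pandn + pcmpeqb pair (hypothetical helper):
   pandn yields (~src & bit), pcmpeqb against zero then gives 0xFF
   exactly when that bit is set in src, else 0x00. */
unsigned char lane_mask(unsigned char src, unsigned char bit)
{
    return (unsigned char)((((~src) & bit) == 0) ? 0xFF : 0x00);
}

/* Scalar reference for the bit[64]*byte[64] dot product.
   Assumption: bit (8*row + col) of bb selects weights63[8*col + row],
   i.e. the weight array is stored with row and column exchanged. */
int dot_product_ref(uint64_t bb, const unsigned char *weights63)
{
    int sum = 0;
    assert(weights63 != NULL);
    for (int row = 0; row < 8; ++row)
        for (int col = 0; col < 8; ++col)
            if ((bb >> (8 * row + col)) & 1)
                sum += weights63[8 * col + row];
    return sum;
}

/* Convert weights given in natural bit order (index = bit number)
   into the row/column-exchanged layout the routine expects. */
void transpose_weights(const unsigned char *natural, unsigned char *rotated)
{
    for (int row = 0; row < 8; ++row)
        for (int col = 0; col < 8; ++col)
            rotated[8 * col + row] = natural[8 * row + col];
}
```

With weights kept in natural order, one would call transpose_weights once at setup and pass the rotated array to the dot product. Since every weight is at most 63, the four byte adds before psadbw sum to at most 4 * 63 = 252 per lane, which is why the byte adds cannot overflow.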
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.