Author: Gerd Isenberg
Date: 10:21:23 07/19/04
On July 19, 2004 at 10:59:05, Anthony Cozzie wrote:
>On July 18, 2004 at 15:33:33, Gerd Isenberg wrote:
>
>>
>>>I am guessing something like 50 cycles? Really not that bad . . . probably
>>>close to the speed of a scan over attack tables.
>>>
>>>anthony
>>
>>14.45ns on a 2.2GHz Athlon64, ~32 cycles now.
>>
>>Some minor changes, byte vector values (weights) 0..63, therefore only one
>>psadbw, no movd but two pextrw, final add with gp. Computed bit masks in two
>>xmm-registers (0x02:0x01). Some better instruction scheduling.
>>
>>Gerd
>
>If you would ship me the new code I would be much obliged (acozzie@verizon.net).
> I am concentrating on parallel code right now, but once that is done I am going
>to do some serious work on my eval. I want to prove Vincent wrong that a good
>eval cannot be done with bitboards :)
>
>32 cycles is _really_ good. I think that on average rotated bitboard attack
>generation is 20 cycles, so that is 50 cycles / piece / mobility = 500 cycles
>(~250 ns on my computer) for all pieces, which is really not bad. In fact, 32
>cycles is not that much slower than popcount!
>
>anthony
Contrary to what I said before, the pandn with constant data seems
faster. To precompute the constants one may try:
pxor xmm4, xmm4 ; 0
pcmpeqb xmm5, xmm5 ; -1
psubb xmm4, xmm5 ; 1
pslldq xmm5, 8 ; -1:0
psubb xmm4, xmm5 ; 0x02:0x01
pslld xmm4, 2 ; 0x08:0x04
pslld xmm4, 2 ; 0x20:0x10
pslld xmm4, 2 ; 0x80:0x40
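To check the comments on the sequence above, here is a small portable C sketch that models an xmm register as 16 bytes and replays the same steps. The `Xmm` type and the helpers `psubb`, `pslldq8`, `pslld` and `mask_consts` are hypothetical names made up for this illustration. They only imitate the named instructions in scalar code and are not part of the routine itself. The sketch also assumes a copy of xmm4 is saved after each shift, which the snippet above does not show.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a 128-bit register; b[0] is the lowest byte. */
typedef struct { uint8_t b[16]; } Xmm;

/* Bytewise subtract with wraparound, like psubb. */
static Xmm psubb(Xmm a, Xmm s)
{
    for (int i = 0; i < 16; ++i)
        a.b[i] = (uint8_t)(a.b[i] - s.b[i]);
    return a;
}

/* Shift the whole register left by 8 bytes, like pslldq xmm, 8. */
static Xmm pslldq8(Xmm a)
{
    Xmm r;
    for (int i = 0; i < 8; ++i)
        r.b[i] = 0;
    for (int i = 8; i < 16; ++i)
        r.b[i] = a.b[i - 8];
    return r;
}

/* Shift each 32-bit lane left by n bits (n < 32), like pslld. */
static Xmm pslld(Xmm a, int n)
{
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t v = 0;
        for (int i = 3; i >= 0; --i)
            v = (v << 8) | a.b[4 * lane + i];
        v <<= n;
        for (int i = 0; i < 4; ++i)
            a.b[4 * lane + i] = (uint8_t)(v >> (8 * i));
    }
    return a;
}

/* Replays the sequence; out[k] holds the k-th pair of mask constants. */
void mask_consts(Xmm out[4])
{
    Xmm x4, x5;
    memset(&x4, 0x00, sizeof x4);   /* pxor    xmm4, xmm4 ; 0         */
    memset(&x5, 0xFF, sizeof x5);   /* pcmpeqb xmm5, xmm5 ; -1        */
    x4 = psubb(x4, x5);             /* psubb   xmm4, xmm5 ; 1         */
    x5 = pslldq8(x5);               /* pslldq  xmm5, 8    ; -1:0      */
    x4 = psubb(x4, x5);             /* psubb   xmm4, xmm5 ; 0x02:0x01 */
    out[0] = x4;
    for (int k = 1; k < 4; ++k) {
        x4 = pslld(x4, 2);          /* pslld   xmm4, 2                */
        out[k] = x4;
    }
}
```

Calling `mask_consts` fills four 16-byte values whose low and high halves should match the pairs of entries in the MaskConsts table used by the routine.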
Here is the routine: 24 SSE2 instructions, one direct path, 23 double:
12 * 4 cycles = 48
12 * 2 cycles = 24
Cheers,
Gerd
typedef unsigned __int64 BitBoard;
typedef unsigned char WType;
#define CACHE_LINE 64
#define CACHE_ALIGN __declspec(align(CACHE_LINE))
/*
bit[64]*byte[64] dotProduct for AMD64 in 32-bit mode
(C) CCC 2004
implemented by Gerd Isenberg
initiated by a post from Tony Werten -
and a reply from Bob Hyatt about Cray's vector instructions
parameters:
bb, a 64-bit word as a boolean vector of bits
weights63, a vector of 64 byte values (weights) in the range of 0..63
Note! The weight array should be rotated by 90 degrees:
if both vectors are considered as two-dimensional 8*8 arrays,
the rotation takes place by exchanging row and column.
For all eight rows and columns:
              _
dotProduct =  >  (bb[row][col] ? weights63[col][row] : 0)
              -
return: dot product
*/
int dotProduct(BitBoard bb, WType* weights63)
{
    static const BitBoard CACHE_ALIGN MaskConsts[8] =
    {
        0x0101010101010101, 0x0202020202020202,
        0x0404040404040404, 0x0808080808080808,
        0x1010101010101010, 0x2020202020202020,
        0x4040404040404040, 0x8080808080808080,
    };
    __asm
    {
        movq       xmm0, [bb]       ; get the bb in lower
        lea        eax, [MaskConsts]
        punpcklqdq xmm0, xmm0       ; .. and upper half
        pxor       xmm7, xmm7       ; zero
        movdqa     xmm1, xmm0
        movdqa     xmm2, xmm0
        movdqa     xmm3, xmm0
        pandn      xmm0, [eax]      ; mask the consecutive bits
        pandn      xmm1, [eax+16]
        pcmpeqb    xmm0, xmm7       ; build the -1|0 byte masks
        pcmpeqb    xmm1, xmm7
        pandn      xmm2, [eax+32]
        pandn      xmm3, [eax+48]
        mov        eax, [weights63]
        pcmpeqb    xmm2, xmm7
        pcmpeqb    xmm3, xmm7
        pand       xmm0, [eax]      ; and them with the weights
        pand       xmm1, [eax+16]
        pand       xmm2, [eax+32]
        pand       xmm3, [eax+48]
        paddb      xmm0, xmm1       ; add all 64 bytes, take care of overflow
        paddb      xmm0, xmm2       ; therefore max weight of 63 is safe
        paddb      xmm0, xmm3
        psadbw     xmm0, xmm7       ; horizontal add 2 * 8 byte
        pextrw     edx, xmm0, 4     ; extract both intermediate sums to gp
        pextrw     eax, xmm0, 0
        add        eax, edx         ; final add in gp
    }
}
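For testing, a plain C reference can be handy. The sketch below is one reading of the layout described in the header comment: bit (8*row + col) of bb selects weights63[8*col + row]. The name `dotProductRef` and the use of `uint64_t` are assumptions made for this illustration and do not come from the routine above. Unlike the SSE2 version, it accumulates in an int, so it does not wrap for weights above 63.

```c
#include <stdint.h>

/* Hypothetical scalar reference for the bit[64]*byte[64] dot product.
   Assumes the rotated weight layout: the bit at (row, col) of bb is
   paired with the weight stored at (col, row). */
int dotProductRef(uint64_t bb, const unsigned char *weights63)
{
    int sum = 0;
    for (int row = 0; row < 8; ++row)
        for (int col = 0; col < 8; ++col)
            if ((bb >> (8 * row + col)) & 1)
                sum += weights63[8 * col + row];
    return sum;
}
```

Comparing this against the assembly routine over random bitboards and weights in 0..63 should catch a weight table that was not rotated.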