Author: Anthony Cozzie
Date: 08:54:44 07/20/04
Go up one level in this thread
On July 19, 2004 at 13:21:23, Gerd Isenberg wrote: >On July 19, 2004 at 10:59:05, Anthony Cozzie wrote: > >>On July 18, 2004 at 15:33:33, Gerd Isenberg wrote: >> >>> >>>>I am guessing something like 50 cycles? Really not that bad . . . probably >>>>close to the speed of a scan over attack tables. >>>> >>>>anthony >>> >>>14.45ns on a 2.2GHz Athlon64, ~32 cycles now. >>> >>>Some minor changes, byte vector values (weights) 0..63, therefore only one >>>psadbw, no movd but two pextrw, final add with gp. Computed bit masks in two >>>xmm-registers (0x02:0x01). Some better instruction scheduling. >>> >>>Gerd >> >>If you would ship me the new code I would be much obliged (acozzie@verizon.net). >> I am concentrating on parallel code right now, but once that is done I am going >>to do some serious work on my eval. I want to prove Vincent wrong that a good >>eval cannot be done with bitboards :) >> >>32 cycles is _really_ good. I think that on average rotated bitboard attack >>generation is 20 cycles, so that is 50 cycles / piece / mobility = 500 cycles >>(~250 ns on my computer) for all pieces, which is really not bad. In fact, 32 >>cycles is not that much slower than popcount! >> >>anthony > >In contradiction to what a said before, the pandn with constant data seems >faster. To precompute the constants one may try > > pxor xmm4, xmm4 ; 0 > pcmpeqb xmm5, xmm5 ; -1 > psubb xmm4, xmm5 ; 1 > pslldq xmm5, 8 ; -1:0 > psubb xmm4, xmm5 ; 0x02:0x01 > pslld xmm4, 2 ; 0x08:0x04 > pslld xmm4, 2 ; 0x20:0x10 > pslld xmm4, 2 ; 0x80:0x40 > >Here the routine, 24 SSE2 instructions, one direct path, 23 double: >12 * 4 cycles = 48 >12 * 2 cycles = 24 > >Cheers, >Gerd > > >typedef unsigned __int64 BitBoard; >typedef unsigned char WType; > >#define CACHE_LINE 64 >#define CACHE_ALIGN __declspec(align(CACHE_LINE)) > > >/* > bit[64]*byte[64] dotProduct for AMD64 in 32-bit mode > (C) CCC 2004 > > implemented by Gerd Isenberg > initiated by a post from Tony Werten - > and a reply from Bob Hyatt about Cray's vector instructions > > parameters: > bb, a 64-bit word as a boolean vector of bits > weights63, a vector of 64 byte values (weights) in the range of 0..63 > > Note! the weight array should be rotated by 90 degree > if both vectors are consideres as two dimensional 8*8 arrays, > the rotation takes place by exchanging row and column. > For all eight rows,columns: > _ > dotProduct = > (bb[row][col] ? weights63[col][row] : 0) > - > return: dot product > >*/ OK, I have gone through the opteron optimization manual. From what I can tell: The opteron can do 2 64-bit memory accesses each cycle (or 1 SSE2 memory access). The cache has a latency of 3 cycles. Most SSE2 instructions are split into 2 64-bit operations and thus occupy the full pipeline (throughput is 1, not 2). They have a latency of 2. cycle estimate 1 movq xmm0, [bb] ; get the bb in lower 2 lea eax, [MaskConsts] 4 punpcklqdq xmm0, xmm0 ; .. and upper half 3 pxor xmm7, xmm7 ; zero 6 movdqa xmm1, xmm0 7 movdqa xmm2, xmm0 8 movdqa xmm3, xmm0 9 pandn xmm0, [eax] ; mask the consecutive bits 10 pandn xmm1, [eax+16] 11 pandn xmm2, [eax+32] 12 pandn xmm3, [eax+48] 13 mov eax, [weights63] 13 pcmpeqb xmm0, xmm7 ; build the -1|0 byte masks 14 pcmpeqb xmm1, xmm7 15 pcmpeqb xmm2, xmm7 16 pcmpeqb xmm3, xmm7 17 pand xmm0, [eax] ; and them with the weights 18 pand xmm1, [eax+16] 19 pand xmm2, [eax+32] 20 pand xmm3, [eax+48] 22 paddb xmm0, xmm1 ; add all 64 bytes, take care of overflow 24 paddb xmm2, xmm3 ; therefore max weight of 63 is safe 26 paddb xmm0, xmm2 28 psadbw xmm0, xmm7 ; horizontal add 2 * 8 byte 32 pextrw edx, xmm0, 4 ; extract both intermediate sums to gp 33 pextrw eax, xmm0, 0 37 add eax, edx ; final add in gp So my estimate is 37 cycles. Of course, it might be possible to get a little bit of overlap near the end if you had other work to do. What do you think of the following? cycle estimate 1 movl xmm0, [bb] 1 movl xmm2, [bb+1] 4 punpcklbw xmm0, xmm0 5 punpcklbw xmm2, xmm2 6 punpcklbw xmm0, xmm0 7 punpcklbw xmm2, xmm2 8 movdqa xmm1, xmm0 9 movdqa xmm3, xmm2 10 punpcklbw xmm0, xmm0 11 punpcklbw xmm2, xmm2 12 punpckhbw xmm1, xmm1 13 punpckhbw xmm3, xmm3 This costs 5 more cycles, but the array is no longer rotated (I think). This means that it is much easier to create the weight array dynamically. I don't know whether it is a win or not. Lastly, are you sure this wouldn't all be faster in the integer pipes? From what I can see, the integer pipes have as much or more computational power as the "vector" units. If the Opteron could execute 4 SSE2 instructions in parallel it would be different, but . . . anthony
This page took 0.01 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.