Author: Gerd Isenberg
Date: 11:23:30 07/20/04
On July 20, 2004 at 11:54:44, Anthony Cozzie wrote:

>On July 19, 2004 at 13:21:23, Gerd Isenberg wrote:
>
>>On July 19, 2004 at 10:59:05, Anthony Cozzie wrote:
>>
>>>On July 18, 2004 at 15:33:33, Gerd Isenberg wrote:
>>>
>>>>
>>>>>I am guessing something like 50 cycles? Really not that bad . . . probably
>>>>>close to the speed of a scan over attack tables.
>>>>>
>>>>>anthony
>>>>
>>>>14.45ns on a 2.2GHz Athlon64, ~32 cycles now.
>>>>
>>>>Some minor changes: byte vector values (weights) 0..63, therefore only one
>>>>psadbw, no movd but two pextrw, final add with gp. Computed bit masks in two
>>>>xmm-registers (0x02:0x01). Some better instruction scheduling.
>>>>
>>>>Gerd
>>>
>>>If you would ship me the new code I would be much obliged (acozzie@verizon.net).
>>>I am concentrating on parallel code right now, but once that is done I am going
>>>to do some serious work on my eval. I want to prove Vincent wrong that a good
>>>eval cannot be done with bitboards :)
>>>
>>>32 cycles is _really_ good. I think that on average rotated bitboard attack
>>>generation is 20 cycles, so that is 50 cycles / piece / mobility = 500 cycles
>>>(~250 ns on my computer) for all pieces, which is really not bad. In fact, 32
>>>cycles is not that much slower than popcount!
>>>
>>>anthony
>>
>>In contradiction to what I said before, the pandn with constant data seems
>>faster. To precompute the constants one may try
>>
>>  pxor    xmm4, xmm4   ; 0
>>  pcmpeqb xmm5, xmm5   ; -1
>>  psubb   xmm4, xmm5   ; 1
>>  pslldq  xmm5, 8      ; -1:0
>>  psubb   xmm4, xmm5   ; 0x02:0x01
>>  pslld   xmm4, 2      ; 0x08:0x04
>>  pslld   xmm4, 2      ; 0x20:0x10
>>  pslld   xmm4, 2      ; 0x80:0x40
>>
>>Here is the routine, 24 SSE2 instructions, one direct path, 23 double:
>>12 * 4 cycles = 48
>>12 * 2 cycles = 24
>>
>>Cheers,
>>Gerd
>>
>>
>>typedef unsigned __int64 BitBoard;
>>typedef unsigned char WType;
>>
>>#define CACHE_LINE  64
>>#define CACHE_ALIGN __declspec(align(CACHE_LINE))
>>
>>
>>/*
>>  bit[64]*byte[64] dotProduct for AMD64 in 32-bit mode
>>  (C) CCC 2004
>>
>>  implemented by Gerd Isenberg
>>  initiated by a post from Tony Werten -
>>  and a reply from Bob Hyatt about Cray's vector instructions
>>
>>  parameters:
>>   bb, a 64-bit word as a boolean vector of bits
>>   weights63, a vector of 64 byte values (weights) in the range of 0..63
>>
>>  Note! The weight array should be rotated by 90 degrees:
>>  if both vectors are considered as two-dimensional 8*8 arrays,
>>  the rotation takes place by exchanging row and column.
>>  For all eight rows and columns:
>>
>>   dotProduct = sum( bb[row][col] ? weights63[col][row] : 0 )
>>
>>  return: dot product
>>
>>*/
>
>OK, I have gone through the Opteron optimization manual. From what I can tell:
>
>The Opteron can do 2 64-bit memory accesses each cycle (or 1 SSE2 memory
>access). The cache has a latency of 3 cycles.
>
>Most SSE2 instructions are split into 2 64-bit operations and thus occupy the
>full pipeline (throughput is 1, not 2). They have a latency of 2.
>
>cycle estimate
>  1 movq       xmm0, [bb]       ; get the bb in lower
>  2 lea        eax, [MaskConsts]
>  4 punpcklqdq xmm0, xmm0       ; .. and upper half
>  3 pxor       xmm7, xmm7       ; zero
>  6 movdqa     xmm1, xmm0
>  7 movdqa     xmm2, xmm0
>  8 movdqa     xmm3, xmm0
>  9 pandn      xmm0, [eax]      ; mask the consecutive bits
> 10 pandn      xmm1, [eax+16]
> 11 pandn      xmm2, [eax+32]
> 12 pandn      xmm3, [eax+48]
> 13 mov        eax, [weights63]
> 13 pcmpeqb    xmm0, xmm7       ; build the -1|0 byte masks
> 14 pcmpeqb    xmm1, xmm7
> 15 pcmpeqb    xmm2, xmm7
> 16 pcmpeqb    xmm3, xmm7
> 17 pand       xmm0, [eax]      ; and them with the weights
> 18 pand       xmm1, [eax+16]
> 19 pand       xmm2, [eax+32]
> 20 pand       xmm3, [eax+48]
> 22 paddb      xmm0, xmm1       ; add all 64 bytes, take care of overflow -
> 24 paddb      xmm0, xmm2       ; therefore a max weight of 63 is safe
> 26 paddb      xmm0, xmm3
> 28 psadbw     xmm0, xmm7       ; horizontal add of 2 * 8 bytes
> 32 pextrw     edx, xmm0, 4     ; extract both intermediate sums to gp
> 33 pextrw     eax, xmm0, 0
> 37 add        eax, edx         ; final add in gp
>
>So my estimate is 37 cycles. Of course, it might be possible to get a little
>bit of overlap near the end if you had other work to do.

Thanks for pointing out the cycle arithmetic. For instance, may an independent
trailing instruction start at cycle 32 too, or even earlier? What about the
parallel use of the "three" 64-bit ALUs, fadd, fmul and load/store?

A dumb loop test of 10**9 runs finishes in less than 11 seconds (~24 cycles),
inlined with a constant parameter; the asm output looks ok. Your version takes
about ~13.75 seconds (~30 cycles).

Btw. I found working three times on one target register slightly faster than
two independent paddbs and one dependent one. The intermediate result may stay
near the ALUs...
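For reference, the same scheme can be written with SSE2 intrinsics instead of
inline asm. This is only a minimal sketch with invented names (dotProduct64,
weights63 as a plain byte array), not the measured routine; the weights must
stay within 0..63 and the array has to use the rotated 8*col+row layout
described in the comment above.

#include <emmintrin.h>                /* SSE2 intrinsics */

typedef unsigned long long BitBoard;  /* unsigned __int64 with msc */

int dotProduct64(BitBoard bb, const unsigned char weights63[64])
{
   const __m128i zero  = _mm_setzero_si128();
   /* one single bit per byte: low qword tests bit 2k, high qword bit 2k+1 */
   const __m128i mask0 = _mm_set_epi8(2,2,2,2,2,2,2,2, 1,1,1,1,1,1,1,1);
   const __m128i mask1 = _mm_slli_epi32(mask0, 2);    /* 0x08 : 0x04 */
   const __m128i mask2 = _mm_slli_epi32(mask1, 2);    /* 0x20 : 0x10 */
   const __m128i mask3 = _mm_slli_epi32(mask2, 2);    /* 0x80 : 0x40 */

   __m128i b = _mm_loadl_epi64((const __m128i*)&bb);  /* movq       */
   b = _mm_unpacklo_epi64(b, b);                      /* punpcklqdq */

   /* (~bb) & mask == 0 in a byte  <=>  the tested bit of bb is set */
   __m128i x0 = _mm_cmpeq_epi8(_mm_andnot_si128(b, mask0), zero);
   __m128i x1 = _mm_cmpeq_epi8(_mm_andnot_si128(b, mask1), zero);
   __m128i x2 = _mm_cmpeq_epi8(_mm_andnot_si128(b, mask2), zero);
   __m128i x3 = _mm_cmpeq_epi8(_mm_andnot_si128(b, mask3), zero);

   /* keep the weights of the set bits, drop the others */
   x0 = _mm_and_si128(x0, _mm_loadu_si128((const __m128i*)(weights63 +  0)));
   x1 = _mm_and_si128(x1, _mm_loadu_si128((const __m128i*)(weights63 + 16)));
   x2 = _mm_and_si128(x2, _mm_loadu_si128((const __m128i*)(weights63 + 32)));
   x3 = _mm_and_si128(x3, _mm_loadu_si128((const __m128i*)(weights63 + 48)));

   /* byte-wise sums stay below 4*63 = 252, so paddb cannot overflow;
      psadbw does the horizontal add, the final add happens in a gp register */
   __m128i sum = _mm_sad_epu8(_mm_add_epi8(_mm_add_epi8(x0, x1),
                                           _mm_add_epi8(x2, x3)), zero);
   return _mm_extract_epi16(sum, 0) + _mm_extract_epi16(sum, 4);
}

With a CACHE_ALIGNed weight array the _mm_loadu_si128 calls may become
_mm_load_si128, matching the pand xmm, [mem] forms in the listing above.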
>What do you think of the following?
>
>cycle estimate
>  1 movd       xmm0, [bb]
>  1 movd       xmm2, [bb+4]
>  4 punpcklbw  xmm0, xmm0
>  5 punpcklbw  xmm2, xmm2
>  6 punpcklbw  xmm0, xmm0
>  7 punpcklbw  xmm2, xmm2
>  8 movdqa     xmm1, xmm0
>  9 movdqa     xmm3, xmm2
> 10 punpcklbw  xmm0, xmm0
> 11 punpcklbw  xmm2, xmm2
> 12 punpckhbw  xmm1, xmm1
> 13 punpckhbw  xmm3, xmm3
>
>This costs 5 more cycles, but the array is no longer rotated (I think). This
>means that it is much easier to create the weight array dynamically. I don't
>know whether it is a win or not.

What a fine shuffling sequence, looks great! Another advantage might be the use
of only one mask constant,

  0x8040201008040201:0x8040201008040201

which may be preloaded into one xmm-register.

For dynamic weight arrays, I would "paddb" some precalculated arrays, e.g.
indexed by the opponent's and/or own (king) square, while they are already
inside four xmm-registers. Or incrementally updated weight tables for w/b
pieces of all kinds?

>Lastly, are you sure this wouldn't all be faster in the integer pipes? From
>what I can see, the integer pipes have as much or more computational power
>than the "vector" units.

I guess you have to work hard with 64-bit gp-registers to get better
performance than with the slow SSE2 instructions - but teach me the lesson, I
just saw your code - maybe some unrolling?

With the newest MSC compiler for AMD64 and SSE2 intrinsics, the compiler may
schedule some other, possibly independent gp-instructions in between. This,
and the hope for better SSE2 performance (four times?) in the future, makes me
consider this dotProduct routine in my next approach, which I am already
working on.

>If the Opteron could execute 4 SSE2 instructions in
>parallel it would be different, but . . .
>
>anthony

Hope 128-bit xmm-ALUs come true soon, and maybe some more... At least before I
buy my next box ;-)

Cheers,
Gerd
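PS: for completeness, a corresponding intrinsics sketch of your unpack
sequence, using the single 0x8040201008040201 mask constant and a weight array
in natural square order. Again only an illustration with invented names, not a
measured implementation.

#include <emmintrin.h>

typedef unsigned long long BitBoard;

int dotProduct64u(BitBoard bb, const unsigned char weights[64])
{
   const __m128i zero = _mm_setzero_si128();
   const __m128i bits = _mm_set_epi8((char)0x80,0x40,0x20,0x10,8,4,2,1,
                                     (char)0x80,0x40,0x20,0x10,8,4,2,1);

   /* replicate each byte of bb over eight consecutive byte lanes */
   __m128i lo = _mm_cvtsi32_si128((int)(bb & 0xffffffffu)); /* bytes 0..3 */
   __m128i hi = _mm_cvtsi32_si128((int)(bb >> 32));         /* bytes 4..7 */
   lo = _mm_unpacklo_epi8(lo, lo);            /* two copies of each byte  */
   hi = _mm_unpacklo_epi8(hi, hi);
   lo = _mm_unpacklo_epi8(lo, lo);            /* four copies              */
   hi = _mm_unpacklo_epi8(hi, hi);
   __m128i x0 = _mm_unpacklo_epi8(lo, lo);    /* bytes 0,1 -> bits  0..15 */
   __m128i x1 = _mm_unpackhi_epi8(lo, lo);    /* bytes 2,3 -> bits 16..31 */
   __m128i x2 = _mm_unpacklo_epi8(hi, hi);    /* bytes 4,5 -> bits 32..47 */
   __m128i x3 = _mm_unpackhi_epi8(hi, hi);    /* bytes 6,7 -> bits 48..63 */

   /* one bit per lane: a lane becomes 0xFF iff its square's bit is set */
   x0 = _mm_cmpeq_epi8(_mm_and_si128(x0, bits), bits);
   x1 = _mm_cmpeq_epi8(_mm_and_si128(x1, bits), bits);
   x2 = _mm_cmpeq_epi8(_mm_and_si128(x2, bits), bits);
   x3 = _mm_cmpeq_epi8(_mm_and_si128(x3, bits), bits);

   /* weights indexed directly by square number, no rotation needed */
   x0 = _mm_and_si128(x0, _mm_loadu_si128((const __m128i*)(weights +  0)));
   x1 = _mm_and_si128(x1, _mm_loadu_si128((const __m128i*)(weights + 16)));
   x2 = _mm_and_si128(x2, _mm_loadu_si128((const __m128i*)(weights + 32)));
   x3 = _mm_and_si128(x3, _mm_loadu_si128((const __m128i*)(weights + 48)));

   /* same tail as before: paddb (weights <= 63), psadbw, final gp add */
   __m128i sum = _mm_sad_epu8(_mm_add_epi8(_mm_add_epi8(x0, x1),
                                           _mm_add_epi8(x2, x3)), zero);
   return _mm_extract_epi16(sum, 0) + _mm_extract_epi16(sum, 4);
}

Since the unpack chain replaces the rotated weight table, dynamically built
tables - paddb of precalculated, xmm-resident arrays as mentioned above - can
stay in plain board order.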