Author: Anthony Cozzie
Date: 06:41:48 07/21/04
Go up one level in this thread
On July 21, 2004 at 05:16:00, Gerd Isenberg wrote: >On July 20, 2004 at 12:39:17, Anthony Cozzie wrote: > >>On July 20, 2004 at 12:31:59, Anthony Cozzie wrote: >> >>>What do you think of the following C code: >>> >>>int bb_dot_product(bitboard a, unsigned char *weights) >>>{ >>> bitboard t, t1, *_weights = weights; >>> static bitboard table[256] = {correct translations, e.g. 0xFF -> 0xffffffff} >>> >>> //we count on the compiler to unroll this loop. >>> for(i = 0; i < 8; i++, a ) { >>> t = table[(a >> i*8) & 0xFF] & weights[i]; >>> t1 = t; >>> t << 8; >>> sum += (t & 0x00FF00FF00FF00FF) + (t1 & 0x00FF00FF00FF00FF); >>> } >>> >>> sum = (sum & 0x0000FFFF0000FFFF) + ((sum >> 16) & 0x0000FFFF0000FFFF); >>> sum = ((sum >> 32) + sum & 0x00000000FFFFFFFF); >>> return sum; >>>} >>> >>>It has several advantages: Can use full 0-255 for each weight, the table does >>>not have to be rotated, and there is no penalty for moving between the integer >>>and MMX pipes. >>> >>>OTOH, this solution is also much less cache friendly, requiring maybe 2x the >>>number of instructions and also needed 2KB of data cache. >>> >>>anthony >> >>Could someone with a 64-bit MSVC compile this code and post the assembly? (and >>if they were really nice, benchmark it? :) >> >>anthony > >lets count instructions: > > 7 and > 7 shift for table index > > 8 mov table to gp > 8 and with weight > > 8 mov > 8 shift by 8 >16 and 0x00ff >16 add bytes > > 3 and > 2 shift > 2 add > >so about 85 instructions, may be a few more moves. >So i guess about 28-30 cycles in best case. >Not to mention the possible overhead of the callee, >e.g. saving/restoring some locals inside gp-registers, >due to higher register pressure, and some more cachelines >for the table access. > >For instance one may call inlined SSE2 dotProduct and some gp-Kogge-Stone filler >simultaniously e.g. to get more progressive attacks of safe attacks and to look >for some boolean properties such as rook trajetories to attack the opposite king > - or whatever. > >As Eugene demonstrates with some sample source, the new ms compiler is able to >inter-schedule independent instructions from different inline sources, like >three gp with one SSE2. > >As i already have designed this bitboard pair wrapper classes with >SSE2-intrinsics and gp-routines - and fill routines with <XMM> or <GP>-type >templates, i will implement both dotProducts in a similar way as class-members: > > XMM ra(rookattacks); > GP ba(bishopattacks); > bonus = ra.dotProduct(rookWeights) + ba.dotProduct(bishopWeights); > >Gerd Unfortunately I think only 1 integer instruction can be decoded if an SSE2 instruction is decoded, but it should definitely be possible for the decoder to "get ahead" of the instruction stream - it should take only 2 cycles to decode 3 SSE2 instructions. anthony
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.