Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: SSE2 bit[64] * byte[64] dot product

Author: Anthony Cozzie

Date: 06:41:48 07/21/04

Go up one level in this thread


On July 21, 2004 at 05:16:00, Gerd Isenberg wrote:

>On July 20, 2004 at 12:39:17, Anthony Cozzie wrote:
>
>>On July 20, 2004 at 12:31:59, Anthony Cozzie wrote:
>>
>>>What do you think of the following C code:
>>>
>>>int bb_dot_product(bitboard a, unsigned char *weights)
>>>{
>>>   bitboard t, t1, *_weights = weights;
>>>   static bitboard table[256] = {correct translations, e.g. 0xFF -> 0xffffffff}
>>>
>>>   //we count on the compiler to unroll this loop.
>>>   for(i = 0; i < 8; i++, a ) {
>>>      t = table[(a >> i*8) & 0xFF] & weights[i];
>>>      t1 = t;
>>>      t << 8;
>>>      sum += (t & 0x00FF00FF00FF00FF) + (t1 & 0x00FF00FF00FF00FF);
>>>   }
>>>
>>>   sum = (sum & 0x0000FFFF0000FFFF) + ((sum >> 16) & 0x0000FFFF0000FFFF);
>>>   sum = ((sum >> 32) + sum & 0x00000000FFFFFFFF);
>>>   return sum;
>>>}
>>>
>>>It has several advantages:  Can use full 0-255 for each weight, the table does
>>>not have to be rotated, and there is no penalty for moving between the integer
>>>and MMX pipes.
>>>
>>>OTOH, this solution is also much less cache friendly, requiring maybe 2x the
>>>number of instructions and also needed 2KB of data cache.
>>>
>>>anthony
>>
>>Could someone with a 64-bit MSVC compile this code and post the assembly? (and
>>if they were really nice, benchmark it? :)
>>
>>anthony
>
>lets count instructions:
>
> 7 and
> 7 shift for table index
>
> 8 mov table to gp
> 8 and with weight
>
> 8 mov
> 8 shift by 8
>16 and 0x00ff
>16 add bytes
>
> 3 and
> 2 shift
> 2 add
>
>so about 85 instructions, may be a few more moves.
>So i guess about 28-30 cycles in best case.
>Not to mention the possible overhead of the callee,
>e.g. saving/restoring some locals inside gp-registers,
>due to higher register pressure, and some more cachelines
>for the table access.
>
>For instance one may call inlined SSE2 dotProduct and some gp-Kogge-Stone filler
>simultaniously e.g. to get more progressive attacks of safe attacks and to look
>for some boolean properties such as rook trajetories to attack the opposite king
> - or whatever.
>
>As Eugene demonstrates with some sample source, the new ms compiler is able to
>inter-schedule independent instructions from different inline sources, like
>three gp with one SSE2.
>
>As i already have designed this bitboard pair wrapper classes with
>SSE2-intrinsics and gp-routines - and fill routines with <XMM> or <GP>-type
>templates, i will implement both dotProducts in a similar way as class-members:
>
>  XMM ra(rookattacks);
>  GP  ba(bishopattacks);
>  bonus = ra.dotProduct(rookWeights) + ba.dotProduct(bishopWeights);
>
>Gerd

Unfortunately I think only 1 integer instruction can be decoded if an SSE2
instruction is decoded, but it should definitely be possible for the decoder to
"get ahead" of the instruction stream - it should take only 2 cycles to decode 3
SSE2 instructions.

anthony



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.