Author: Gerd Isenberg
Date: 02:16:00 07/21/04
Go up one level in this thread
On July 20, 2004 at 12:39:17, Anthony Cozzie wrote:
>On July 20, 2004 at 12:31:59, Anthony Cozzie wrote:
>
>>What do you think of the following C code:
>>
>>int bb_dot_product(bitboard a, unsigned char *weights)
>>{
>> bitboard t, t1, *_weights = weights;
>> static bitboard table[256] = {correct translations, e.g. 0xFF -> 0xffffffff}
>>
>> //we count on the compiler to unroll this loop.
>> for(i = 0; i < 8; i++, a ) {
>> t = table[(a >> i*8) & 0xFF] & weights[i];
>> t1 = t;
>> t << 8;
>> sum += (t & 0x00FF00FF00FF00FF) + (t1 & 0x00FF00FF00FF00FF);
>> }
>>
>> sum = (sum & 0x0000FFFF0000FFFF) + ((sum >> 16) & 0x0000FFFF0000FFFF);
>> sum = ((sum >> 32) + sum & 0x00000000FFFFFFFF);
>> return sum;
>>}
>>
>>It has several advantages: Can use full 0-255 for each weight, the table does
>>not have to be rotated, and there is no penalty for moving between the integer
>>and MMX pipes.
>>
>>OTOH, this solution is also much less cache friendly, requiring maybe 2x the
>>number of instructions and also needed 2KB of data cache.
>>
>>anthony
>
>Could someone with a 64-bit MSVC compile this code and post the assembly? (and
>if they were really nice, benchmark it? :)
>
>anthony
lets count instructions:
7 and
7 shift for table index
8 mov table to gp
8 and with weight
8 mov
8 shift by 8
16 and 0x00ff
16 add bytes
3 and
2 shift
2 add
so about 85 instructions, may be a few more moves.
So i guess about 28-30 cycles in best case.
Not to mention the possible overhead of the callee,
e.g. saving/restoring some locals inside gp-registers,
due to higher register pressure, and some more cachelines
for the table access.
For instance one may call inlined SSE2 dotProduct and some gp-Kogge-Stone filler
simultaniously e.g. to get more progressive attacks of safe attacks and to look
for some boolean properties such as rook trajetories to attack the opposite king
- or whatever.
As Eugene demonstrates with some sample source, the new ms compiler is able to
inter-schedule independent instructions from different inline sources, like
three gp with one SSE2.
As i already have designed this bitboard pair wrapper classes with
SSE2-intrinsics and gp-routines - and fill routines with <XMM> or <GP>-type
templates, i will implement both dotProducts in a similar way as class-members:
XMM ra(rookattacks);
GP ba(bishopattacks);
bonus = ra.dotProduct(rookWeights) + ba.dotProduct(bishopWeights);
Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.