Author: Gerd Isenberg
Date: 11:05:33 07/21/04
On July 21, 2004 at 13:54:12, Gerd Isenberg wrote:

>On July 20, 2004 at 12:31:59, Anthony Cozzie wrote:
>
>>What do you think of the following C code:
>>
>>int bb_dot_product(bitboard a, unsigned char *weights)
>>{
>>  bitboard t, t1, *_weights = weights;
>>  static bitboard table[256] = {correct translations, e.g. 0xFF -> 0xffffffff}
>>
>>  //we count on the compiler to unroll this loop.
>>  for(i = 0; i < 8; i++, a ) {
>>    t = table[(a >> i*8) & 0xFF] & weights[i];
>>    t1 = t;
>>    t << 8;
>>    sum += (t & 0x00FF00FF00FF00FF) + (t1 & 0x00FF00FF00FF00FF);
>>  }
>>
>>  sum = (sum & 0x0000FFFF0000FFFF) + ((sum >> 16) & 0x0000FFFF0000FFFF);
>>  sum = ((sum >> 32) + sum & 0x00000000FFFFFFFF);
>>  return sum;
>>}
>>
>>It has several advantages: Can use full 0-255 for each weight, the table does
>>not have to be rotated, and there is no penalty for moving between the integer
>>and MMX pipes.
>>
>>OTOH, this solution is also much less cache friendly, requiring maybe 2x the
>>number of instructions and also needing 2KB of data cache.
>>
>>anthony
>
>Probably some minor improvements.
>Save the inner shift by building two intermediate results.
>Is the byte access worth it instead of shift/and?
>On x86-32 it was.
>
>  unsigned char *bptr = (unsigned char *) &a;
>  sum0 = 0, sum1 = 0;
>  for(i = 0; i < 8; i++, bptr++ ) {
>    t = table[*bptr] & weights[i];
>    sum0 += t & 0x00FF00FF00FF00FF;
>    sum1 += t & 0xFF00FF00FF00FF00;

OK, that doesn't work correctly, because the highest byte of sum1 has no
overflow bits above it. Therefore the >> 8 is necessary inside the loop!

>  }
>  sum = sum0 + (sum1>>8);
>  sum = (sum & 0x0000FFFF0000FFFF) + ((sum >> 16) & 0x0000FFFF0000FFFF);
>  sum = ((sum >> 32) + sum & 0x00000000FFFFFFFF);
>  return sum;
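
For illustration, here is a minimal sketch of the routine with the >> 8 kept inside the loop, so that every partial sum sits in a 16-bit lane with room to carry. It is only one possible reading of the code quoted above, under several assumptions: bitboard is taken to be a 64-bit unsigned integer, weights is taken to hold 64 bytes (one per square, eight per loop step, which is what the _weights pointer appears to be for), and the names expand_table, init_expand_table and load_le64 are made up for this example and do not come from the thread. The table is filled at run time instead of with a static initializer.

```c
#include <stdint.h>

typedef uint64_t bitboard;   /* assumption: 64-bit unsigned bitboard */

/* Hypothetical name. Bit k of the index expands to byte k of the entry:
   0xFF if the bit is set, 0x00 otherwise (e.g. 0xFF -> 0xFFFFFFFFFFFFFFFF). */
static bitboard expand_table[256];

static void init_expand_table(void)
{
    for (int b = 0; b < 256; b++) {
        bitboard m = 0;
        for (int k = 0; k < 8; k++)
            if (b & (1 << k))
                m |= (bitboard)0xFF << (8 * k);
        expand_table[b] = m;
    }
}

/* Hypothetical helper: pack 8 weights so that p[k] lands in byte k,
   independent of the host byte order. */
static bitboard load_le64(const unsigned char *p)
{
    bitboard w = 0;
    for (int k = 0; k < 8; k++)
        w |= (bitboard)p[k] << (8 * k);
    return w;
}

/* Sum of weights[sq] over all squares sq whose bit is set in a. */
static int bb_dot_product(bitboard a, const unsigned char weights[64])
{
    const bitboard lanes = 0x00FF00FF00FF00FFULL;
    bitboard sum0 = 0, sum1 = 0;

    for (int i = 0; i < 8; i++) {
        /* Keep the weights of the occupied squares in byte row i. */
        bitboard t = expand_table[(a >> (8 * i)) & 0xFF]
                   & load_le64(weights + 8 * i);
        sum0 += t & lanes;          /* even bytes, already in 16-bit lanes */
        sum1 += (t >> 8) & lanes;   /* odd bytes, shifted down before adding */
    }

    /* Each 16-bit lane now holds at most 2 * 8 * 255 = 4080, so no carry
       crosses a lane boundary. Fold four lanes into two, then into one. */
    bitboard sum = sum0 + sum1;
    sum = (sum & 0x0000FFFF0000FFFFULL) + ((sum >> 16) & 0x0000FFFF0000FFFFULL);
    sum = ((sum >> 32) + sum) & 0x00000000FFFFFFFFULL;
    return (int)sum;
}
```

init_expand_table() has to be called once before the first bb_dot_product() call. Keeping sum0 and sum1 separate still breaks the dependency chain on a single accumulator, but the shift itself is not saved; whether that is a win over the single-sum loop would need to be measured.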
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.