Author: Gerd Isenberg
Date: 03:27:47 05/29/04
Go up one level in this thread
On May 28, 2004 at 17:13:44, Dieter Buerssner wrote:
>>Nice, is it faster?
>
>I did not test. I guess, your method will be a bit faster. I had hoped, you will
>try it out. I really do not have an application, that could use a population
>count for 10 64-bit words. You seem to have one ...
Ok, i postpone my saturday morning business a bit and i did a dumb loop test on
my Athlon32 box, msc6 speed optimization.
The first time reported includes determineAttackCounts but only one popCount
instead of four:
only popCount inlined:
disjointAtta[0] 0x00261E2752F60985
disjointAtta[1] 0x22972E1520AD7E1D
disjointAtta[2] 0x28D2779416DD6DC4
disjointAtta[3] 0x0476011950393E31
disjointAtta[4] 0x22F166AD0BB53958
disjointAtta[5] 0x51F07C9354976532
disjointAtta[6] 0x4819052B70D108C0
disjointAtta[7] 0x25FD7E16098E024E
disjointAtta[8] 0x0348489B420B52F5
disjointAtta[9] 0x5C3B314930A80363
==================================
attackCounts[0] 0x4F835E6EBC07F91E
attackCounts[1] 0x5F6024302D5B024A
attackCounts[2] 0x20EF7F9E523D7FF5
attackCounts[3] 0x0010000100800000
time determineAttackCounts = 73.345 sec/10**9 runs
time 8,4,2,1-popCount = 99.924 sec/10**9 runs
time Dieter's pc_10bb = 85.292 sec/10**9 runs
pc1=-1298309357 pcg=-2126923520 pcd=-2126923520
determineAttackCounts, popCount and pc_10bb inlined:
disjointAtta[0] 0x00261E2752F60985
disjointAtta[1] 0x22972E1520AD7E1D
disjointAtta[2] 0x28D2779416DD6DC4
disjointAtta[3] 0x0476011950393E31
disjointAtta[4] 0x22F166AD0BB53958
disjointAtta[5] 0x51F07C9354976532
disjointAtta[6] 0x4819052B70D108C0
disjointAtta[7] 0x25FD7E16098E024E
disjointAtta[8] 0x0348489B420B52F5
disjointAtta[9] 0x5C3B314930A80363
==================================
attackCounts[0] 0x4F835E6EBC07F91E
attackCounts[1] 0x5F6024302D5B024A
attackCounts[2] 0x20EF7F9E523D7FF5
attackCounts[3] 0x0010000100800000
time determineAttackCounts = 28.941 sec/10**9 runs
time 8,4,2,1-popCount = 54.028 sec/10**9 runs
time Dieter's pc_10bb = 85.122 sec/10**9 runs
pc1=-1298309357 pcg=-2126923520 pcd=-2126923520
Hmm, those dump loop tests are really not reliable!
Should i try performance counter and cpuid?
Anyway determineAttackCounts(10) takes 34 (6*5 + 2*2) bitwise 64-bit
instructions that are 68 32-bit instructions! And on x86-32 there are a lot of
store/loads with local vars. That code cries for 64-bit and a few more registers
as well.
One 64-bit popcount takes about 21 32-bit instructions.
But the O(log(n)) effect of using determineAttackCounts is probably smaller, due
to the number of odd,maj pairs (10,4 32-bit instructions) increases
exponentially as well, eg. for saving 6 popcounts it takes eight odd/maj pairs.
My primary intention was not aggregated bitcounting per se, but to get those
2**0,2**1,2**2 and 2**3 digit bitboards, where each bit on the same bitposition
represents a 4-digit number from 0..10 in this case.
Cheers,
Gerd
Here the main routine to produce the above results.
#define MAX_ITERATIONS 1000000000 // 10**9
int main(int argc, char* argv[])
{
BitBoard disjointAtta[10]; // up to 15
BitBoard attackCounts[4];
clock_t start0, stop0, start1, stop1, start2, stop2;
int i, pc1 = 0, pcg = 0, pcd = 0;
initDisjointAttacks(disjointAtta);
for (i = 0; i < 10; ++i)
{
pc1 += popCount(disjointAtta[i]);
printf("disjointAtta[%d] 0x%08X%08X\n", i,
high(disjointAtta[i]), low(disjointAtta[i]));
}
start0 = clock();
for (i = 0 ; i < MAX_ITERATIONS; i++)
{
determineAttackCounts(attackCounts, disjointAtta);
pc1 += popCount(attackCounts[3]);
disjointAtta[0]++;
}
stop0 = clock();
initDisjointAttacks(disjointAtta);
start1 = clock();
for (i = 0 ; i < MAX_ITERATIONS; i++)
{
determineAttackCounts(attackCounts, disjointAtta);
pcg += popCount(attackCounts[0]);
pcg += popCount(attackCounts[1])*2;
pcg += popCount(attackCounts[2])*4;
pcg += popCount(attackCounts[3])*8;
disjointAtta[0]++;
}
stop1 = clock();
initDisjointAttacks(disjointAtta);
start2 = clock();
for (i = 0 ; i < MAX_ITERATIONS; i++)
{
pcd += pc_10bb(disjointAtta);
disjointAtta[0]++;
}
stop2 = clock();
printf("==================================\n");
for (i = 0; i < 4; ++i)
{
printf("attackCounts[%d] 0x%08X%08X\n", i,
high(attackCounts[i]), low(attackCounts[i]));
}
printf("time determineAttackCounts = %.3f sec/10**9 runs\n", (float)(stop0 -
start0) / CLOCKS_PER_SEC);
printf("time 8,4,2,1-popCount = %.3f sec/10**9 runs\n", (float)(stop1 -
start1) / CLOCKS_PER_SEC);
printf("time Dieter's pc_10bb = %.3f sec/10**9 runs\n", (float)(stop2 -
start2) / CLOCKS_PER_SEC);
printf("pc1=%d pcg=%d pcd=%d\n", pc1, pcg, pcd);
getchar();
return 0;
}
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.