Computer Chess Club Archives

Search

Terms

Messages

Subject: Re: PopCount optimization

Author: Gerd Isenberg

Date: 08:25:49 03/11/04

On March 11, 2004 at 10:31:22, milix wrote:

>Hello to everyone.
>I'm trying to optimize various parts of my bitboard engine and I found an
>optimization for my popcount() routine. I was using the routine:
>
>int popcount_1(const bitboard bb) {
>   register int cnt=0;
>   register bitboard a=bb;
>   while (a) {
>      cnt += bytecount[a & 0xff];
>      a >>= 8;
>   }
>   return cnt;
>}
>
>where bytecount is a looup table for byte counts.
>
>I changed this routine a bit to
>
>int popcount_2(const bitboard bb) {
>   register int cnt=0;
>   unsigned long a1 = *((unsigned long*)&bb);
>   unsigned long a2 = *(((unsigned long*)&bb)+1);
>   while (a1) {
>      cnt += bytecount[a1 & 0xff];
>      a1 >>= 8;
>   }
>   while (a2) {
>      cnt += bytecount[a2 & 0xff];
>      a2 >>= 8;
>   }
>   return cnt;
>}
>
>and the routine popcount_2 was 2x times faster than popcount_1. It is even
>faster than the assembly routine provided by AMD for popcounts. I tested the
>routines using 1000 randomly generated bitboards.
>Do I miss something here or it is an improvement? Have anyone else tried a
>similar idea?
>
>My computer specs are:
>AMD Athlon XP 1600+
>512 MB RAM

If you look to the generated assembly, it became obvious:
The 64 bit-shift right 8. SHRD is 6-cycle Vector path instruction:

	shrd	ecx, edx, 8
	shr	edx, 8

I still found the loopless SWAR-popcount faster inside my chess program.

Gerd

This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.