Author: Gerd Isenberg
Date: 08:25:49 03/11/04
Go up one level in this thread
On March 11, 2004 at 10:31:22, milix wrote: >Hello to everyone. >I'm trying to optimize various parts of my bitboard engine and I found an >optimization for my popcount() routine. I was using the routine: > >int popcount_1(const bitboard bb) { > register int cnt=0; > register bitboard a=bb; > while (a) { > cnt += bytecount[a & 0xff]; > a >>= 8; > } > return cnt; >} > >where bytecount is a looup table for byte counts. > >I changed this routine a bit to > >int popcount_2(const bitboard bb) { > register int cnt=0; > unsigned long a1 = *((unsigned long*)&bb); > unsigned long a2 = *(((unsigned long*)&bb)+1); > while (a1) { > cnt += bytecount[a1 & 0xff]; > a1 >>= 8; > } > while (a2) { > cnt += bytecount[a2 & 0xff]; > a2 >>= 8; > } > return cnt; >} > >and the routine popcount_2 was 2x times faster than popcount_1. It is even >faster than the assembly routine provided by AMD for popcounts. I tested the >routines using 1000 randomly generated bitboards. >Do I miss something here or it is an improvement? Have anyone else tried a >similar idea? > >My computer specs are: >AMD Athlon XP 1600+ >512 MB RAM If you look to the generated assembly, it became obvious: The 64 bit-shift right 8. SHRD is 6-cycle Vector path instruction: shrd ecx, edx, 8 shr edx, 8 I still found the loopless SWAR-popcount faster inside my chess program. Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.