Author: Matt Taylor
Date: 02:33:42 12/06/02
Go up one level in this thread
On December 06, 2002 at 00:49:11, Walter Faxon wrote: >On December 02, 2002 at 20:27:39, Gerd Isenberg wrote: > >>Ahh.. i see, >> >>a special kind of popcount for consecutive trailing "ones". >>I love this interactive programming lessions. >> >>Gerd >> >>Btw.: Impressive assembler listing: >> >>00401003 8B 4C 24 0C mov ecx,dword ptr [esp+0Ch] >>00401007 8B 31 mov esi,dword ptr [ecx] >>00401009 8B 79 04 mov edi,dword ptr [ecx+4] >>0040100C 8B D6 mov edx,esi >>0040100E 8B C7 mov eax,edi >>00401010 83 C2 FF add edx,0FFFFFFFFh >>00401013 83 D0 FF adc eax,0FFFFFFFFh >>00401016 23 F2 and esi,edx >>00401018 23 F8 and edi,eax >>0040101A 89 31 mov dword ptr [ecx],esi >>0040101C 89 79 04 mov dword ptr [ecx+4],edi >>0040101F 8B CF mov ecx,edi >>00401021 33 D6 xor edx,esi >>00401023 33 C1 xor eax,ecx >>00401025 33 C2 xor eax,edx >>00401027 5F pop edi >>00401028 35 81 FC C5 01 xor eax,1C5FC81h >>0040102D 5E pop esi >>0040102E 8B D0 mov edx,eax >>00401030 C1 EA 10 shr edx,10h >>00401033 03 C2 add eax,edx >>00401035 33 D2 xor edx,edx >>00401037 8B C8 mov ecx,eax >>00401039 C1 E9 08 shr ecx,8 >>0040103C 2A C1 sub al,cl >>0040103E 25 FF 00 00 00 and eax,0FFh >>00401043 8A 90 D5 62 40 00 mov dl,byte ptr [eax+4062D5h] >>00401049 8B C2 mov eax,edx >>0040104B C3 ret > > >Hi, Gerd. > >I have noticed that most compilers aren't too smart where it comes to bitwise >dataflow analysis. After the "add eax, edx" instruction above, better (well, >shorter!) code would be: > > sub al, ah > and eax, 0FFh > mov al, byte ptr [eax+4062D5h] > ret > >Perhaps if we added a little "hint" in the last 2 lines of source: > >inline // inline declaration may differ by compiler >u8 LSB_64( u64* bb ) > { > u64 t64; > u32 t32; > u8 t8; > t64 = *bb - 1; > *bb &= t64; // omit this line to retain current LSB > t64 ^= *bb; > t32 = (u32)t64 ^ (u32)(t64 >> 32); > t32 ^= LSB_64_magic; > t32 += t32 >> 16; > t8 = (u8)t32 - (u8)(t32 >> 8); > return LSB_64_table [LSB_64_adj + t8]; > } > >Of course, this really only helps because we already know that the x86 class can >independently address both the first and 2nd bytes of a register. And it's >wishful thinking that much better asm code will actually be produced! I'd be >interested to see. > >If/when you run your test suite again, please try this version. Thanks. > >-- Walter It may run faster, but it's hard to say just by looking. The superscalar x86 processors end up doing register remapping against some ~40 internal registers. The P6 core (PPro/Pentium 2/Pentium 3) has real issues, and the subtraction there may cause some stalls. Athlon doesn't suffer from such defects to my knowledge. It's also a tad strange that the code loads dl and then copies edx into eax. It would be more direct to simply store the table value in eax. Here's my timing data. These are the number of clocks on my AthlonMP 1600 to execute the loop: while(bb != 0) { BitScanAndReset(); } (Not all the headings make sense because I reused some timing code that I was using to measure other things.) The execution time is listed as min-max/ave. Baseline is the timer latency bias. bsf, pi2fd, and btr are algorithm posted previously gerd is the assembly listing Gerd gave for your function vc++ is the assembly listing the VC 7 compiler gave me me is my own hand-tweaked assembly listing bb = 0x1111113300000000 Type Instruction Set Total Time Execution Time ***************************************************** Baseline N/A 105 N/A bsf Integer 1280 1175-1175/1175 pi2fd Integer 1420 1315-1315/1315 btr Integer 1363 1235-1263/1258 gerd Integer 1240 1135-1135/1135 vc++ Integer 1240 1135-1135/1135 me Integer 1240 1135-1135/1135 bb = 0x0000000011111133 Type Instruction Set Total Time Execution Time ***************************************************** Baseline N/A 105 N/A bsf Integer 1181 1075-1103/1076 pi2fd Integer 1420 1315-1315/1315 btr Integer 1366 1235-1298/1261 gerd Integer 1240 1135-1135/1135 vc++ Integer 1240 1135-1140/1135 me Integer 1240 1135-1135/1135 bb = 0x1010111010101110 Type Instruction Set Total Time Execution Time ***************************************************** Baseline N/A 105 N/A bsf Integer 1247 1136-1201/1142 pi2fd Integer 1420 1315-1315/1315 btr Integer 1365 1235-1298/1260 gerd Integer 1240 1135-1135/1135 vc++ Integer 1240 1135-1135/1135 me Integer 1240 1135-1135/1135 -Matt
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.