Author: Gerd Isenberg
Date: 16:14:18 12/10/02
On December 10, 2002 at 15:57:13, Matt Taylor wrote:

>On December 06, 2002 at 00:49:11, Walter Faxon wrote:
>
>>On December 02, 2002 at 20:27:39, Gerd Isenberg wrote:
>>
>>>Ahh.. i see,
>>>
>>>a special kind of popcount for consecutive trailing "ones".
>>>I love these interactive programming lessons.
>>>
>>>Gerd
>>>
>>>Btw.: Impressive assembler listing:
>>>
>>>00401003 8B 4C 24 0C        mov  ecx,dword ptr [esp+0Ch]
>>>00401007 8B 31              mov  esi,dword ptr [ecx]
>>>00401009 8B 79 04           mov  edi,dword ptr [ecx+4]
>>>0040100C 8B D6              mov  edx,esi
>>>0040100E 8B C7              mov  eax,edi
>>>00401010 83 C2 FF           add  edx,0FFFFFFFFh
>>>00401013 83 D0 FF           adc  eax,0FFFFFFFFh
>>>00401016 23 F2              and  esi,edx
>>>00401018 23 F8              and  edi,eax
>>>0040101A 89 31              mov  dword ptr [ecx],esi
>>>0040101C 89 79 04           mov  dword ptr [ecx+4],edi
>>>0040101F 8B CF              mov  ecx,edi
>>>00401021 33 D6              xor  edx,esi
>>>00401023 33 C1              xor  eax,ecx
>>>00401025 33 C2              xor  eax,edx
>>>00401027 5F                 pop  edi
>>>00401028 35 81 FC C5 01     xor  eax,1C5FC81h
>>>0040102D 5E                 pop  esi
>>>0040102E 8B D0              mov  edx,eax
>>>00401030 C1 EA 10           shr  edx,10h
>>>00401033 03 C2              add  eax,edx
>>>00401035 33 D2              xor  edx,edx
>>>00401037 8B C8              mov  ecx,eax
>>>00401039 C1 E9 08           shr  ecx,8
>>>0040103C 2A C1              sub  al,cl
>>>0040103E 25 FF 00 00 00     and  eax,0FFh
>>>00401043 8A 90 D5 62 40 00  mov  dl,byte ptr [eax+4062D5h]
>>>00401049 8B C2              mov  eax,edx
>>>0040104B C3                 ret
>>
>>Hi, Gerd.
>>
>>I have noticed that most compilers aren't too smart when it comes to bitwise
>>dataflow analysis. After the "add eax,edx" instruction above, better (well,
>>shorter!) code would be:
>>
>>  sub  al, ah
>>  and  eax, 0FFh
>>  mov  al, byte ptr [eax+4062D5h]
>>  ret
>>
>>Perhaps if we added a little "hint" in the last two lines of source:
>>
>>inline  // inline declaration may differ by compiler
>>u8 LSB_64( u64* bb )
>>{
>>    u64 t64;
>>    u32 t32;
>>    u8  t8;
>>    t64 = *bb - 1;
>>    *bb &= t64;  // omit this line to retain current LSB
>>    t64 ^= *bb;
>>    t32 = (u32)t64 ^ (u32)(t64 >> 32);
>>    t32 ^= LSB_64_magic;
>>    t32 += t32 >> 16;
>>    t8  = (u8)t32 - (u8)(t32 >> 8);
>>    return LSB_64_table[LSB_64_adj + t8];
>>}
>>
>>Of course, this really only helps because we already know that the x86 class
>>can independently address both the first and second bytes of a register. And
>>it's wishful thinking that much better asm code will actually be produced!
>>I'd be interested to see.
>>
>>If/when you run your test suite again, please try this version. Thanks.
>>
>>-- Walter
>
>I wanted to procrastinate studying for my finals, so I was writing this
>routine using MMX. MMX has an 8-byte word size, and some of these operations
>(like the xor) become atomic. The only problem is that I can't do 64-bit
>subtraction. Elsewhere the CPU makes emulation of large arithmetic quite
>easy by allowing one to add in (or subtract out) the carry bit, but MMX has
>no easy way of doing that.
>
>(As a side note, the latest extensions to SSE include the ability to do
>atomic 64-bit arithmetic in MMX registers, too, but since I do not have a
>P4, I'm not interested in exploring that route at present.)
>
>However, it dawned on me that perhaps I don't -need- to generate a carry. I
>was looking at the first three lines:
>
>>  t64 = *bb - 1;
>>  *bb &= t64;  // omit this line to retain current LSB
>>  t64 ^= *bb;
>
>It seems to me that this would create a mask in t64 of 2^LSB - 1, right?
>
>-Matt

Hi Matt,

my fastest so far for bb - 1:

                     // bb in mm0
pxor      mm2, mm2   ; mm2 = 0
pcmpeqd   mm3, mm3   ; mm3 = -1 : -1
pcmpeqd   mm2, mm0   ; mm2 = -(hi == 0) : -(lo == 0)
punpckldq mm3, mm2   ; mm3 = -(lo == 0) : -1
paddd     mm0, mm3   ; mm0 = bb - 1, borrow folded into the high dword

Gerd