Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Possible small improvement to hacky method

Author: Matt Taylor

Date: 02:33:42 12/06/02

Go up one level in this thread


On December 06, 2002 at 00:49:11, Walter Faxon wrote:

>On December 02, 2002 at 20:27:39, Gerd Isenberg wrote:
>
>>Ahh.. i see,
>>
>>a special kind of popcount for consecutive trailing "ones".
>>I love this interactive programming lessions.
>>
>>Gerd
>>
>>Btw.: Impressive assembler listing:
>>
>>00401003 8B 4C 24 0C          mov         ecx,dword ptr [esp+0Ch]
>>00401007 8B 31                mov         esi,dword ptr [ecx]
>>00401009 8B 79 04             mov         edi,dword ptr [ecx+4]
>>0040100C 8B D6                mov         edx,esi
>>0040100E 8B C7                mov         eax,edi
>>00401010 83 C2 FF             add         edx,0FFFFFFFFh
>>00401013 83 D0 FF             adc         eax,0FFFFFFFFh
>>00401016 23 F2                and         esi,edx
>>00401018 23 F8                and         edi,eax
>>0040101A 89 31                mov         dword ptr [ecx],esi
>>0040101C 89 79 04             mov         dword ptr [ecx+4],edi
>>0040101F 8B CF                mov         ecx,edi
>>00401021 33 D6                xor         edx,esi
>>00401023 33 C1                xor         eax,ecx
>>00401025 33 C2                xor         eax,edx
>>00401027 5F                   pop         edi
>>00401028 35 81 FC C5 01       xor         eax,1C5FC81h
>>0040102D 5E                   pop         esi
>>0040102E 8B D0                mov         edx,eax
>>00401030 C1 EA 10             shr         edx,10h
>>00401033 03 C2                add         eax,edx
>>00401035 33 D2                xor         edx,edx
>>00401037 8B C8                mov         ecx,eax
>>00401039 C1 E9 08             shr         ecx,8
>>0040103C 2A C1                sub         al,cl
>>0040103E 25 FF 00 00 00       and         eax,0FFh
>>00401043 8A 90 D5 62 40 00    mov         dl,byte ptr [eax+4062D5h]
>>00401049 8B C2                mov         eax,edx
>>0040104B C3                   ret
>
>
>Hi, Gerd.
>
>I have noticed that most compilers aren't too smart where it comes to bitwise
>dataflow analysis.  After the "add  eax, edx" instruction above, better (well,
>shorter!) code would be:
>
>    sub    al, ah
>    and    eax, 0FFh
>    mov    al, byte ptr [eax+4062D5h]
>    ret
>
>Perhaps if we added a little "hint" in the last 2 lines of source:
>
>inline                  // inline declaration may differ by compiler
>u8 LSB_64( u64* bb )
>    {
>    u64 t64;
>    u32 t32;
>    u8 t8;
>    t64 = *bb - 1;
>    *bb &= t64;         // omit this line to retain current LSB
>    t64 ^= *bb;
>    t32 = (u32)t64 ^ (u32)(t64 >> 32);
>    t32 ^= LSB_64_magic;
>    t32 += t32 >> 16;
>    t8 = (u8)t32 - (u8)(t32 >> 8);
>    return LSB_64_table [LSB_64_adj + t8];
>    }
>
>Of course, this really only helps because we already know that the x86 class can
>independently address both the first and 2nd bytes of a register.  And it's
>wishful thinking that much better asm code will actually be produced!  I'd be
>interested to see.
>
>If/when you run your test suite again, please try this version.  Thanks.
>
>-- Walter

It may run faster, but it's hard to say just by looking. The superscalar x86
processors end up doing register remapping against some ~40 internal registers.
The P6 core (PPro/Pentium 2/Pentium 3) has real issues, and the subtraction
there may cause some stalls. Athlon doesn't suffer from such defects to my
knowledge.

It's also a tad strange that the code loads dl and then copies edx into eax. It
would be more direct to simply store the table value in eax.

Here's my timing data. These are the number of clocks on my AthlonMP 1600 to
execute the loop:
	while(bb != 0)
	{
		BitScanAndReset();
	}

(Not all the headings make sense because I reused some timing code that I was
using to measure other things.) The execution time is listed as min-max/ave.

Baseline is the timer latency bias.
bsf, pi2fd, and btr are algorithm posted previously
gerd is the assembly listing Gerd gave for your function
vc++ is the assembly listing the VC 7 compiler gave me
me is my own hand-tweaked assembly listing

bb = 0x1111113300000000

Type      Instruction Set  Total Time  Execution Time
*****************************************************
Baseline  N/A              105         N/A
bsf       Integer          1280        1175-1175/1175
pi2fd     Integer          1420        1315-1315/1315
btr       Integer          1363        1235-1263/1258
gerd      Integer          1240        1135-1135/1135
vc++      Integer          1240        1135-1135/1135
me        Integer          1240        1135-1135/1135

bb = 0x0000000011111133

Type      Instruction Set  Total Time  Execution Time
*****************************************************
Baseline  N/A              105         N/A
bsf       Integer          1181        1075-1103/1076
pi2fd     Integer          1420        1315-1315/1315
btr       Integer          1366        1235-1298/1261
gerd      Integer          1240        1135-1135/1135
vc++      Integer          1240        1135-1140/1135
me        Integer          1240        1135-1135/1135

bb = 0x1010111010101110

Type      Instruction Set  Total Time  Execution Time
*****************************************************
Baseline  N/A              105         N/A
bsf       Integer          1247        1136-1201/1142
pi2fd     Integer          1420        1315-1315/1315
btr       Integer          1365        1235-1298/1260
gerd      Integer          1240        1135-1135/1135
vc++      Integer          1240        1135-1135/1135
me        Integer          1240        1135-1135/1135

-Matt



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.