Computer Chess Club Archives


Subject: Re: Opteron Instruction Set

Author: Dieter Buerssner

Date: 13:19:07 02/03/04


On February 02, 2004 at 23:14:12, Omid David Tabibi wrote:

>Currently, my only critical assembly parts are the following:

[...]
>__forceinline UINT32 countBitsTrue(UINT32 data) {
>	__asm {
>		mov		ecx, dword ptr data
>		xor		eax, eax
>		test	ecx, ecx
>		jz		l1
>	l0:	lea		edx, [ecx-1]
>		inc		eax
>		and		ecx, edx
>		jnz		l0
>	l1:
>	}
>}

Why are you using inline assembly for this? Any decent C compiler will come up
with similar code without inline assembly. With MSVC, the C code (when inlined)
will probably be faster than the inline assembly. It won't need the

mov		ecx, dword ptr data

in many cases, because data may already be in a register. Register names are
not hardcoded, which gives the compiler more freedom to optimize. With Gcc
things are slightly different (you don't need to hardcode registers), but I still
cannot imagine any advantage of inline assembly in this case. And of course
porting to a 64-bit computer will be no issue when using C.

For example:

typedef unsigned long UINT32;

int countBitsTrue(UINT32 data)
{
  UINT32 w = data;
  int ret=0;
  if (w)
    do
      ret++;
    while ((w &= w-1) != 0);
  return ret;
}


cl -O2 -Fa popc32.c produces the following assembly:

        TITLE   popc32.c
        .386P
include listing.inc
if @Version gt 510
.model FLAT
else
_TEXT   SEGMENT PARA USE32 PUBLIC 'CODE'
_TEXT   ENDS
_DATA   SEGMENT DWORD USE32 PUBLIC 'DATA'
_DATA   ENDS
CONST   SEGMENT DWORD USE32 PUBLIC 'CONST'
CONST   ENDS
_BSS    SEGMENT DWORD USE32 PUBLIC 'BSS'
_BSS    ENDS
_TLS    SEGMENT DWORD USE32 PUBLIC 'TLS'
_TLS    ENDS
;       COMDAT _countBitsTrue
_TEXT   SEGMENT PARA USE32 PUBLIC 'CODE'
_TEXT   ENDS
FLAT    GROUP _DATA, CONST, _BSS
        ASSUME  CS: FLAT, DS: FLAT, SS: FLAT
endif
PUBLIC  _countBitsTrue
;       COMDAT _countBitsTrue
_TEXT   SEGMENT
_data$ = 8
_countBitsTrue PROC NEAR                                ; COMDAT
; File popc32.c
; Line 5
        mov     ecx, DWORD PTR _data$[esp-4]
; Line 6
        xor     eax, eax
; Line 7
        test    ecx, ecx
        je      SHORT $L96
$L94:
; Line 10
        lea     edx, DWORD PTR [ecx-1]
        inc     eax
        and     ecx, edx
        jne     SHORT $L94
$L96:
; Line 12
        ret     0
_countBitsTrue ENDP
_TEXT   ENDS
END

It is practically identical to your inline assembly! It should never be slower.
Gcc behaves similarly (it produces more or less the same code). I think one should
stay away from inline assembly for this. I had some inline assembly in Yace for
a while, but got rid of all of it. I would probably use it for FirstBit, but I
don't really need that. For a 64-bit version of popcount on 32-bit hardware that
runs (probably at least) as fast as the same algorithm with inline assembly, see
http://f11.parsimony.net/forum16635/messages/31324.htm
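
To give a rough idea of what a 64-bit count can look like in plain C, here is a
minimal sketch. It is not the code from the message linked above. It simply
assumes the same clear-the-lowest-bit loop is applied to each 32-bit half, and
the names popcount32 and popcount64 as well as the <stdint.h> types are my own
choices for illustration.

```c
#include <stdint.h>

/* Same idea as countBitsTrue: clear the lowest set bit until none remain.
   The loop runs once per set bit. */
static int popcount32(uint32_t w)
{
  int ret = 0;
  while (w) {
    w &= w - 1;   /* removes the lowest set bit */
    ret++;
  }
  return ret;
}

/* Hypothetical helper (assumption, not from the linked post): count each
   32-bit half separately, so a 32-bit compiler can keep every operand in
   a single register instead of emulating 64-bit arithmetic. */
static int popcount64(uint64_t data)
{
  return popcount32((uint32_t)(data >> 32)) + popcount32((uint32_t)data);
}
```

Whether splitting into halves actually beats a straight 64-bit loop depends on
the compiler and the typical number of set bits, so it is worth measuring.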

Regards,
Dieter




Last modified: Thu, 15 Apr 21 08:11:13 -0700
