Computer Chess Club Archives

Subject: Re: just another reverse bitscan

Author: Gerd Isenberg

Date: 10:45:44 12/23/05

On December 23, 2005 at 12:25:31, Frederik Tack wrote:

>On December 23, 2005 at 10:49:00, Gerd Isenberg wrote:
>
>>>About the assembler routine being slower :
>>>I found that using a call by reference instead of a call by value speeds up the
>>>function in Delphi a great deal because you get the address of the bitboard
>>>directly in the EAX register, don't know about C though but perhaps it will be
>>>the same or perhaps it's just a thing of the Delphi compiler.
>>
>>
>>Yes, but even if the pointer is passed via a register (fastcall calling
>>convention in C is via ECX or EDX in 32-bit mode) your bitboard has to be in
>>memory - even if it is already in a register pair.
>
>That's true, it has to be a variable that is passed. If I have to scan a
>constant bitboard or a result of a logical expression, I use another function.
>Most of the time, though, it's a variable.
>
>>
>>Does Delphi inline short functions? Or is it always a function call?
>>
>
>Not sure what you mean with 'short function'. I'm not familiar with C.


Something like up to two cache lines, or 128 bytes.
AMD gives a threshold of about 25 machine instructions:

------------------------------------------------------------------------------
Software Optimization Guide
for AMD64 Processors
25112 Rev. 3.06 September 2005

7.3 Inline Functions

Optimization
Use function inlining when:
• A function is called from just one site in the code. (For the C language,
determination of this characteristic is made easier if functions are explicitly
declared static unless they require external linkage.)
• A function—once inlined—contains fewer than 25 machine instructions.

Application
This optimization applies to:
• 32-bit software
• 64-bit software

Rationale
There are advantages and disadvantages to function inlining. On the one hand,
function inlining eliminates function-call overhead and allows better register
allocation and instruction scheduling at the site of the function call. The
disadvantage of function inlining is decreased code reference locality, which
can increase execution time due to instruction cache misses. For functions that
create fewer than 25 machine instructions once inlined, it is likely that the
function-call overhead is close to, or more than, the time spent executing the
function body. In these cases, function inlining is recommended.

Function-call overhead on the AMD Athlon 64 and AMD Opteron processors can be
low because calls and returns are executed very quickly due to the use of
prediction mechanisms. However, there is still overhead due to passing function
arguments through memory, which creates store-to-load-forwarding dependencies.
(In 64-bit mode, this overhead is typically avoided by passing more
arguments in registers, as specified in the AMD64 Application Binary Interface
[ABI] for the operating system.) For longer functions, the benefits of reduced
function-call overhead give diminishing returns. A function that results in the
insertion of more than 500 machine instructions at the call site should probably
not be inlined. Some larger functions might consist of multiple, relatively
short paths that are negatively affected by function overhead. In such a case,
it can be advantageous to inline larger functions. Profiling information is the
best guide in determining whether to inline such large functions.
------------------------------------------------------------------------------
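As a small sketch of that advice (a hypothetical helper, not from the post): a
short bitboard predicate declared static, so the compiler sees every call site
and may inline it freely; once inlined it stays far below the 25-instruction
bound.

  typedef unsigned long long BitBoard;  // assumed 64-bit bitboard type

  // tiny helper, a candidate for inlining per the AMD guideline above
  static int isSet(BitBoard bb, unsigned int sq)
  {
    return (int)((bb >> sq) & 1);       // 1 if square sq is occupied in bb
  }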


Let's take this loop to show the drawbacks of MSC inline assembly
(gcc is smarter here, but uses AT&T syntax, which is imho a bit ugly):

  BitBoard one = 1;
  BitBoard bb = ..;                // some non-empty bitboard
  while (bb) {
    int i = bitScanReverse(bb);    // index of the most significant one bit
    ...                            // do something with square i
    bb ^= one << i;                // reset that bit
  }

The loop is time critical, and I don't like to call a subroutine with
push-params/call/ret overhead each time inside the loop body.

If a subroutine is relatively short (ok, I have 36 instructions here), the
push/call/ret overhead becomes relatively huge if all params are already inside
processor registers.

In C you can use macros - or in C/C++ inline functions, implicitly or
explicitly with special qualifiers such as inline, __inline or __forceinline.
The compiler usually (there are exceptions, e.g. if the inlined function is too
huge or instantiated too often) reproduces the whole code of the callee inside
the caller's code. Total code becomes larger but usually faster, thanks to big
trace or L1 code caches.
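For illustration, a plain-C bitScanReverse in that spirit might look like the
following - a hypothetical portable sketch (not the C routine Gerd timed
below), which folds the bitboard down by halves; it expects bb != 0 and
returns 0..63:

  typedef unsigned long long BitBoard;   // assumed 64-bit bitboard type

  static __inline unsigned int bitScanReverseC(BitBoard bb)
  {
    unsigned int idx = 0;
    if (bb > 0xFFFFFFFFUL) { bb >>= 32; idx += 32; }  // MSB in the upper dword?
    if (bb > 0xFFFF)       { bb >>= 16; idx += 16; }
    if (bb > 0xFF)         { bb >>=  8; idx +=  8; }
    if (bb > 0xF)          { bb >>=  4; idx +=  4; }
    if (bb > 0x3)          { bb >>=  2; idx +=  2; }
    return idx + (unsigned int)(bb >> 1);             // +1 if bb is 2 or 3
  }

Because it is ordinary C, the compiler can keep the bitboard in registers and
schedule it freely around the surrounding loop body.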

Another aspect of inlining is that otherwise sequential functions may pair
better, since two independent instruction chains can be interleaved by the
compiler's scheduler, with fewer dependencies and better IPC (instructions per
cycle).

For instance, if you have two independent chains with an IPC of only one each,
instead of the maximum of three - and you have enough free registers (which
might be a problem on 32-bit x86) - then it is possible to raise the IPC up to
two and to "double" the performance due to better "parallel" work...

Don't confuse inline asm with inlining in C in general - those are disjoint
properties. An inline-asm function may be inlined or not; the latter implies
the "usual" function call.

Here is some assembly generated by the MSC6 compiler with this inlined
inline-asm routine:

__forceinline
unsigned int bitScanReverse(BitBoard bb)
{
  __asm
  {
    bsr eax,[bb]    // scan the low dword
    bsr eax,[bb+4]  // scan the high dword; if it is zero, ZF is set and
                    // eax is left with the low-dword result
    setnz dl        // dl = 1 if the high dword was non-zero
    shl dl,5        // dl = 0 or 32
    xor al,dl       // add 32 to the index if the high dword was scanned
  }
}
    ...
    int i = bitScanReverse(bb);
    ...

  0000d	89 7c 24 10	 mov	 DWORD PTR _bb$[esp+24], edi
  00011	89 6c 24 14	 mov	 DWORD PTR _bb$[esp+28], ebp
  00015	bb 01 00 00 00	 mov	 ebx, 1
  0001a	8d 9b 00 00 00
	00		 npad	 6
$L665:
  00020	0f bd 44 24 10	 bsr	 eax, DWORD PTR _bb$[esp+24]
  00025	0f bd 44 24 14	 bsr	 eax, DWORD PTR _bb$[esp+28]
  0002a	0f 95 c2	 setne	 dl
  0002d	c0 e2 05	 shl	 dl, 5
  00030	32 c2		 xor	 al, dl
...
  00057	89 7c 24 10	 mov	 DWORD PTR _bb$[esp+24], edi
  0005b	89 6c 24 14	 mov	 DWORD PTR _bb$[esp+28], ebp
...
                         jxx     $L665

You see the bitboard is already in edi and ebp.
The compiler inlines the bsr-based routine, but the "fixed" assembly requires
passing the bitboard via the stack - and the compiler is not able to optimize
the redundant stores/loads away into something like this fragment:

	 mov	 ebx, 1
$L665:
	 bsr	 eax, edi
	 bsr	 eax, ebp
	 setne	 dl
	 shl	 dl, 5
	 xor	 al, dl
...
         jxx     $L665
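For comparison (not part of the original post): compiler intrinsics sidestep
this problem, because the compiler treats them like ordinary operations and can
keep the operand in registers. With gcc, for instance, a bit scan reverse can
be expressed via __builtin_clzll:

  // Hedged gcc-specific sketch: __builtin_clzll counts the leading zeros of an
  // unsigned long long, so 63 - clz is the index of the highest set bit.
  // Like bsr, the result is undefined for bb == 0.
  static inline unsigned int bitScanReverseBuiltin(unsigned long long bb)
  {
    return 63u - (unsigned int)__builtin_clzll(bb);
  }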

Inlining or not does not matter so much for speed with this assembly function -
29 cycles inlined versus 31 cycles not inlined. The C routine, on the other
hand, takes 23 cycles inlined versus 34 cycles not inlined.

Gerd


