Author: Gerd Isenberg
Date: 10:45:44 12/23/05
Go up one level in this thread
On December 23, 2005 at 12:25:31, Frederik Tack wrote: >On December 23, 2005 at 10:49:00, Gerd Isenberg wrote: > >>>About the assembler routine being slower : >>>I found that using a call by reference instead of a call by value speeds up the >>>function in Delphi a great deal because you get the adress of the bitboard >>>directly in the EAX register, don't know about C though but perhaps it will be >>>the same or perhaps its just a thing of the Delphi compiler. >> >> >>Yes, but even if the pointer is passed via a register (fastcall calling >>convention in C is via ECX or EDX in 32-bit mode) your bitboard has to be in >>memory - even if it is already in a register pair. > >That's true, it has to be a variable that is passed. If i have to scan a >constant bitboard or a result from a logical expression, i use another function. >Most of the times though, its a variable. > >> >>Does Delfi inline short functions? Or is it allways a function call? >> > >Not sure what you mean with 'short function'. I'm not familiar with C. Something like up to two cachelines or 128 bytes. AMD says something about 25 machine instructions. ------------------------------------------------------------------------------ Software Optimization Guide for AMD64 Processors 25112 Rev. 3.06 September 2005 7.3 Inline Functions Optimization Use function inlining when: • A function is called from just one site in the code. (For the C language, determination of this characteristic is made easier if functions are explicitly declared static unless they require external linkage.) • A function—once inlined—contains fewer than 25 machine instructions. Application This optimization applies to: • 32-bit software • 64-bit software Rationale There are advantages and disadvantages to function inlining. On the one hand, function inlining eliminates function-call overhead and allows better register allocation and instruction scheduling at the site of the function call. The disadvantage of function inlining is decreased code reference locality, which can increase execution time due to instruction cache misses. For functions that create fewer than 25 machine instructions once inlined, it is likely that the functioncall overhead is close to, or more than, the time spent executing the function body. In these cases, function inlining is recommended. Function-call overhead on the AMD Athlon 64 and AMD Opteron processors can be low because calls and returns are executed very quickly due to the use of prediction mechanisms. However, there is still overhead due to passing function arguments through memory, which creates store-to-loadforwarding dependencies. (In 64-bit mode, this overhead is typically avoided by passing more arguments in registers, as specified in the AMD64 Application Binary Interface [ABI] for the operating system.) For longer functions, the benefits of reduced function-call overhead give diminishing returns. A function that results in the insertion of more than 500 machine instructions at the call site should probably not be inlined. Some larger functions might consist of multiple, relatively short paths that are negatively affected by function overhead. In such a case, it can be advantageous to inline larger functions. Profiling information is the best guide in determining whether to inline such large functions. ------------------------------------------------------------------------------ Lets take this loop to show the drawbacks for msc inline-assembly: (gcc is smarter here, but has ata-syntax which is imho a bit ugly). BitBoard one = 1; BitBoard bb = ..; while (bb) { int i = bitScanReverse(bb); ... bb ^= one << i; // reset bit } The loop is time critical and i don't like to call a subroutine with pushparams/call/retx overhead each time inside the loop body. If a subroutine is relative short (ok, i have 36 instructions here), the push/call/ret-overhead becomes relative huge if all params are already inside a processor register. In C you can use macros - or in C/C++ inline functions, implicitly and explicitely with special qualifiers such as inline, __inline or __forceinline. The compiler usually (there are exceptions, if the inlined fuctions is too huge or incarnated too often) reproduces or incarnates the whole code of the callee inside the callers code. Total code becomes larger but usually faster due to big trace- or L1-code caches. An aspect of inlining is that otherwise sequential functions may pair better since two independent instruction chains may be interlaced scheduled by the compiler with less dependencies and better ipc. For instance if you have two independent chains with only an ipc of one each, instead of the max of three - and you have enough registers free (which might be a problem with x86/32) - then it is possible to improve ipc up to two and to "double" the performance due to better "parallel" work... Don't confuse inline-asm with inlining in C in general. That are disjoint properties. An inline-asm functions may be inlined or not - the second implies the "usual" function call. Here some generated assembly of the msc6-compiler with this inlined inline-asm: __forceinline unsigned int bitScanReverse(BitBoard bb) { __asm { bsr eax,[bb] bsr eax,[bb+4] setnz dl shl dl,5 xor al,dl } } ... int i = bitScanReverse(bb); ... 0000d 89 7c 24 10 mov DWORD PTR _bb$[esp+24], edi 00011 89 6c 24 14 mov DWORD PTR _bb$[esp+28], ebp 00015 bb 01 00 00 00 mov ebx, 1 0001a 8d 9b 00 00 00 00 npad 6 $L665: 00020 0f bd 44 24 10 bsr eax, DWORD PTR _bb$[esp+24] 00025 0f bd 44 24 14 bsr eax, DWORD PTR _bb$[esp+28] 0002a 0f 95 c2 setne dl 0002d c0 e2 05 shl dl, 5 00030 32 c2 xor al, dl ... 00057 89 7c 24 10 mov DWORD PTR _bb$[esp+24], edi 0005b 89 6c 24 14 mov DWORD PTR _bb$[esp+28], ebp ... jxx $L665 You see the Bitboard is already in edi, ebp. The compiler inlines the bsf-based routine, but the "fixed" assembly requires passing the bitboard via stack - and the compiler is not able to "optimize" the redundant store/loads away like in this fragment. mov ebx, 1 $L665: bsr eax, edi bsr eax, ebp setne dl shl dl, 5 xor al, dl ... jxx $L665 Inlining or not inlining does not matter so much in speed with this assembly function - 29 cycles inlined versus 31 cycles not inlined. The C-routine has either 23 cycles inlined versus 34 cycles not inlined. Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.