Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: Branchless code

Author: Eugene Nalimov
Date: 14:11:33 11/19/02
Sorry, probably too much details were lost during explanation. Here is simple
example:

int foo (int i)
{
    switch (i) {
    case 10:
        i += 1;
        break;
    case 11:
        i += 2;
        break;
    case 12:
        i += 3;
        break;
    case 13:
        i += 4;
        break;
    case 14:
        i += 5;
        break;
    case 15:
        i += 6;
        break;
    case 16:
        i += 7;
        break;
    case 17:
        i += 8;
        break;
    }
    return bar(i*i);
}

And here is what compiler can generate for it (I compiled it with -O1 flag,
optimizing for space, to simplify some things):

_i$ = 8							; size = 4
_foo	PROC NEAR					; COMDAT
	mov	eax, DWORD PTR _i$[esp-4]
	lea	ecx, DWORD PTR [eax-10]
	cmp	ecx, 7
	ja	SHORT $L473
	jmp	DWORD PTR $L494[ecx*4]
$L476:
	inc	eax
	jmp	SHORT $L473
$L477:
	inc	eax
	inc	eax
	jmp	SHORT $L473
$L478:
	add	eax, 3
	jmp	SHORT $L473
$L479:
	add	eax, 4
	jmp	SHORT $L473
$L480:
	add	eax, 5
	jmp	SHORT $L473
$L481:
	add	eax, 6
	jmp	SHORT $L473
$L482:
	add	eax, 7
	jmp	SHORT $L473
$L483:
	add	eax, 8
$L473:
	mov	ecx, eax
	imul	ecx, eax
	push	ecx
	call	_bar
	pop	ecx
	ret	0
$L494:
	DD	$L476
	DD	$L477
	DD	$L478
	DD	$L479
	DD	$L480
	DD	$L481
	DD	$L482
	DD	$L483
_foo	ENDP

So, we have 2 branches -- one for the case when value is out of the range, and
second is the indirect branch that is very heard to predict. On VC you can
explain the compiler that there are no out-of-range values and get rid of the
fisrt branch, but indirect branch is there to stay.

If you are optimizing for speed and using profile feedback, sometimes compiler
can get rid of the indirect branch for most frequent cases. I.e. if compiler
knows that in 90% of cases i equals 12, than it can generate something like

_i$ = 8							; size = 4
_foo	PROC NEAR					; COMDAT
	mov	eax, DWORD PTR _i$[esp-4]
        cmp     eax, 12
        jne     L1
	add	eax, 3
$L473:
	mov	ecx, eax
	imul	ecx, eax
	push	ecx
	call	_bar
	pop	ecx
	ret	0
L1:
	lea	ecx, DWORD PTR [eax-10]
	cmp	ecx, 7
	ja	SHORT $L473
	jmp	DWORD PTR $L494[ecx*4]
$L476:
	inc	eax
	jmp	SHORT $L473
$L477:
	add	eax, 2
	jmp	SHORT $L473
$L478:
	add	eax, 3
	jmp	SHORT $L473
$L479:
	add	eax, 4
	jmp	SHORT $L473
$L480:
	add	eax, 5
	jmp	SHORT $L473
$L481:
	add	eax, 6
	jmp	SHORT $L473
$L482:
	add	eax, 7
	jmp	SHORT $L473
$L483:
	add	eax, 8
	jmp	SHORT $L473
$L494:
	DD	$L476
	DD	$L477
	DD	0
	DD	$L479
	DD	$L480
	DD	$L481
	DD	$L482
	DD	$L483
_foo	ENDP

Now for most often (and almost always predicted) case there is no branch
misprediction, but for other cases there are 2 mispredictions. If we assume that
misprediction penalty is 15 cycles, and we don't call function with out-of-range
value, for the "naive" code switch cost is 5 instructions, one of them is
mispredicted branch, so total is ~20 cycles. For the "profile-guided" code
switch cost is 3 instructions (no mispredicted branches) in 90% of cases, and 7
instructions (2 mispredicted branches) in 10% of cases, so total switch cost is
0.9*3 + 0.1*37 == less than 7 cycles.

Thanks,
Eugene


On November 19, 2002 at 16:41:11, Robert Hyatt wrote:

>On November 19, 2002 at 16:29:41, Eugene Nalimov wrote:
>
>>You don't need asm file to figure out what is going on. For a large switch table
>>compiler generated jump table, and code should look like
>>    cmp eax, upper_switch_value+1
>>    jnc default_case
>>    mov eax, jump_table[eax*4]
>>    jmp eax
>>Here indirect branch is not better than indirect function call. Plus, as Gunnar
>>mentioned, there is some overhead.
>
>
>What about a jump to a jump instead?  That ought to be more predictable as the
>second jump will _always_ be taken and to the same target address.  IE the first
>"switch" I saw did something like this:
>
>assuming ebx has the switch case (0,1,..., N-1)
><check range>
>lea  eax, jumptable+ebx*4
>jmp [eax]
>
>And jumptable looked like this:
>
>jumptable:  jmp   case0
>                   jmp   case1
>
>etc...
>
>That is loosely translated into x86 assembly as this was not an x86
>architecture,
>and yes, the ebx*4 is not quite the right number, but it was the idea rather
>than
>trying to adjust ebx by a non-power-of-2 and I chose the easy way for
>illustration...
>
>the jmp [eax] can't be predicted, but the rest in the jump table can be.
>
>
>
>>
>>Thanks,
>>Eugene
>>
>>On November 19, 2002 at 14:40:50, Robert Hyatt wrote:
>>
>>>On November 19, 2002 at 04:41:59, Gunnar Andersson wrote:
>>>
>>>>On November 18, 2002 at 20:09:27, Eugene Nalimov wrote:
>>>>
>>>>>Indirect call costs *at least* the same as conditional branch, sometimes much
>>>>>more.
>>>>
>>>>I've discovered that arrays of function pointers can outperform switch
>>>>statements, at least with GCC and many cases.  To me the generated assembly code
>>>>for the switch statement contains a jump table anyway, and there's some extra
>>>>assembly code essentially performing range checking.
>>>>
>>>>/ Gunnar
>>>
>>>
>>>It would be interesting to look at the .s file from gcc to see what it is doing.
>>> The
>>>indirect calls should be bad.  Of course the range checking in a switch is also
>>>bad,
>>>but I have seen compilers that would let me turn this off...
ASM masters: here another related question Dezhi Zhao 20:10:15 11/19/02
Re: Branchless code Robert Hyatt 15:08:59 11/19/02
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.