Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: Branchless code

Author: Robert Hyatt
Date: 15:08:59 11/19/02
On November 19, 2002 at 17:11:33, Eugene Nalimov wrote:

>Sorry, probably too much details were lost during explanation. Here is simple
>example:
>
>int foo (int i)
>{
>    switch (i) {
>    case 10:
>        i += 1;
>        break;
>    case 11:
>        i += 2;
>        break;
>    case 12:
>        i += 3;
>        break;
>    case 13:
>        i += 4;
>        break;
>    case 14:
>        i += 5;
>        break;
>    case 15:
>        i += 6;
>        break;
>    case 16:
>        i += 7;
>        break;
>    case 17:
>        i += 8;
>        break;
>    }
>    return bar(i*i);
>}
>
>And here is what compiler can generate for it (I compiled it with -O1 flag,
>optimizing for space, to simplify some things):
>
>_i$ = 8							; size = 4
>_foo	PROC NEAR					; COMDAT
>	mov	eax, DWORD PTR _i$[esp-4]
>	lea	ecx, DWORD PTR [eax-10]
>	cmp	ecx, 7
>	ja	SHORT $L473
>	jmp	DWORD PTR $L494[ecx*4]
>$L476:
>	inc	eax
>	jmp	SHORT $L473
>$L477:
>	inc	eax
>	inc	eax
>	jmp	SHORT $L473
>$L478:
>	add	eax, 3
>	jmp	SHORT $L473
>$L479:
>	add	eax, 4
>	jmp	SHORT $L473
>$L480:
>	add	eax, 5
>	jmp	SHORT $L473
>$L481:
>	add	eax, 6
>	jmp	SHORT $L473
>$L482:
>	add	eax, 7
>	jmp	SHORT $L473
>$L483:
>	add	eax, 8
>$L473:
>	mov	ecx, eax
>	imul	ecx, eax
>	push	ecx
>	call	_bar
>	pop	ecx
>	ret	0
>$L494:
>	DD	$L476
>	DD	$L477
>	DD	$L478
>	DD	$L479
>	DD	$L480
>	DD	$L481
>	DD	$L482
>	DD	$L483
>_foo	ENDP
>
>So, we have 2 branches -- one for the case when value is out of the range, and
>second is the indirect branch that is very heard to predict. On VC you can
>explain the compiler that there are no out-of-range values and get rid of the
>fisrt branch, but indirect branch is there to stay.
>
>If you are optimizing for speed and using profile feedback, sometimes compiler
>can get rid of the indirect branch for most frequent cases. I.e. if compiler
>knows that in 90% of cases i equals 12, than it can generate something like
>
>_i$ = 8							; size = 4
>_foo	PROC NEAR					; COMDAT
>	mov	eax, DWORD PTR _i$[esp-4]
>        cmp     eax, 12
>        jne     L1
>	add	eax, 3
>$L473:
>	mov	ecx, eax
>	imul	ecx, eax
>	push	ecx
>	call	_bar
>	pop	ecx
>	ret	0
>L1:
>	lea	ecx, DWORD PTR [eax-10]
>	cmp	ecx, 7
>	ja	SHORT $L473
>	jmp	DWORD PTR $L494[ecx*4]
>$L476:
>	inc	eax
>	jmp	SHORT $L473
>$L477:
>	add	eax, 2
>	jmp	SHORT $L473
>$L478:
>	add	eax, 3
>	jmp	SHORT $L473
>$L479:
>	add	eax, 4
>	jmp	SHORT $L473
>$L480:
>	add	eax, 5
>	jmp	SHORT $L473
>$L481:
>	add	eax, 6
>	jmp	SHORT $L473
>$L482:
>	add	eax, 7
>	jmp	SHORT $L473
>$L483:
>	add	eax, 8
>	jmp	SHORT $L473
>$L494:
>	DD	$L476
>	DD	$L477
>	DD	0
>	DD	$L479
>	DD	$L480
>	DD	$L481
>	DD	$L482
>	DD	$L483
>_foo	ENDP
>
>Now for most often (and almost always predicted) case there is no branch
>misprediction, but for other cases there are 2 mispredictions. If we assume that
>misprediction penalty is 15 cycles, and we don't call function with out-of-range
>value, for the "naive" code switch cost is 5 instructions, one of them is
>mispredicted branch, so total is ~20 cycles. For the "profile-guided" code
>switch cost is 3 instructions (no mispredicted branches) in 90% of cases, and 7
>instructions (2 mispredicted branches) in 10% of cases, so total switch cost is
>0.9*3 + 0.1*37 == less than 7 cycles.
>
>Thanks,
>Eugene

Does it handle the case where it is often more efficient to do something, then
undo it
if necessary?  IE the usual if-then-else case where you replace

if (c)
  do x;
else
  do y;

with

do y
if (c) {
  undo y
  do x
}

A lot of this seems awfully hard for the optimizer to handle, but I haven't
looked at MSVC
code in a long while.  gcc is ok...  but _only_ ok...




>
>
>On November 19, 2002 at 16:41:11, Robert Hyatt wrote:
>
>>On November 19, 2002 at 16:29:41, Eugene Nalimov wrote:
>>
>>>You don't need asm file to figure out what is going on. For a large switch table
>>>compiler generated jump table, and code should look like
>>>    cmp eax, upper_switch_value+1
>>>    jnc default_case
>>>    mov eax, jump_table[eax*4]
>>>    jmp eax
>>>Here indirect branch is not better than indirect function call. Plus, as Gunnar
>>>mentioned, there is some overhead.
>>
>>
>>What about a jump to a jump instead?  That ought to be more predictable as the
>>second jump will _always_ be taken and to the same target address.  IE the first
>>"switch" I saw did something like this:
>>
>>assuming ebx has the switch case (0,1,..., N-1)
>><check range>
>>lea  eax, jumptable+ebx*4
>>jmp [eax]
>>
>>And jumptable looked like this:
>>
>>jumptable:  jmp   case0
>>                   jmp   case1
>>
>>etc...
>>
>>That is loosely translated into x86 assembly as this was not an x86
>>architecture,
>>and yes, the ebx*4 is not quite the right number, but it was the idea rather
>>than
>>trying to adjust ebx by a non-power-of-2 and I chose the easy way for
>>illustration...
>>
>>the jmp [eax] can't be predicted, but the rest in the jump table can be.
>>
>>
>>
>>>
>>>Thanks,
>>>Eugene
>>>
>>>On November 19, 2002 at 14:40:50, Robert Hyatt wrote:
>>>
>>>>On November 19, 2002 at 04:41:59, Gunnar Andersson wrote:
>>>>
>>>>>On November 18, 2002 at 20:09:27, Eugene Nalimov wrote:
>>>>>
>>>>>>Indirect call costs *at least* the same as conditional branch, sometimes much
>>>>>>more.
>>>>>
>>>>>I've discovered that arrays of function pointers can outperform switch
>>>>>statements, at least with GCC and many cases.  To me the generated assembly code
>>>>>for the switch statement contains a jump table anyway, and there's some extra
>>>>>assembly code essentially performing range checking.
>>>>>
>>>>>/ Gunnar
>>>>
>>>>
>>>>It would be interesting to look at the .s file from gcc to see what it is doing.
>>>> The
>>>>indirect calls should be bad.  Of course the range checking in a switch is also
>>>>bad,
>>>>but I have seen compilers that would let me turn this off...
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.