Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Avoiding if ( expr ) and branch misprediction ?

Author: Gerd Isenberg

Date: 12:04:07 01/19/06

Go up one level in this thread


On January 19, 2006 at 14:46:13, Robert Hyatt wrote:

>On January 19, 2006 at 14:26:11, Gerd Isenberg wrote:
>
>>On January 19, 2006 at 12:45:10, Chan Rasjid wrote:
>>
>>>
>>>When using bitboards, often we have arrays that require only indices 0 / 1.
>>>eg if we have no underpromotions, then there is either 1/2 knights per color.
>>>Sometimes ,in such situations, codes may be re-written without using the if ()
>>>statement. But whether such simple tricks can avoid the cost of branch
>>>misprediction ( I only have a vague idea about this ).
>>>
>>>X[i] where i = 0 / 1;
>>>
>>>1) if ( a & b ){
>>>     j = X[1];
>>>   }else{
>>>     j = X[0];
>>>   }
>>>   //etc.. codes dependent on j.
>>>
>>>2) Now without if() :-
>>>
>>>   j = X[(a & b)!= 0];
>>>   //etc.. codes dependent on j.
>>
>>Hi Rasjid,
>>
>>whether a boolean statement works without branches depends on the compiler and
>>architecture. Recent x86 compiler should do at least 2) branchless, using a
>>setnz-instruction for (a & b)!= 0, for instance:
>>
>>xor   edx, edx ; clear dword
>>test  rax, rbx ; get the zero flag
>>setnz dl       ; edx := (a & b) ? 1 : 0
>>mov   edx, X[edx*4] ; assuming X is int array
>>
>>>
>>>How (if ? ) can such a simple re-coding cause an improvement. Can there
>>>be some situations related to the above with significanr misprediction
>>>overheads ?
>>
>>If the pattern is hard to predict and/or you have so many branches, that the
>>branch target buffer is likely to get trashed, i think it is a good idea to
>>avoid conditional jumps. Specially with very small conditional bodies in small,
>>but often used inlined functions or macros.
>
>One note.  A good branch predictor (I have not looked at PIV carefully) has the
>ability to do both "correlated and non-correlated prediction" with a tournament
>predictor.  The idea is that if you have lots of if (wtm) type things, they are
>all "correlated".  If one is taken, all are taken, and vice-versa.  That can be
>handled (Dec did this on the alpha).


Yes interesting - not sure about that on amd boxes.


>
>>
>>With pattern like 99% of the time 00010001... (for instance 0=then case taken,
>>1=else case taken) conditional jumps are fine. You may try to avoid else cases
>>if possible as well - or leave that and possible branchless tricks to the
>>"pogoable" compiler.
>>
>>While x86 branch replace instructions like setcc, cmovcc, sbb and adc are flag
>>dependent, they don't allow a more dense instruction scheduling for multiple
>>conditional assignments. That is not true for "sign flag" dependency after a
>>subtraction, where the sign flag is also present as MSB inside a register. So
>>one may use logical shift right 31 to get boolean {0,1}-values or arithmetical
>>shift right 31 (or cdq) for {0,-1}-masks, if you are aware of the range of the
>>signed or unsigned values to subtract, for instance:
>>
>>if ( a > b ) x += 7;
>>if ( c > d ) x += 5;
>>
>>// use < 0 expressions - considering ranges!
>>// this unsafe optimization can not be done by the compiler!
>>if ( (signed)(b-a) < 0 ) x += 7;
>>if ( (signed)(d-c) < 0 ) x += 5;
>>
>>// the branchless approach with
>>// some prospects of parallel work
>>x += ((signed)(b-a)>>31) & 7;
>>x += ((signed)(d-c)>>31) & 5;
>>
>>possible assembly:
>>
>>sub   ebx, eax ; b-a
>>sub   edx, ecx ; d-c
>>sar   ebx, 31
>>sar   edx, 31
>>and   ebx, 7
>>and   edx, 5
>>add   r08, ebx ; x += ((signed)(b-a)>>31) & 7
>>add   r08, edx ; x += ((signed)(d-c)>>31) & 5
>>
>>Pairs quite well, but P4 sucks due to slow shift.
>>On amd that will take about the same number of cycles as one sequential
>>conditional assignment. If processing vectors of this stuff, sse2 comes in mind.
>>
>>Gerd
>>
>
>I like the fact that you are now a "rax" person.  :)

Oups, r08 is probably wrong here - how is the syntax of using partial 32-, 16-
or 8-bit partial r08-r15 registers?

Yes, 64-bit register also have additional advantage by using the
test-instructions performing and-conditions without trashing registers which is
necessary in 32-bit mode via and/and/or.


>Of course, don't try that
>at home on your old X86 system.  I'm waiting on our new intel X86-64 cluster to
>"come alive" to see how intel's latest X86-64 stuff actually works.  Will
>provide some data before long I hope...

Curious about the shift performance on those intel boxes.
P4 was so awful by using the mmx-unit.


>
>
>>>
>>>Thanks
>>>Rasjid



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.