Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Avoiding if ( expr ) and branch misprediction ?

Author: Robert Hyatt

Date: 18:17:50 01/19/06

Go up one level in this thread


On January 19, 2006 at 15:04:07, Gerd Isenberg wrote:

>On January 19, 2006 at 14:46:13, Robert Hyatt wrote:
>
>>On January 19, 2006 at 14:26:11, Gerd Isenberg wrote:
>>
>>>On January 19, 2006 at 12:45:10, Chan Rasjid wrote:
>>>
>>>>
>>>>When using bitboards, often we have arrays that require only indices 0 / 1.
>>>>eg if we have no underpromotions, then there is either 1/2 knights per color.
>>>>Sometimes ,in such situations, codes may be re-written without using the if ()
>>>>statement. But whether such simple tricks can avoid the cost of branch
>>>>misprediction ( I only have a vague idea about this ).
>>>>
>>>>X[i] where i = 0 / 1;
>>>>
>>>>1) if ( a & b ){
>>>>     j = X[1];
>>>>   }else{
>>>>     j = X[0];
>>>>   }
>>>>   //etc.. codes dependent on j.
>>>>
>>>>2) Now without if() :-
>>>>
>>>>   j = X[(a & b)!= 0];
>>>>   //etc.. codes dependent on j.
>>>
>>>Hi Rasjid,
>>>
>>>whether a boolean statement works without branches depends on the compiler and
>>>architecture. Recent x86 compiler should do at least 2) branchless, using a
>>>setnz-instruction for (a & b)!= 0, for instance:
>>>
>>>xor   edx, edx ; clear dword
>>>test  rax, rbx ; get the zero flag
>>>setnz dl       ; edx := (a & b) ? 1 : 0
>>>mov   edx, X[edx*4] ; assuming X is int array
>>>
>>>>
>>>>How (if ? ) can such a simple re-coding cause an improvement. Can there
>>>>be some situations related to the above with significanr misprediction
>>>>overheads ?
>>>
>>>If the pattern is hard to predict and/or you have so many branches, that the
>>>branch target buffer is likely to get trashed, i think it is a good idea to
>>>avoid conditional jumps. Specially with very small conditional bodies in small,
>>>but often used inlined functions or macros.
>>
>>One note.  A good branch predictor (I have not looked at PIV carefully) has the
>>ability to do both "correlated and non-correlated prediction" with a tournament
>>predictor.  The idea is that if you have lots of if (wtm) type things, they are
>>all "correlated".  If one is taken, all are taken, and vice-versa.  That can be
>>handled (Dec did this on the alpha).
>
>
>Yes interesting - not sure about that on amd boxes.
>
>
>>
>>>
>>>With pattern like 99% of the time 00010001... (for instance 0=then case taken,
>>>1=else case taken) conditional jumps are fine. You may try to avoid else cases
>>>if possible as well - or leave that and possible branchless tricks to the
>>>"pogoable" compiler.
>>>
>>>While x86 branch replace instructions like setcc, cmovcc, sbb and adc are flag
>>>dependent, they don't allow a more dense instruction scheduling for multiple
>>>conditional assignments. That is not true for "sign flag" dependency after a
>>>subtraction, where the sign flag is also present as MSB inside a register. So
>>>one may use logical shift right 31 to get boolean {0,1}-values or arithmetical
>>>shift right 31 (or cdq) for {0,-1}-masks, if you are aware of the range of the
>>>signed or unsigned values to subtract, for instance:
>>>
>>>if ( a > b ) x += 7;
>>>if ( c > d ) x += 5;
>>>
>>>// use < 0 expressions - considering ranges!
>>>// this unsafe optimization can not be done by the compiler!
>>>if ( (signed)(b-a) < 0 ) x += 7;
>>>if ( (signed)(d-c) < 0 ) x += 5;
>>>
>>>// the branchless approach with
>>>// some prospects of parallel work
>>>x += ((signed)(b-a)>>31) & 7;
>>>x += ((signed)(d-c)>>31) & 5;
>>>
>>>possible assembly:
>>>
>>>sub   ebx, eax ; b-a
>>>sub   edx, ecx ; d-c
>>>sar   ebx, 31
>>>sar   edx, 31
>>>and   ebx, 7
>>>and   edx, 5
>>>add   r08, ebx ; x += ((signed)(b-a)>>31) & 7
>>>add   r08, edx ; x += ((signed)(d-c)>>31) & 5
>>>
>>>Pairs quite well, but P4 sucks due to slow shift.
>>>On amd that will take about the same number of cycles as one sequential
>>>conditional assignment. If processing vectors of this stuff, sse2 comes in mind.
>>>
>>>Gerd
>>>
>>
>>I like the fact that you are now a "rax" person.  :)
>
>Oups, r08 is probably wrong here - how is the syntax of using partial 32-, 16-
>or 8-bit partial r08-r15 registers?

I honestly don't know, because I did not try to see if such a thing was
possible.  The things I did in X86-64 asm were for 64 bit operations, so rax or
r8 were just fine.  No idea what can be done with respect to 32 bit partial
registers for the new ones...


>
>Yes, 64-bit register also have additional advantage by using the
>test-instructions performing and-conditions without trashing registers which is
>necessary in 32-bit mode via and/and/or.
>
>
>>Of course, don't try that
>>at home on your old X86 system.  I'm waiting on our new intel X86-64 cluster to
>>"come alive" to see how intel's latest X86-64 stuff actually works.  Will
>>provide some data before long I hope...
>
>Curious about the shift performance on those intel boxes.
>P4 was so awful by using the mmx-unit.


Should get some data early next week...


>
>
>>
>>
>>>>
>>>>Thanks
>>>>Rasjid



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.