Author: Gerd Isenberg
Date: 12:04:07 01/19/06
Go up one level in this thread
On January 19, 2006 at 14:46:13, Robert Hyatt wrote:
>On January 19, 2006 at 14:26:11, Gerd Isenberg wrote:
>
>>On January 19, 2006 at 12:45:10, Chan Rasjid wrote:
>>
>>>
>>>When using bitboards, often we have arrays that require only indices 0 / 1.
>>>eg if we have no underpromotions, then there is either 1/2 knights per color.
>>>Sometimes ,in such situations, codes may be re-written without using the if ()
>>>statement. But whether such simple tricks can avoid the cost of branch
>>>misprediction ( I only have a vague idea about this ).
>>>
>>>X[i] where i = 0 / 1;
>>>
>>>1) if ( a & b ){
>>> j = X[1];
>>> }else{
>>> j = X[0];
>>> }
>>> //etc.. codes dependent on j.
>>>
>>>2) Now without if() :-
>>>
>>> j = X[(a & b)!= 0];
>>> //etc.. codes dependent on j.
>>
>>Hi Rasjid,
>>
>>whether a boolean statement works without branches depends on the compiler and
>>architecture. Recent x86 compiler should do at least 2) branchless, using a
>>setnz-instruction for (a & b)!= 0, for instance:
>>
>>xor edx, edx ; clear dword
>>test rax, rbx ; get the zero flag
>>setnz dl ; edx := (a & b) ? 1 : 0
>>mov edx, X[edx*4] ; assuming X is int array
>>
>>>
>>>How (if ? ) can such a simple re-coding cause an improvement. Can there
>>>be some situations related to the above with significanr misprediction
>>>overheads ?
>>
>>If the pattern is hard to predict and/or you have so many branches, that the
>>branch target buffer is likely to get trashed, i think it is a good idea to
>>avoid conditional jumps. Specially with very small conditional bodies in small,
>>but often used inlined functions or macros.
>
>One note. A good branch predictor (I have not looked at PIV carefully) has the
>ability to do both "correlated and non-correlated prediction" with a tournament
>predictor. The idea is that if you have lots of if (wtm) type things, they are
>all "correlated". If one is taken, all are taken, and vice-versa. That can be
>handled (Dec did this on the alpha).
Yes interesting - not sure about that on amd boxes.
>
>>
>>With pattern like 99% of the time 00010001... (for instance 0=then case taken,
>>1=else case taken) conditional jumps are fine. You may try to avoid else cases
>>if possible as well - or leave that and possible branchless tricks to the
>>"pogoable" compiler.
>>
>>While x86 branch replace instructions like setcc, cmovcc, sbb and adc are flag
>>dependent, they don't allow a more dense instruction scheduling for multiple
>>conditional assignments. That is not true for "sign flag" dependency after a
>>subtraction, where the sign flag is also present as MSB inside a register. So
>>one may use logical shift right 31 to get boolean {0,1}-values or arithmetical
>>shift right 31 (or cdq) for {0,-1}-masks, if you are aware of the range of the
>>signed or unsigned values to subtract, for instance:
>>
>>if ( a > b ) x += 7;
>>if ( c > d ) x += 5;
>>
>>// use < 0 expressions - considering ranges!
>>// this unsafe optimization can not be done by the compiler!
>>if ( (signed)(b-a) < 0 ) x += 7;
>>if ( (signed)(d-c) < 0 ) x += 5;
>>
>>// the branchless approach with
>>// some prospects of parallel work
>>x += ((signed)(b-a)>>31) & 7;
>>x += ((signed)(d-c)>>31) & 5;
>>
>>possible assembly:
>>
>>sub ebx, eax ; b-a
>>sub edx, ecx ; d-c
>>sar ebx, 31
>>sar edx, 31
>>and ebx, 7
>>and edx, 5
>>add r08, ebx ; x += ((signed)(b-a)>>31) & 7
>>add r08, edx ; x += ((signed)(d-c)>>31) & 5
>>
>>Pairs quite well, but P4 sucks due to slow shift.
>>On amd that will take about the same number of cycles as one sequential
>>conditional assignment. If processing vectors of this stuff, sse2 comes in mind.
>>
>>Gerd
>>
>
>I like the fact that you are now a "rax" person. :)
Oups, r08 is probably wrong here - how is the syntax of using partial 32-, 16-
or 8-bit partial r08-r15 registers?
Yes, 64-bit register also have additional advantage by using the
test-instructions performing and-conditions without trashing registers which is
necessary in 32-bit mode via and/and/or.
>Of course, don't try that
>at home on your old X86 system. I'm waiting on our new intel X86-64 cluster to
>"come alive" to see how intel's latest X86-64 stuff actually works. Will
>provide some data before long I hope...
Curious about the shift performance on those intel boxes.
P4 was so awful by using the mmx-unit.
>
>
>>>
>>>Thanks
>>>Rasjid
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.