Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: some quotes on switch and indirect branches

Author: Gerd Isenberg

Date: 12:03:50 11/25/05

Go up one level in this thread


<snip>
>>If you have a lot of code in your eval and elsewhere, where you simply do some
>>conditional add or sub - and the condition is rather random and therefor likely
>>to produce some mispredictions with this very short bodies - you are free to use
>>the boolean*int multiplication on C level. For Less/greater compares you are
>>able to convert the compare expression to (arithmeticExpression < 0) and to do a
>>shift arithmetic right with arithmeticExpression and (sizeof(int)*8-1),  to get
>>a {-1,0}-range - a mask to apply the boolean multiplication by bitwise and.
>>
>>From the initial conditional statement ...
>>
>>if ( distance1 < distance2 )
>>   eval += bonus;
>>
>>... via < 0 ...
>>
>>if ( distance1 - distance2 < 0 )
>>   eval += bonus;
>>
>>... to the final sub(lea)-sar-and expression:
>>
>>eval += ( (distance1 - distance2) >> 31 ) & bonus;
>
>Here and there i'm actually using that already. Very good
>that you mention it again, Gerd.
>
>I should use it perhaps more. Didn't systematically use it yet. Always
>considered compilers to solve my problem.
>
>Perhaps i should have less trust in Nalimov's CMOV implementation getting in the
>main source code tree for visual x64 2006 :)
>
>>
>>That's really cheap, if P4 don't cares.
>
>Forget Prescott.
>
>>It requires only one extra register - while cmov requires two.
>>If one argument is a constant, while the other is scaled by 2,4,8 it is
>>still one lea instruction to produce the difference in a register.
>>(Otherwise i prefere mov-add/sub because lea is two cycles on AMD64).
>>
>>Of course it is weird to sacrifice readability to avoid branches!
>
>That's one of the main concerns in my code and one of the most important reason
>why i have so many branches.

Yes.

>
>Like if you have code like this (randomly written down example):
>
> if( quickboard[sq_e3] == brook && quickboard[sq_e4] == bpawn
>  && quickboard[sq_e5] == bking )
>   s += 5;
>
>You can of course consider doing this:
>
>  int orall = ((quickboard[sq_e3]-brook)
>               |(quickboard[sq_e4]-bpawn)
>               |(quickboard[sq_e5]-bking)
>              );
>
>  if( !orall ) s += 5;
>
>Generating a CMOV by that.
>
>Avoiding a near sure misprediction somewhere.
>Like branch 1 might have normally spoken a prediction ratio of 85% to get
>not taken, still that means 15% gets misprediction rate which hurts *badly*.
>
>p.s. how to apply a shifting trick to the above?


The orall trick is aleady a nice improvement.
For the final condition the sar-trick is not (cheap) applicable, because you
need the sign bit set.

Since the probability seems already rather low, that orall becomes zero -
and s += 5 is almost not taken - i would leave the conditional jump here.
IMHO even a compiler should not produce a cmov here, with a conjunction of so
many low probability conditions. Otoh if you have a lot of that stuff...

What you may try (in other random cases) with equal- or unequal-compares, to use
the boolean result of the condition as an integer, since true == 1 and
false == 0, and to build a {-1,0}-, aka {0xffffffff,0}-mask by unary minus or
binary minus one:

if( !orall ) s += 5;
s += 5 & -(orall==0);
s += 5 & ((orall!=0) - 1);

This should usually produce some assembly like this, with setXX instruction,
e.g. with orall already in ecx:

xor  eax, eax ; zero eax
test ecx, ecx ; some bit set in orall?
setnz al      ; eax = orall != 0 ? 1 : 0
dec  eax      ; eax = orall != 0 ? 0 : -1
and  eax, 5
add  [s], eax

A bit more expensive than the sar trick - also due to the flag-depency and
general longer dependency chain. Of course the more neighboured conditional
instructions you replace that way, the better interleaved scheduling may help to
break dependencies and to improve ipc.

If you collect some more of your orall board patterns that way, you may also
hope on intel-C++ with sse2/3 integer-vectorizer by Aart J.C. Bik.

Imagine you have a vector orall[16] (or multiples) and a correspondend
weight-vetor of signed or unsigned int/short/char ( depending on your value
range) - then the intel compiler might be able to perform appropriate
sse2/3-instructions for you for the desired bool*int/short/char dot-product,
e.g.:

int orall[16];
int featureWeight[16] = {5,.....};

// compute your oralls with individual board patterns
...
// and do the dot product
for (i=0; i < 16; i++)
 s += -(orall[i]==0) & featureWeight[i];

You may also remember the ~40-cycle sse2 bool[64]*byte[64] dot-product with
"portable" sse2-intrinsics. bool[64]*ushort[64] is also not that expensive.


>
>>So for better readability and maintainability i suggest some macros or inliners:
>
>>__forceinline
>>int ifAlessBthenCelse0(int a, int b, int c) { ((a-b)>>31) & c; }
>>
>>eval += ifAlessBthenCelse0(distance1, distance2, bonus);-)
>
>bah, putting for once comments with your code would be more helpful :)


At least you have some central instances, where you can later change things in
on block. Why not some macros or inliners in a architecture dependent header
file? Replacing is error prone both ways - even if it is rather syntactical.


>
>>Take care not accidently to violate the us. patent by sun!
>
>:)
>
>>The "Apparatus for directing a parallel processing computing device to form an
>>absolute value of a signed value":
>>
>>abs(x) ::= ( x ^ (x>>31) ) - (x>>31)
>
>#ifdef INTEGERIS32BITS
>abs(x) :: = x & 0x7fffffff; // 32 bits int

Oups - yes that leaves positive numbers unchanged and yes, it makes negative
numbers positive - but not like abs is intended. We have two's complement with
integers in C (or at least guaranteed with unsigned int, but practically also
with size int - except some antique compilers for some antique hardware with
one's complement or whatever alus).

-x ::= ~x + 1
-x ::= ~(x - 1)

Note that all least significant bits including the "first" one (from right to
left) remain unchanged, while all other bits toggle.
So 4 with 8-bits is binary 0000 0100
while -4 is binary         1111 1100

Clearing the sign bit is fine with floats or doubles - but with integers your
abs(-1) produces 0x7fffffff or INT_MAX.

>#else
>..
>
>>
>>The {-1,0}-mask is used here to build the one's complement if negative, while
>>subtract minus one does the two's complement.
>>
>>http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=/netahtml/search-adv.htm&r=1&f=G&l=50&d=ptxt&S1=6073150&OS=6073150&RS=6073150
>>
>>May be intel's weird shift implementation in P4 was the revenge ;-)
>
>:)
>
>>What about to force cdq for x86 in C by casting to long long or __int64 and then
>>take the high dword where the sign extension occured as the mask? Does that
>>violate the patent?
>>
>>int abs(int x) {
>>  int m = (int)( (unsigned __int64)((__int64)x) >> 32 );
>>  return (x ^ m) - m;
>>}
>
>honestely, 64 bits is so slow and all those quads have server2003 installed
>which is 32 bits, so having a program that's 32 bits code is really advised.
>
>then you can show up quad every tournament...

I do not recommend it for 64-bits.

For 32-bit mode see the generated assembly - not that bad.
To force the cdq-instruction - sign extend from EAX into EDX:EAX.
For 64-bit mode it is the question whether the compiler uses same assembly with
CDQ, or - most likely i guess - a CDQE-instruction, to sign extend EAX into RAX.
Where the shift right 32 becomes real, instead of using edx.

May be this one is sufficent for 64-bit mode not to violate the patent :-)

abs(x) ::= ( x + (x>>31) ) ^ (x>>31)

Cheers,
Gerd

>
>>and the generated assembly:
>>
>>?abs@@YIHH@Z PROC NEAR			; abs, COMDAT
>>; _x$ = ecx
>>  00000	8b c1		 mov	 eax, ecx
>>  00002	99		 cdq
>>  00003	8b c2		 mov	 eax, edx
>>  00005	33 c1		 xor	 eax, ecx
>>  00007	2b c2		 sub	 eax, edx
>>  00009	c3		 ret	 0
>>?abs@@YIHH@Z ENDP			; abs
>>
<snip>



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.