Author: Gerd Isenberg
Date: 08:15:32 11/23/05
Go up one level in this thread
On November 22, 2005 at 18:42:17, Dieter Buerssner wrote:
>On November 22, 2005 at 03:23:46, Gerd Isenberg wrote:
>
>>Wow, gcc - that's it. Shift is relative expensive on P4, something like four
>>cycles (shift alu of the mmx-unit is used with "long" data pathes). Shift is
>>cheap on amd cpus.
>
>Hi Gerd, thanks for all the info and explanations, also in the other posts. In
>this context I wonder, how shift compares to cmovx?
>
>Anyway, I think you have guessed part of my opinion (besides the totally
>technical discussion) - I would not care (anymore) to try to look whether
>if/else or ? : produce better code. The other "tricks" you mentioned I might
>consider differently. Eugene's message was also interesting in this context. No
>doubt, we can save some cycles, by microoptimizing for some specific
>environment. And it can be much fun, too. I don't doubt either, that it can be
>worthwhile (in the sense of productive) for chess engines (many examples you
>have shown for typical bitbase routines). But not in such "simple" cases, that
>started this discussion. The next compiler/hardware combination may behave
>different again. Here, I prefer to just trust the compiler. Of course, when the
>discussed code is a clear bottleneck (which probably is not the case), more
>effort is justified.
>
>Cheers,
>Dieter
Hi Dieter,
you are right of course - it is nice if you trust compilers - i do it also most
of the time. It is fun and probably my abnormal ambition to outsmart complilers
here and there. In this case not by assembly but on C-level.
Also in relation to other threads in CCC, programming and specially low level
programming issues became very rare - so as an assembler junkie i will take
every chance to elaborate on that issues ;-)
Compiler may take care on weird intel P4 stuff, where a reg32-shift is
implemented like...
movd mmx, reg32
psrad mmx, count
movd reg32, mmx
... (a bit exaggerated), but i would like to choose a code-generator for athlon
and amd64 with fast shift (as fast as most risc-like instructions, one cycle
direct path) and fast mul.
The initial issue was switch versus if-else-else. The conditional assignment was
only used as synonym or shortcut for if-else - and there was no intention to ask
for different generated code by compilers here (Funny that there is a difference
at all).
It was about if-elses, where different cases of depth in a search or qsearch
assign different constants to the same variable (e.g. futility margin) - so that
the whole switch or if-else may be substituted by some array access indexed by
depth (maybe plus some offset to avoid negative indices).
My suggestion was to use that simple and cheap branchless expression for that
f(depth)-function, based on syntactical C-level substitution, specially for
three cases, where two independent expressions, properly scheduled may be
computed in parallel - also considering value range issues like possible
over/underflows in the difference (INT_MIN <= fm[i]-fm[i+1] <= INT_MAX).
fm = depth < -1 ? fm1 : depth < 0 ? fm2 : fm3;
fm = fm3 + ( (depth >>31) & (fm2-fm3))
+ (((depth+1)>>31) & (fm1-fm2));
I agree with Eugene on the cmov issue, also mentioned in Agner Fog's
optimization paper http://www.agner.org/assem/pentopt.pdf. Processors are today
really good in predicting branches. For instance, one should avoid branchless
max based on cmov or sar for statements like ...
if ( score > alfa )
alfa = score;
... in a chess program, where the condition is most often false, if previous
score >= beta was already handled.
max(x,y) := x - ((x-y) & ((x-y)>>31));
min(x,y) := y + ((x-y) & ((x-y)>>31));
There are very rare cases where branchless sar-min or -max pay off, because in
opposite to the conditional f(depth)-assignment the additional code and
register-usage is much greater than using cmp, jcc-branches.
Cheers,
Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.