Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: some quotes on switch and indirect branches

Author: Gerd Isenberg
Date: 15:14:50 11/23/05
On November 23, 2005 at 15:45:49, Vincent Diepeveen wrote:

>On November 22, 2005 at 20:13:04, Eugene Nalimov wrote:
>
>>On November 22, 2005 at 19:07:16, Dieter Buerssner wrote:
>>
>>>On November 21, 2005 at 20:00:24, Eugene Nalimov wrote:
>>>
>>>>On November 21, 2005 at 18:10:54, Dieter Buerssner wrote:
>>>>
>>>>>[...]
>>>>>I guess, you mean this as a substitution for
>>>>>  if (depth < 0)
>>>>>    fm = fm1;
>>>>>  else
>>>>>    fm = fm2;
>>>>>
>>>>>I am surprised, that compilers are not able to do this themselves. I
>>>>
>>>>I several times tried to modify Visual C to recognize additional cases where we
>>>>should emit conditional moves (last time was probably a year ago for
>>>>x64-targeting compiler). Every time I could demonstrate win on a small
>>>>artificial test case, but every large real world program either showed no gain
>>>>or slowed down.
>>>>
>>>>I suspect there are several reasons for this:
>>>>* branch predictors are good, and majority of branches can be correctly
>>>>predicted
>>>>* CMOV is long instruction; short branch is shorter, so program with less CMOVs
>>>>fits better into cache
>>>>* there is no 8-bit form of CMOV
>>>>* >* for invalid address "CMOV reg, memory" will give you access violation even if
>>>>condition is false.
>>>
>>>I don't really understand the last reason.
>>>
>>>It might be possible, to detect (guess) cases, that are the real inner loops.
>>>That should practically get rid of "CMOV is long instruction; short branch is
>>>shorter, so program with less CMOVs fits better into cache".
>>
>>For lot of programs there are no inner loops where program spends majority of
>>its time. Examples are operating systems, databases, compilers, office
>>applications, etc. For such programs profile is flat -- i.e. you don't have 3
>>functions that collectively use (say) 70% of total execution time. Instead you
>>have 150 "hot" functions, where hottest takes less than 2% of total time, and
>>majority take 0.5-1%. There are no hot loops -- typical number of iterations is
>>less than 5. That is very compiler-unfriendly situation; you cannot just
>>generate locally optimal code ignoring its size. For such cases best approach is
>>to generate reasonable (though not locally fastest) code, and carefully try to
>>generate smallest possible code. You have to carefuly limit optimizations
>>increasing code size such as loop unrolling, inlining, etc.
>>
>>That is exactly situation where Visual C shines -- our main customer is
>>Microsoft itself. Windows, IE, MS Office, MS SQL Server, etc. are all such
>>applications. I heard one anecdotal evidence about customer who was unhappy with
>>code we generates for their server application, so they spend lot of resources
>>migrating to the compiler provided by (famous) CPU vendor. Resulting executable
>>run much slower than executable produced by Visual C; difference was tens of
>>percents.
>>
>>Generating CMOVs for such applications can grow code size by (say) 1-2%. It
>>happens that less branch mispredicts does not compensate for more frequent
>>I-cache misses.
>>
>>(The most pathological case I analyzed was one of the SpecInt2k benchmarks.
>>Hottest function in the program is huge recursive function. It containes one
>>loop. That loop contains switch statement, and there are lot of recursive calls
>>inside that switch. Average number of loop iterations is less than 2. As a
>>result, the more code you hoist out of the loop, the slower program becomes --
>>function prologue is (almost) hottest code in the function).
>>
>>>About "there is no 8-bit form of CMOV". If that really hurts, it should be
>>>rather easy, to only use them in the other cases?
>>
>>That is exactly what compilers are doing. I was just pointing that there are
>>omissions making CMOVs less useful.
>>
>>>Thanks for your interesting input,
>>>Dieter
>
>Are you sure that the only reasons to not use CMOV's are the above reasons.
>Isn't another real important reason the hard fact that it's just 2 cycles at AMD
>hardware versus 7 cycles at Prescott?
>
>That means a 3.5 times speed advantage of AMD over Intel.


cmov reg32,reg32 latenvy is even only one cycle direct path according to the
optimization guide. But like adc/sbb it is dependend on three operands.
I guess Agner Fog's statement for cmov on P4 is even true for AMD64.

---------------------------------------------------------------------------
The misprediction penalty for a branch may be so high that it is advantageous to
replace it with conditional moves even when it costs several extra instructions.
But conditional move instructions have two important disadvantages:
1. they make dependence chains longer
2. they introduce an unnecessary dependence on the operand not chosen

A conditional move is waiting for three operands to be ready before it can be
executed: the condition flag and the two move operands. You have to consider if
any of these three operands are likely to be delayed by dependence chains or
cache misses. If the condition flag is available long before the move operands
then you may as well use a branch, because a possible branch misprediction could
be resolved while waiting for the move operands. In situations where you have to
wait long for a move operand that may not be needed after all, the branch may be
faster than the conditional move despite a possible misprediction penalty. The
opposite situation is when the condition flag is delayed while both move
operands are available early. In this situation the conditional move is
preferred over the branch if misprediction is likely.
---------------------------------------------------------------------------

>
>Intel C++ mercilous *does* generate CMOV's when you optimize for other cpu's
>than P4. M$ hardly is generating them, nor having an optimization, as far as i
>know that forces to use them.
>
>Diep is faster with CMOV's than without.

So many conditional assignments on random conditions?

May be not (only) due to miss predicted branches but also due to some typical
branchless msvc(6) cmov-"emulation", using setXX instead of cmovXX?

xor	 eax, eax
test	 ecx, ecx
setge	 al
dec	 eax
and	 eax, -100
add	 eax, 300

instead of

cmp	 ecx, 0
mov	 ecx, 300
mov	 eax, 200
cmovl	 eax, ecx

>
>Some prime searching code i've got here is 2 times faster nearly with CMOV's.
>
>GCC generates fastest code for it of all compilers. Intel c++ comes close when
>using pentium-m optimizations and generating code for P3 compatible type cpu's.
>
>Visual c++ is 50% slower there because it simply doesn't want to generate CMOV's
>*ever*.

Looks like an indication for a lot of branches on random input.

>
>Wintel. ;-)
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.