Computer Chess Club Archives


Subject: Re: some quotes on switch and indirect branches

Author: Vincent Diepeveen

Date: 12:45:49 11/23/05


On November 22, 2005 at 20:13:04, Eugene Nalimov wrote:

>On November 22, 2005 at 19:07:16, Dieter Buerssner wrote:
>
>>On November 21, 2005 at 20:00:24, Eugene Nalimov wrote:
>>
>>>On November 21, 2005 at 18:10:54, Dieter Buerssner wrote:
>>>
>>>>[...]
>>>>I guess, you mean this as a substitution for
>>>>  if (depth < 0)
>>>>    fm = fm1;
>>>>  else
>>>>    fm = fm2;
>>>>
>>>>I am surprised, that compilers are not able to do this themselves. I
>>>
>>>I several times tried to modify Visual C to recognize additional cases where we
>>>should emit conditional moves (last time was probably a year ago for
>>>x64-targeting compiler). Every time I could demonstrate win on a small
>>>artificial test case, but every large real world program either showed no gain
>>>or slowed down.
>>>
>>>I suspect there are several reasons for this:
>>>* branch predictors are good, and majority of branches can be correctly
>>>predicted
>>>* CMOV is long instruction; short branch is shorter, so program with less CMOVs
>>>fits better into cache
>>>* there is no 8-bit form of CMOV
>>>* for invalid address "CMOV reg, memory" will give you access violation even if
>>>condition is false.
>>
>>I don't really understand the last reason.
>>
>>It might be possible, to detect (guess) cases, that are the real inner loops.
>>That should practically get rid of "CMOV is long instruction; short branch is
>>shorter, so program with less CMOVs fits better into cache".
>
>For lot of programs there are no inner loops where program spends majority of
>its time. Examples are operating systems, databases, compilers, office
>applications, etc. For such programs profile is flat -- i.e. you don't have 3
>functions that collectively use (say) 70% of total execution time. Instead you
>have 150 "hot" functions, where hottest takes less than 2% of total time, and
>majority take 0.5-1%. There are no hot loops -- typical number of iterations is
>less than 5. That is very compiler-unfriendly situation; you cannot just
>generate locally optimal code ignoring its size. For such cases best approach is
>to generate reasonable (though not locally fastest) code, and carefully try to
>generate smallest possible code. You have to carefully limit optimizations
>increasing code size such as loop unrolling, inlining, etc.
>
>That is exactly situation where Visual C shines -- our main customer is
>Microsoft itself. Windows, IE, MS Office, MS SQL Server, etc. are all such
>applications. I heard one anecdotal evidence about customer who was unhappy with
>code we generate for their server application, so they spent a lot of resources
>migrating to the compiler provided by (famous) CPU vendor. Resulting executable
>ran much slower than executable produced by Visual C; difference was tens of
>percent.
>
>Generating CMOVs for such applications can grow code size by (say) 1-2%. It
>happens that fewer branch mispredicts do not compensate for more frequent
>I-cache misses.
>
>(The most pathological case I analyzed was one of the SpecInt2k benchmarks.
>Hottest function in the program is huge recursive function. It contains one
>loop. That loop contains switch statement, and there are lot of recursive calls
>inside that switch. Average number of loop iterations is less than 2. As a
>result, the more code you hoist out of the loop, the slower program becomes --
>function prologue is (almost) hottest code in the function).
>
>>About "there is no 8-bit form of CMOV". If that really hurts, it should be
>>rather easy, to only use them in the other cases?
>
>That is exactly what compilers are doing. I was just pointing that there are
>omissions making CMOVs less useful.
>
>>Thanks for your interesting input,
>>Dieter
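
On the last reason quoted above (the "CMOV reg, memory" one): a CMOV with a memory source operand performs the load whether or not the condition holds, so a compiler cannot simply turn a guarded load into one. A minimal sketch in C of what that means at source level; the function names are hypothetical, made up for this illustration:

```c
#include <stddef.h>

/* Branchy form: *p is read only when p is non-null.
   Rewriting this as "load *p, then CMOV" would read through a
   null pointer when p == NULL, so the compiler must keep the branch
   (or prove that p is always valid). */
int load_or_default(const int *p, int fallback)
{
    if (p != NULL)
        return *p;
    return fallback;
}

/* Select the address first, then load once.
   Both candidate addresses are valid, so the select can be done
   register-to-register (possibly with CMOV; that is up to the compiler)
   and the single load is always safe. */
int load_or_default_select(const int *p, int fallback)
{
    const int *src = (p != NULL) ? p : &fallback;
    return *src;
}
```

Both functions return the same values; the second one only moves the condition from the load to the address, which is the part a compiler is not allowed to do by itself in the first form.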

Are you sure that the reasons above are the only reasons to not use CMOV's?
Isn't another really important reason the hard fact that it's just 2 cycles on AMD
hardware versus 7 cycles on Prescott?

For that instruction, that means a 3.5 times speed advantage of AMD over Intel.

Intel C++ mercilessly *does* generate CMOV's when you optimize for other cpu's
than P4. M$ hardly generates them, nor does it have an optimization, as far as i
know, that forces it to use them.

Diep is faster with CMOV's than without.

Some prime searching code i've got here is nearly 2 times faster with CMOV's.

GCC generates fastest code for it of all compilers. Intel c++ comes close when
using pentium-m optimizations and generating code for P3 compatible type cpu's.

Visual c++ is 50% slower there because it simply doesn't want to generate CMOV's
*ever*.

Wintel.
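
For reference, the if/else from the quoted text can be written in a few equivalent ways in C. This is only a sketch with hypothetical function names; whether a given compiler emits a branch or a CMOV for the first two depends on the compiler, its version, the flags and the target cpu, so look at the generated assembly (for example with gcc -O2 -S) before drawing conclusions.

```c
/* Plain branch, as in the quoted snippet. */
int select_branch(int depth, int fm1, int fm2)
{
    int fm;
    if (depth < 0)
        fm = fm1;
    else
        fm = fm2;
    return fm;
}

/* Ternary form: same semantics; the compiler is free to emit
   either a branch or a conditional move here. */
int select_ternary(int depth, int fm1, int fm2)
{
    return (depth < 0) ? fm1 : fm2;
}

/* Mask form: branch-free by construction, needs no CMOV at all.
   Assumes two's complement integers, so that -1 is all one bits. */
int select_mask(int depth, int fm1, int fm2)
{
    int mask = -(depth < 0);            /* -1 if depth < 0, else 0 */
    return (fm1 & mask) | (fm2 & ~mask);
}
```

All three return fm1 when depth is negative and fm2 otherwise. The mask form costs a few more instructions than a single CMOV, but it does not depend on the compiler deciding to generate one.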





Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.