Computer Chess Club Archives



Subject: Re: GCC annihilating VISUAL C++ ==> branchless code in 2003?

Author: Matt Taylor

Date: 12:53:34 02/28/03



On February 28, 2003 at 13:58:48, Vincent Diepeveen wrote:

>On February 28, 2003 at 11:13:03, Matt Taylor wrote:
>
>>On February 28, 2003 at 08:59:08, Vincent Diepeveen wrote:
>>
>>>On February 27, 2003 at 15:35:34, Russell Reagan wrote:
>>>
>><snip>
>>>I am bad however in reading gcc generated assembly (it looks SO VERY UGLY,
>>>similar to the new PGN format of chessbase) and it seems to me it is
>>>possible that this code can be further optimized. I see no need to put the
>>>board pointer in eax each time. It's using just 2 registers versus very old
>>>MSVC is already using 3.
>>>
>>>Means that at the Opteron and Itanium2 and such processors with more than 8
>>>GPRs, the GCC compiler will suck major ass of course. It doesn't even know how
>>>to use more than 2 registers!
>>>
>>>But in this example it is doing things *branchless*.
>>>
>>>So i can't actually wait for a visual c++ edition to use CMOV* instructions
>>>and using profile info to optimize branches.
>>>
>>>So in 1 small example we see both the strength of the new generations of
>>>processors released after 1996 (pentiumpro/klamath and newer) and the
>>>weakness of the software (visual c++ 6.0 despite pentiumpro released
>>>in 1996 already still with service packs not using P6 instructions) and the
>>>general inefficiency of the GNU world who isn't using "640KB should be enough
>>>RAM", but instead still is using the lemma "2 registers will do".
>>>
>>>Best regards,
>>>Vincent Diepeveen
>>>diep@xs4all.nl
>>
>>Actually using fewer registers is generally regarded as more optimized. I'm sure
>
>less instructions within the 'invariant' (i fear it might be a dutch word of a
>dutch professor who theoretically proved software and 'invariant' is describing
>all instructions which are getting executed within a loop) is excellent of
>course. Not doing the loading of the pointer within the invariant is trivially
>faster for most loops.

I hope you mean that moving the loop-invariant work (here, loading the board
pointer) out of the loop is faster.
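
To make that concrete, here is a minimal C sketch of hoisting a pointer load out of a loop. The struct and function names (position, count_nonempty_reload, count_nonempty_hoisted) are made up for illustration and are not taken from DIEP or any real engine. An optimizing compiler can often do this hoist itself when it can prove nothing in the loop changes the pointer.

```c
#include <stddef.h>

/* Hypothetical board structure, for illustration only. */
struct position {
    int *board;   /* points at 64 squares; 0 means empty */
};

/* As written, pos->board is read on every iteration. */
int count_nonempty_reload(const struct position *pos)
{
    int count = 0;
    for (size_t sq = 0; sq < 64; sq++) {
        if (pos->board[sq] != 0)   /* pointer load sits inside the loop */
            count++;
    }
    return count;
}

/* Same result, with the invariant pointer load hoisted by hand. */
int count_nonempty_hoisted(const struct position *pos)
{
    const int *board = pos->board;  /* loaded once, before the loop */
    int count = 0;
    for (size_t sq = 0; sq < 64; sq++)
        count += (board[sq] != 0);  /* comparison yields 0 or 1 */
    return count;
}
```

Both functions return the same count. The second one costs one more live register (board) for the duration of the loop in exchange for one less load per iteration.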

>>that on architectures with billions of registers like Itanium GCC will do just
>>fine.
>
>fine is a relative statement. I would say horrible. I am very sure GCC's
>excellent achievements now for DIEP at the k7 is a temporarily victory and
>showing very clearly AMD needs its own compiler team. If GCC's victory is not
>limited to the K7 then the other compilers would suck ass for 64 bit processors
>and they will perform worse than a PII at the same clockspeed would do.

AMD doesn't have a budget as big as Intel's. Yes, I think it would be great if
AMD had their own compiler team. Considering they have been losing millions of
dollars each quarter, do you think they're very likely to start one soon?

Itanium is -completely- different from x86. I have never had an Itanium on my
desk to play with, and I don't know how GCC or Intel C perform on it. I would
still bet that Intel C is the fastest compiler for Itanium. However, that has
absolutely nothing to do with the K7 or any other x86 processor; it has
everything to do with GCC's optimizer for the Itanium. Optimization for Itanium
revolves around instruction scheduling and branch prediction.

>The more registers a processor has the more problems GCC gets into, *trivially*.

What are you talking about? Comparing x86 performance tells you -nothing- about
how the compiler works for other architectures. Most compiler/architecture
people are convinced that more registers help the optimizer generate faster
code.

One common technique for doing register allocation optimization is to allow your
IL (intermediate language) to define an effectively infinite number (say, 4.2
billion) of machine registers. Every variable and every computation goes into a
register. When the IL is translated into machine language for a target machine,
the optimizer reduces the number of concurrently used registers (by storing
variables in memory, throwing away computations, etc.) in the IL until it is
equal to or below the number that the machine supports. An optimizer employing
this technique would work -better- on a machine with more registers.

-Matt




Last modified: Thu, 15 Apr 21 08:11:13 -0700
