Computer Chess Club Archives



Subject: Re: IA-64 vs OOOE (attn Taylor, Hyatt)

Author: Matt Taylor

Date: 00:03:56 02/12/03



On February 12, 2003 at 00:37:13, Robert Hyatt wrote:

>On February 11, 2003 at 23:24:43, Tom Kerrigan wrote:
>
>>On February 11, 2003 at 22:39:48, Robert Hyatt wrote:
>>
>>>Your explanation was not bad, but your "no compiler can do this" is dead
>>>wrong.  Visit Cray Research, search for their CFT compiler (or their C
>>>compiler) and see if you can find some papers on their optimizing.
>>>They _do_ exactly what you describe.  They "lift" (or "hoist") instructions
>>>way back up in the instruction stream so that values are available when needed,
>>>which is _exactly_ what your OOO approach is doing in the hardware.
>>
>>They must be doing this according to static branch prediction, which is maybe
>>80% accurate, not > 90%, and all compilers have scope boundaries for this sort
>>of stuff, i.e., at loops or functions. OOOE has no such restrictions. It's just
>>a stream of instructions.
>>
>No No No.  They do much of this with 100% accuracy.  Because they make sure
>that the critical instructions are executed in _every_ path that reaches a
>critical point in the data-flow analysis of the program (the dependency graph
>for gcc users)...
>
>BTW OOOE has a huge limit.  Something like 40-60 (I don't have my P6/etc
>manuals here at home) micro-ops in the reorder buffer.  No way to do any
>OOOE beyond that very narrow peephole, while the compiler can see _much_
>more, as much as it wants (and has the compile time) to look at...

72 macro-ops on Athlon in the ICU, 18 in the integer scheduler, and 36 in the FP
scheduler, and that was "revolutionary" for IA-32 processors.

The P6 holds 40 micro-ops in the ROB. Fortunately it didn't matter much for the
P6 because its decoders were painfully limited -- only 1 instruction per cycle
unless the instructions after the first decoded one were simple, single-uop
instructions (the 4-1-1 decode pattern).

Athlon has 3 full decoders. I recently looked more into No Predecode stalls --
the mechanism is remarkably similar to Intel's trace cache idea. When the
processor fetches instructions, it decodes instruction width and other
information and stores it in the instruction cache alongside the code bytes
themselves. With this predecode information, Athlon is able to decode at a rate
of up to 3 instructions/cycle. Without it, decoding grinds nearly to a halt. I
don't know exactly how fast it can decode without predecode info, but I suspect
it is near 1 instruction/cycle.

>Someone posted an example of such code (I think it was Matt) showing
>Vincent how to dump branches.  That is the idea here.  The advantage of
>OOO execution is still around, but it is not as significant.  This being
>the fact that the _real_ code doesn't get bigger, while when the compiler
>is busy doing these same kinds of optimizations, it is replicating various
>instructions to be sure they are completed by the time the DG says the
>result is going to be needed.  So there is a bit of memory savings when the
>processor does the OOO stuff, and there is the advantage of exposing more
>registers when the real instructions get turned into micro-ops...  but at
>least the latter is more a result of a horrible architecture (8 registers)
>as opposed to the fact the OOO execution is a huge boon for other architectures
>that are not so register-challenged...

IA-32 and IA-64 have slightly different strategies. The stuff I was showing
Vincent was pretty cool -- you basically do the comparison, get a 0 or 1 with
setcc, and use that value to no-op the computation. Usually this involves
negating the 0/1 to get a mask of all 0s or all 1s. Decrement is better if you
can invert the comparison. That's not always possible, it's not always easy,
and the data dependency means you have to schedule aggressively. Conditional
move helps solve some problems, but it also has limitations.

Predication makes it -much- easier. Ideally you could predicate function calls,
though I don't think IA-64 allows this. I have not checked.

>>>I would not say that either is particularly "better".  They are "different"
>>>with different approaches to the same problem.  The advantage of a ia64-type
>>>approach is that you can stretch the VLIW approach quite a ways, while it
>>>gets harder and harder to do it in an OOO architecture.  You end up with more
>>>hardware in the reorder buffer logic than you have in the actual pipelines
>>>that do the real computation.
>>
>>Is that causing a problem other than offending some people's sensibilities? The
>>EV8 was going to have 8 int ALUs and it would have been perfectly viable with
>>today's processes.
>
>Sure.  But given the choice of OOOE with 8 int alus, or no OOOE with 16
>int alus and an instruction package large enough to feed them all, I would
>consider the latter seriously...
>
>
>>>Perhaps.  However the non-OOO Cray has always been right at the top of the
>>>overall performance heap, so that approach can fly as well and it has certainly
>>
>>I don't know much about Crays but a friend of mine told me that he ran some
>>uniprocessor tests on a Cray and it was roughly as fast as a fast 486. Have any
>>Crays been built in the last several years using actual Cray processors?
>
>Your friend was nuts.  The first Cray-1 at 12.5ns clock would blow off any
>486 ever made.  That machine could do 80 mips, which would not be bad for
>a 486.  But it could do 6-8 _operations_ per clock cycle, and those are
>64 bit floating point operations.  The 486 has no chance.
>
>The Cray T932 was the last 64 bit machine they built that I used.  And it
>can produce a FLOP count that no PC on the planet can come within a factor of
>10 of and that is being very generous.  2ns clock, 32 cpus, each cpu can read
>four words and write two words to memory per clock cycle, and with vector
>chaining, it can do at _least_ eight floating point operations per cycle per
>CPU.

For comparison, using SSE allows Athlon to achieve 3.8 FLOPs/cycle. That's 8.55
GFLOP/s on the 2.25 GHz AthlonXP 2800. I haven't studied SSE timings on the
Pentium 4; its clock may allow it to hit a higher theoretical rate.

>>>>As for VC not generating cmovs, Matt, using cmovs on the P6 is slower than not
>>>>using them. That's why they're not generated.
>>>Under what circumstance?  It is possible to have totally unpredictable
>>
>>Under the circumstance of running on a P6, like I said. The P6 has no real
>>support for cmovs; it microcodes them down to several uops and the whole process
>>takes longer than if you just did a branch.
>
>I did a branchless FirstOne() in asm a few weeks back here, just to test.
>It used a cmov, and it wasn't slower than the one with a branch.  If the
>branch is accurately predictable, they should probably break even.  If the
>branch is not very predictable, then the cmov should be faster...  I don't
>have the details handy, but I can't imagine it turning into "several" uops.
>two or three, perhaps, assuming we are not talking about a memory reference
>thrown in which would add one.

The P6 optimization manual says 2 uops for the reg-reg form; an extra load uop
is required for reg-mem. The bsf instruction, which is regarded as rather fast
on the P6, also breaks down to 2 uops for reg-reg. Technically everything is
microcoded...

>>>And I believe the VC _will_ produce CMOV instructions, but you have to
>>
>>Eugene can speak to this better than I can, but I don't think VC ever produces
>>cmovs regardless of how you configure it.
>>
>>-Tom
>
>
>I'm not sure why, if that is true.  The PIV has an even longer pipeline
>with a larger branch misprediction penalty...

I've never gotten VC to generate cmov, even after enabling every optimization
flag I could find. It will generate setcc, which is often just as useful. I was
sorely disappointed to discover that the latency of setcc on the Pentium 4 is 5
cycles, and it can only be issued once every 1.5 cycles.

You once showed me some code that Eugene optimized. It was similar to the setnz
I was using to conditionally address the low or high half of a 64-bit variable
in memory, but it used sbb to set the register to 0 or -1. I was largely
ignorant of the gory details on the P4 until recently -- it was pointed out to
me that the adc/sbb instructions have a latency of 8 cycles in reg-reg form.
(Oddly enough, they have a latency of 6 cycles in reg-imm form.) Yuck.

-Matt


