Computer Chess Club Archives



Subject: Re: 64-bit machines

Author: Matt Taylor

Date: 20:35:05 02/10/03



On February 10, 2003 at 15:09:18, Tom Kerrigan wrote:

>On February 10, 2003 at 02:41:50, Matt Taylor wrote:
>
>>>It can do _static_ reordering, not dynamic.
>>Reordering is reordering. Optimization at compile-time has more potential than
>>optimization at run-time. Run-time reordering has limited foresight.
>
>More potential, limited foresight, blah blah blah. No matter how many vague
>notions you attribute to IA-64, you still can't explain why it's not faster
>per-clock than several similarly-clocked OOO chips. Arguing with you about this
>is worthless.

So you're catching on: when I said I've never used an IA-64, I meant I've never
used an IA-64. I can't explain its performance. I've never used one. I don't
even know how well it performs, since I read one thing and then hear something
completely different from you. Dr. Hyatt has already stated that, by his
numbers, it would take a 4 GHz P4 to equal a 1 GHz McKinley in Crafty.

Arguing practical implementation with me is worthless because I'm not arguing
practical implementation, and I don't intend to. Again, I have never touched an
IA-64. It's just a tad out of my budget.

No, I don't know why the IA-64 SPEC scores are so low. That makes little sense
to me. However, I'm not the one to explain that. I'm not trying to, either. I'm
discussing IA-64 from the docs I've read on it.

>>Dynamic reordering is valuable when you have a few registers so you can kind've
>>sort've make use of the 40 internal registers on IA-32 chips, but IA-64 has
>>many. So what?
>
>OOO is said to increase 21264 performance by 30%. The 21264, BTW, has 32
>registers and 40 reorder registers.

And the 21264 is not a VLIW CPU. I'm going to have to reiterate that the design
philosophy behind VLIW is that the compiler makes those optimizations so no OOO
logic is necessary for performance.

The principle is pretty clear here. If you have 8 registers, static reordering
is messy because you have a much more limited window in which to schedule
things. If you have 32 registers, static reordering can be quite valuable.
IA-64 has even more, obviously. The compiler is free to keep more intermediate
results in registers, which allows it to precompute data more effectively.
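The effect can be sketched in C (an illustrative example of the general principle, not anything taken from the IA-64 docs): with enough spare registers, a compiler can keep several independent partial results live and overlap their updates, while with only 8 registers the same transformation quickly runs out of room.

```c
#include <stddef.h>

/* Naive sum: every addition depends on the previous one, so the
 * static schedule is a single serial chain no matter how many
 * registers are available. */
long sum_serial(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* With four independent accumulators, the compiler can keep four
 * partial sums live in registers and schedule the additions in
 * parallel. The transformation costs registers: easy with 32 or
 * 128 architectural registers, much harder with 8. */
long sum_unrolled(const long *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* handle the remaining elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Both functions compute the same sum; only the shape of the dependency chains differs.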

>>Yes. It appears I was looking at a 32-bit Sparc machine. I was reading a paper
>
>Have any 32 bit SPARCs been made since 1995?

The paper I was reading dated from around '98 and compared a number of CPUs,
including a Sparc and a Pentium 2. I work in software rather than hardware, and
I care more about the Sparc ISA than about which Sparc is which. The timings
listed in the paper showed the 32-bit Sparc falling behind a Pentium 2
clock-for-clock.

>>It seems the SPEC scores are generally higher on chips with more cache, and the
>>only McKinley score listed has a 1.5 MB L3 cache.
>
>I can't seem to access SPEC scores right now, but what's the point of a
>super-awesome post-RISC ISA if it's just going to get beat by chips with more
>cache? And if cache really is the limiting factor in McKinley's performance
>here, it must be idle a significant amount of time, which reduces IPC and means
>HT would be beneficial.

McKinley is available with more than 1.5 MB of L3 cache, but no test results
were available for the 3 MB versions. It would still not be at the level where
it ought to be, but judging by the cache effects on other CPUs, it would
eclipse the 21364.

A more limited form of HT would benefit IA-64, perhaps. When I first read about
HT some 3 years ago, Intel's main claim was that it would utilize idle execution
units in the existing instruction stream -- i.e. both threads actually executing
concurrently, not swapping as they stall. That much is pointless for a VLIW
processor.

The IA-64 would be able to take advantage of idle periods using HT. However,
Intel's goal is obviously to avoid those idle periods in the first place
through prefetching.

>>Again, I have no actual experience with an IA-64 machine because they're rather
>>expensive. I can only rely on what I've read. I have never read anything about
>>low IPC on IA-64. Please offer some evidence/article.
>
>It can still be relatively high and benefit from HT.

True, but why are we discussing HT now?

>>In compiler-generated code, my Athlon tends to retire closer to 2 instructions
>>per clock. I would assume that McKinley does better. The restrictions really
>
>Which tool are you using to measure that?

It is an observation.

I spend a good amount of time poring over VC-generated code. It is not always
brilliant, but a lot of it pairs quite well. Some of the old Pentium pairing
rules still apply to the Athlon, but the Athlon can also squeeze in another
instruction. Getting 3 IPC in VC-generated code is rare; 2, however, is not.

Occasionally I get to look at GCC's code. The few times I have compared VC's
output with GCC's, GCC always did a tad better. I was not willing to admit that
until yesterday, when I observed that GCC emits cmov in ordinary code when
optimizing for Pentium Pro/K6-2 or higher. I have never seen VC emit cmov; to
the best of my knowledge, it never does.
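For illustration, here is the kind of data-dependent select that GCC will typically compile to a cmov at -O2 on a Pentium Pro-or-later target (exact codegen varies by compiler and flags; the function names are mine):

```c
/* A simple select. With gcc -O2 targeting i686 or later, this
 * usually becomes cmp + cmovge rather than a conditional branch,
 * avoiding a potential branch misprediction. */
int imax(int a, int b)
{
    return a >= b ? a : b;
}

/* An arithmetic alternative for when cmov is unavailable:
 * build an all-ones or all-zeros mask from the comparison. */
int imax_branchless(int a, int b)
{
    int mask = -(a >= b);          /* -1 if a >= b, else 0 */
    return (a & mask) | (b & ~mask);
}
```

Both are branch-free at the machine level; the first relies on the compiler choosing cmov, the second forces the issue in plain arithmetic.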

Anyway, the point is that compiler-generated code can execute very efficiently
on the Athlon. Barring code specifically constructed to produce data-dependency
stalls, most instructions can freely pair. The only really nasty ones are push,
pop, and lea, and lea usually isn't a big deal.

>>>>ignoring the Intel C exception of using scalar SSE -- not useful to chess
>>>>programs, not very good justification of SSE either when they could have
>>>>introduced new flat-register FP instructions.)
>>Original SSE is flat-register FP. SSE 2 allows double-precision FP computation.
>
>How do you make these two statements agree?
>
>-Tom

I made several assertions, and I see no inconsistency:
1. Scalar SSE does not benefit chess programs.
2. Implementing both vector and scalar SSE is less useful to compilers than a
flat-register x87. Scalar SSE conveniently offers flat-register FP, but
backward compatibility would have made a flat-register x87 extension more
attractive for scalar-only use.
3. Original SSE is single-precision flat-register FP (vector and scalar).
4. SSE 2 adds double-precision extensions to the original SSE. (Also integer
extensions, which I did not mention, but those are irrelevant for a compiler.)

I do not regard the IA-32 vector extensions very highly, as most
implementations are slow, and using them requires porting code rather than
changing compiler switches. They still benefit real SIMD algorithms, but they
are a nuisance to code.

Recently someone asked me to help write a Pentium 4-optimized 128-bit add. Even
considering the 8-cycle latency of the adc instruction, chaining adc's was
roughly as fast as doing two parallel 64-bit adds in SSE and then adjusting the
high 64-bit half by either 0 or 1.
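The carry-adjust scheme can be sketched portably in C (a sketch of the technique, not the actual SSE code in question): add the two 64-bit halves independently, then bump the high half by one when the low addition wrapped.

```c
#include <stdint.h>

/* A 128-bit value as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128;

/* Add the halves independently, then adjust the high half by 0 or 1.
 * This is exactly the fix-up the SSE version must do, because paddq
 * performs two independent 64-bit adds with no carry between lanes.
 * (A chain of adc instructions does the same job in scalar code,
 * with the carry flag propagating automatically.) */
u128 add128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo); /* carry out of the low half */
    return r;
}
```

The `r.lo < a.lo` test detects unsigned wraparound: the low sum is smaller than an operand exactly when a carry was produced.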

On the Athlon, most MMX instructions take 2 cycles to retire; others take more.
That is complemented by an issue rate of 4 ops/cycle -- and who is going to
have 7 independent instructions to insert between uses of a particular piece of
data? Who even has 1? Sometimes it works out, simply because the work is
parallel or because the integer units can be used in combination. Other times
MMX offers no speed gain at all. 3DNow/SSE usually do offer a performance
boost, but there are still limits on how much of a boost can be achieved.

-Matt




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.