Author: Matt Taylor
Date: 20:35:05 02/10/03
On February 10, 2003 at 15:09:18, Tom Kerrigan wrote:

>On February 10, 2003 at 02:41:50, Matt Taylor wrote:
>
>>>It can do _static_ reordering, not dynamic.
>>
>>Reordering is reordering. Optimization at compile-time has more potential than
>>optimization at run-time. Run-time reordering has limited foresight.
>
>More potential, limited foresight, blah blah blah. No matter how many vague
>notions you attribute to IA-64, you still can't explain why it's not faster
>per-clock than several similarly-clocked OOO chips. Arguing with you about this
>is worthless.

So you're catching on: when I said I've never used an IA-64, I meant exactly that. I can't explain its measured performance because I've never used one, and I don't even know how well it performs -- I read one thing and then hear something completely different from you. Dr. Hyatt has already stated that, by his numbers, it would take a 4 GHz P4 to equal a 1 GHz McKinley in Crafty. Arguing practical implementation with me is pointless because I'm not arguing practical implementation, and I don't intend to. Again, I have never touched an IA-64; it's just a tad out of my budget.

No, I don't know why the IA-64 SPEC scores are so low. That makes little sense to me, but I'm not the one to explain it, and I'm not trying to. I'm discussing IA-64 from the docs I've read on it.

>>Dynamic reordering is valuable when you have a few registers, so you can kind of
>>sort of make use of the 40 internal registers on IA-32 chips, but IA-64 has
>>many. So what?
>
>OOO is said to increase 21264 performance by 30%. The 21264, BTW, has 32
>registers and 40 reorder registers.

And the 21264 is not a VLIW CPU. I'm going to have to reiterate that the design philosophy behind VLIW is that the compiler makes those scheduling decisions, so no OOO logic is necessary for performance. The principle is pretty clear here.
If you have 8 registers, static reordering is messy because you have a much more limited window in which to schedule things. If you have 32 registers, static reordering can be quite valuable, and IA-64 obviously has even more. The compiler is free to keep more intermediate results live, which lets it precompute data more effectively.

>>Yes. It appears I was looking at a 32-bit Sparc machine. I was reading a paper
>
>Have any 32 bit SPARCs been made since 1995?

The paper I was reading dated from around '98 and compared a number of CPUs, including a Sparc and a Pentium 2. I work in software rather than hardware, and I care more about the Sparc ISA than about which Sparc is which. The timings listed in the paper showed the 32-bit Sparc falling behind a Pentium 2 clock-for-clock.

>>It seems the SPEC scores are generally higher on chips with more cache, and the
>>only McKinley score listed has a 1.5 MB L3 cache.
>
>I can't seem to access SPEC scores right now, but what's the point of a
>super-awesome post-RISC ISA if it's just going to get beat by chips with more
>cache? And if cache really is the limiting factor in McKinley's performance
>here, it must be idle a significant amount of time, which reduces IPC and means
>HT would be beneficial.

McKinley comes with more than 1.5 MB of L3 cache; no test results were available for the 3 MB versions. It would still not be at the level where it ought to be, but judging by the cache effects on other CPUs, it would eclipse the 21364.

A more limited form of HT would perhaps benefit IA-64. When I first read about HT some 3 years ago, Intel's main claim was that it would utilize execution units left idle by the existing instruction stream -- i.e. both threads actually executing concurrently, not swapping as they stall. That much is pointless for a VLIW processor. IA-64 could use HT to cover idle periods, but Intel's goal is obviously to avoid those periods with prefetching.
>>Again, I have no actual experience with an IA-64 machine because they're rather
>>expensive. I can only rely on what I've read. I have never read anything about
>>low IPC on IA-64. Please offer some evidence/article.
>
>It can still be relatively high and benefit from HT.

True, but why are we discussing HT now?

>>In compiler-generated code, my Athlon tends to retire closer to 2 instructions
>>per clock. I would assume that McKinley does better. The restrictions really
>
>Which tool are you using to measure that?

It is an observation. I spend a good amount of time poring over VC-generated code. It is not always brilliant, but a lot of it pairs quite well. Some of the old Pentium pairing rules still apply to the Athlon, but the Athlon can also squeeze in another instruction. Getting 3 IPC in VC-generated code is rare; 2, however, is not.

Occasionally I get to look at GCC's code. The few times I have compared VC's output with GCC's, GCC always did a tad better. I did not admit that until yesterday, when I observed that GCC emits cmov in regular code when optimizing for the Pentium Pro/K6-2 or higher. I have never seen VC emit cmov; to the best of my knowledge, it never does.

Anyway, the point is that compiler-generated code can execute very efficiently on the Athlon. Barring code specifically constructed to produce data-dependency stalls, most instructions can freely pair. The only really nasty ones are push, pop, and lea, and lea usually isn't a big deal.

>>>>ignoring the Intel C exception of using scalar SSE -- not useful to chess
>>>>programs, not very good justification of SSE either when they could have
>>>>introduced new flat-register FP instructions.)
>>
>>Original SSE is flat-register FP. SSE 2 allows double-precision FP computation.
>
>How do you make these two statements agree?
>
>-Tom

I made several assertions, and I see no inconsistency:

1. Scalar SSE does not benefit chess.
2. Implementing vector plus scalar SSE is less useful to compilers than a flat-register x87 would have been. Scalar SSE does conveniently offer flat-register FP, but backward compatibility would have made a flat-register x87 extension more attractive for scalar-only work.
3. Original SSE is single-precision flat-register FP (vector and scalar).
4. SSE 2 adds double-precision extensions to the original SSE. (Also integer extensions, which I did not mention, but those are irrelevant to a compiler.)

I do not look upon the IA-32 vector extensions very highly: most implementations are slow, and using them requires porting code rather than just changing compiler switches. They still benefit real SIMD algorithms, but they are a nuisance to code.

Recently someone asked me to help write a Pentium 4-optimized 128-bit add. Even with the 8-cycle latency of the adc instruction, chaining adc's was roughly as fast as doing 2 parallel 64-bit adds in SSE and then adjusting the high 64-bit part by either 0 or 1.

On the Athlon, most MMX instructions take 2 cycles to retire; others take more. Combine that with an issue rate of 4 ops/cycle -- who is going to have 7 independent instructions to insert between uses of a particular piece of data? Who even has 1? Sometimes it works out, simply because the algorithm is parallel or because you can use the integer units in combination. Other times MMX offers no speed gain at all. 3DNow/SSE usually do offer a performance boost, but there are still limits on how much of a boost can be achieved.

-Matt