Author: Matt Taylor
Date: 23:41:50 02/09/03
On February 09, 2003 at 23:31:35, Tom Kerrigan wrote:

>On February 09, 2003 at 22:19:12, Matt Taylor wrote:
>
>>The compiler has time to evaluate many different orderings. Furthermore, the
>>compiler has more flexibility in reordering; the processor can only reorder
>
>It can do _static_ reordering, not dynamic.

Reordering is reordering. Optimization at compile-time has more potential than optimization at run-time. Run-time reordering has limited foresight. Dynamic reordering is valuable when you have a few registers so you can kind of, sort of make use of the 40 internal registers on IA-32 chips, but IA-64 has many. So what?

>>>If static scheduling is better than dynamic, why does McKinley deliver fewer
>>>SPECint/GHz than the similarly clocked 21364, SPARC64, and PA-RISC 8700 chips?
>>Actually, last I checked Sparc-64 scored a tad lower than IA-32 at the same
>>clock speed. I was looking just yesterday and noted that Sparc scores were
>>rather low.
>
>Actually, last I checked was 1 second ago:
>http://www.aceshardware.com/SPECmine/index.jsp?b=0&s=2&v=1&if=0&r1f=2&r2f=0&m1f=0&m2f=0&o=0&o=1

Yes. It appears I was looking at a 32-bit Sparc machine. I was reading a paper that quoted the figure, not SPEC scores. The 32-bit SPEC scores for Sparc are similar. It seems the SPEC scores are generally higher on chips with more cache, and the only McKinley score listed has a 1.5 MB L3 cache.

>>You are right that HT is pointless on VLIW chips. How is this a weakness? It
>>means the chip is already efficient enough that HT would not help it. That is
>>the point of VLIW computing! You don't need things like HT because your machine
>>is -already- efficient. Conversely IA-32 is weak in the area of efficiency.
>
>Nonsense. How many times are those IA-64 instruction bundles padded with NOPs
>because there isn't enough ILP to fill them or fit the pairing restrictions?
>Each one of those NOPs means an idle execution unit that could be devoted to
>processing another thread. Also, IA-64 suffers from memory latency just as much
>as the next chip. (Well, moreso, because it's in-order.) All that idle time
>waiting on memory could be spent processing another thread that's probably in
>cache.

Most chips do memory in-order. It's a side-effect of memory-mapped I/O. Again, I have no actual experience with an IA-64 machine because they're rather expensive. I can only rely on what I've read. I have never read anything about low IPC on IA-64. Please offer some evidence/article.

>>IA-64 can do up to 6 instructions per cycle; the best IA-32 offers is 3. Again,
>
>Which doesn't matter. The Pentium III retires about 1.2 instructions per clock
>and McKinley doesn't do much better.

Someone else on the CCC once quoted 1.7 instructions per clock for the Pentium 3. If McKinley doesn't do better, it's the compiler's fault. Ignoring memory latency, I can hit 3 instructions per clock in hand-tuned Athlon assembly.

More registers make it easier for the compiler to optimize. Three-operand instructions also make things easier for the compiler. In compiler-generated code, my Athlon tends to retire closer to 2 instructions per clock. I would assume that McKinley does better. The restrictions really aren't that bad.

>>>That has nothing to do with the high latency of the register file caused by
>>>having so damn many registers.
>>Is this an assumption, or do you have proof? It would be an awkward machine
>>indeed if general register accesses weren't 1-cycle.
>
>What is your explanation of McKinley's low clock speed? The SPARC's huge
>register file is often blamed for its low clock speed.

Then I was ignorant of what you meant. Granted.

>>I did not mean to imply that the entire application fit in 256KB. However,
>
>Imply? That's exactly what you said.

No, I asked how hash tables, which are commonly used, fit in 256 KB.

>>You said 1M for MMX and 2M for SSE. 1+2=3. I lump media instructions together
>>because both MMX and SSE are equally useless to compilers. (Of course I'm
>>ignoring the Intel C exception of using scalar SSE -- not useful to chess
>>programs, not very good justification of SSE either when they could have
>>introduced new flat-register FP instructions.)
>
>SSE2 is flat-register FP instructions...
>
>-Tom

Original SSE is flat-register FP. SSE2 allows double-precision FP computation.

-Matt
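
As a rough illustration of the NOP-padding point raised in the quoted text above, here is a toy model in Python. It is only a sketch under stated assumptions: the function name `pack_bundles`, the fixed width of 3 slots, and the rule that a dependent instruction must go in a later bundle than its producers are all simplifications introduced for this example. Real IA-64 bundles also have template and stop-bit restrictions that are not modelled here.

```python
def pack_bundles(deps, width=3):
    """Greedily pack instructions, in program order, into fixed-width bundles.

    deps[i] lists the indices of earlier instructions that instruction i
    depends on. Toy rule (an assumption, not the IA-64 spec): an instruction
    must sit in a bundle strictly after the bundles of all its dependencies.
    Slots left unfilled count as NOPs.

    Returns (number_of_bundles, number_of_nop_slots).
    """
    bundle_of = []  # bundle index chosen for each instruction
    fill = []       # count of real instructions in each bundle
    for d in deps:
        # Earliest bundle allowed by the data dependencies.
        earliest = max((bundle_of[j] + 1 for j in d), default=0)
        # In-order packing: never go back before the previous instruction.
        if bundle_of:
            earliest = max(earliest, bundle_of[-1])
        b = earliest
        # Skip bundles that are already full.
        while b < len(fill) and fill[b] >= width:
            b += 1
        # Open a new bundle if needed.
        while len(fill) <= b:
            fill.append(0)
        fill[b] += 1
        bundle_of.append(b)
    nops = sum(width - f for f in fill)
    return len(fill), nops


if __name__ == "__main__":
    independent = [[] for _ in range(6)]       # plenty of ILP
    chain = [[], [0], [1], [2], [3], [4]]      # no ILP at all
    print(pack_bundles(independent))
    print(pack_bundles(chain))
```

In this model, six independent instructions fill two bundles with no NOPs, while a six-deep dependency chain needs six bundles and leaves twelve slots empty. How close real compiled code comes to either extreme is exactly what the two posters disagree about, and the sketch does not settle it.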
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.