Author: Matt Taylor
Date: 16:36:48 02/09/03
On February 09, 2003 at 03:21:45, Tom Kerrigan wrote:

>On February 09, 2003 at 00:14:46, Matt Taylor wrote:
>
>>On February 07, 2003 at 08:09:23, Tom Kerrigan wrote:
>>
>>>On February 07, 2003 at 03:10:46, Matt Taylor wrote:
>>>
>>>>There is another subtle difference, too; IA-64 is heavily optimized in software
>>>>whereas IA-32 is heavily optimized in hardware. In IA-64 it is possible to
>>>>achieve rates closer to the theoretical 6 instructions per clock than it is on
>>>>IA-32.
>>>
>>>Possibly only because it runs at a much lower clock speed.
>>
>>Um, possibly because that is the philosophy in VLIW chip design...
>>
>>I stick a bunch of execution units (carefully picked, of course) in my CPU, just
>>as I would if I were building the next Pentium. The difference is that I don't
>>waste a lot of transistors on reordering and such to get more parallelism; I
>>just let the compiler optimize for my specific mix.
>>
>>IA-64 comes much closer to theoretical speed because of things like predication
>>and its loop counter. (Plus it uses a register stack like Sparc.)
>
>You're assuming that software scheduling does a better job than hardware
>scheduling but you have no data to back up that assumption. Prefetching and
>predication are very poor substitutes for out-of-order execution. They make
>writing software (or at least compilers) more difficult and they often waste
>valuable memory bandwidth and execution units.

The Pentium 4 makes a pretty good counter-argument here. The compiler also has a temporal advantage: it does not have to produce a result in real time, and that extra time lets it examine a far broader range of optimizations than hardware ever could.

I'm not sure why you look down on predication; is branch misprediction a better solution? -Many- branches are very short and involve similar computation on both paths (a short sketch below illustrates the pattern). I did not even mention prefetching because it is already present in IA-32 (since the K6-2 and Pentium III), and yes, it is difficult for a compiler to take advantage of it.

I'm not sure how you conclude that predication is difficult to implement, either; I can see easy ways to do it. Nor do I see how predication wastes memory bandwidth: a small branch issues the same instructions plus the branches themselves, and both paths usually get loaded as part of the same cache line. The predicated version uses fewer instructions and actually reduces the bandwidth requirement.

>As for the SPARC register stack, it's widely accepted that it doesn't
>significantly improve performance and it makes the register file big enough to
>hurt clock speed (which is one of the main reasons why IA-64 chips are clocked
>so slow). It all but prevents register file duplication or caching, like in
>Alphas...

As I understand it, IA-64 uses a register stack like SPARC's, but unlike SPARC it has hardware that spills and fills the stack automatically. That is a whole lot more efficient.

>>No, actually. I have never used a McKinley; I've only seen it on paper. Still,
>>the P4 3.06 GHz has 512K of L2 cache, and the McKinley has 3 or 6 MB. Now I
>>can't remember whether 6 MB is Itanium-III or McKinley.
>
>Doesn't matter for computer chess. Every program I know about (with the
>exception of HIARCS) has a working set of < 256k.

Code and data all fit in 256 KB? Impressive. I rarely see that even in programs an order of magnitude less complex. No hash tables?
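For a rough sense of scale, here is a minimal back-of-the-envelope sketch in C. The entry layout and table size are illustrative assumptions on my part, not figures from any particular engine:

    #include <stdio.h>

    /* Illustrative transposition-table entry; the field sizes are
       assumptions for the sketch, not taken from any real program. */
    struct tt_entry {
        unsigned long long key;   /* 8 bytes: Zobrist hash of the position */
        short score;              /* 2 bytes */
        short move;               /* 2 bytes */
        unsigned char depth;      /* 1 byte  */
        unsigned char flags;      /* 1 byte: bound type, age, etc. */
    };                            /* rounds up to 16 bytes with padding */

    int main(void) {
        /* Even a modest 1M-entry table is 16 MB -- 64 times 256 KB. */
        size_t entries = 1u << 20;
        printf("table size: %zu KB\n",
               entries * sizeof(struct tt_entry) / 1024);
        return 0;
    }

Even that small table is 64 times the 256 KB figure, so any such working-set number has to be excluding the hash tables.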
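And here is the predication sketch promised above, in C rather than IA-64 assembly for readability. The point is that when both sides of a short branch do similar work, the compiler can compute both and select one result instead of branching; whether a given compiler actually emits CMOV (IA-32) or predicated instructions (IA-64) for the second form is compiler-dependent, not guaranteed:

    /* Branchy form: a short, data-dependent branch that the branch
       predictor can easily get wrong on random inputs. */
    int abs_branch(int x) {
        if (x < 0)
            return -x;
        return x;
    }

    /* Branchless form: both outcomes are computed and one is selected.
       On IA-32 a compiler may emit CMOVcc here; on IA-64 the same idea
       is expressed with predicate registers on the instructions
       themselves, so no branch is issued at all. */
    int abs_select(int x) {
        return (x < 0) ? -x : x;
    }

Both arms fit in the same cache line either way; the second form just drops the branch instructions and the misprediction risk.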
>>>>significant portions of the CPU core are dedicated to MMX/SSE and no compiler
>>>>can generate MMX/SSE code, but an astute assembly programmer can write code
>>>
>>>The Intel compiler can generate SSE2 (instead of x87) for floating point
>>>calculations. I believe gcc has library functions that make use of MMX.
>>
>>This is not the same as saying "the compiler can vectorize code." I can
>
>Right. You said generate MMX/SSE code, not vectorize code.

They are generally the same. With the exception of SSE scalar code, the instruction sets are all vector operations.

>>MMX alone eats more than 10% of an older Athlon die -- about 4M transistors on a
>>42M transistor chip. 10% is pretty significant.
>
>Where did you get that number?

I was wrong about the 42M transistor count; Palomino is 37.5M. I think the 42M figure was for the Willamette (or Northwood?) P4. I read it a while back, and the article may have been discussing die size increases; I really don't remember. Still, 4M transistors is on the same order as the 3M figure you posted, so it's reasonable. A much larger part of the die is the cache, but in terms of the execution units themselves, MMX/SSE are extremely significant.

In the case of the Athlon, SSE and 3DNow! are really the same thing. The transition from Thunderbird to Palomino added prefetching logic and SSE, yet the transistor count only increased by 0.5M. I don't have any data to support it, but I suspect that SSE ops get processed as two internal 3DNow! ops; an SSE instruction has twice the latency of its equivalent 3DNow! instruction.

-Matt
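P.S. To make the vector/scalar distinction concrete, a minimal C sketch using the SSE intrinsics from <xmmintrin.h>. The alignment and length assumptions are mine, kept simple for brevity:

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Scalar loop: one float add per iteration. */
    void add_scalar(float *dst, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }

    /* Packed version: ADDPS adds four floats at once. Assumes n is a
       multiple of 4 and all three pointers are 16-byte aligned. */
    void add_sse(float *dst, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(dst + i, _mm_add_ps(va, vb));
        }
    }

The packed form does four adds per instruction; getting from the first function to the second, by hand or by compiler, is exactly the "vectorize" step being argued about above.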