Author: Tom Kerrigan
Date: 18:01:26 02/09/03
On February 09, 2003 at 19:36:48, Matt Taylor wrote:

>>You're assuming that software scheduling does a better job than hardware
>>scheduling but you have no data to back up that assumption. Prefetching and
>>predication are very poor substitutes for out-of-order execution. They make
>>writing software (or at least compilers) more difficult and they often waste
>>valuable memory bandwidth and execution units.
>
>Pentium 4 makes a pretty good counter-argument here. The compiler has temporal
>advantages. The compiler does not have to produce a result in real-time. That
>alone should make the conclusion obvious. More time enables the compiler to
>examine a broader range of optimizations.

More time for the compiler to try to simulate out-of-order execution, you mean.

If static scheduling is better than dynamic, why does McKinley deliver fewer
SPECint/GHz than the similarly clocked 21364, SPARC64, and PA-RISC 8700 chips?

Also, IA-64's huge register set and strict pairing rules all but rule out SMT,
which is an incredibly valuable source of ILP.

>I'm not sure why you look down on predication; ...
>I did not even mention prefetching because it is already present

I think both are great and would make great additions to ISAs that don't
already have them. But being _forced_ to use prefetching and predication to
order post-branch instructions ahead of branches because your architecture
doesn't support out-of-order execution is lame.

>either. I can see easy ways to do predication. I'm also not sure how you
>conclude that predication wastes memory bandwidth -- small branches will issue
>the same instructions + branches, and both branches usually get loaded as part
>of the same cache line. The predicated version uses fewer instructions and
>actually reduces the bandwidth requirements.

For code, sure (see the if-conversion sketch after this post). What about
predicated loads?

>As I understand it, IA-64 uses a stack like Sparc, but unlike Sparc it has
>hardware that unwinds the stack automatically. That's a whole lot more
>efficient.

That has nothing to do with the high latency of the register file caused by
having so damn many registers.

>>Doesn't matter for computer chess. Every program I know about (with the
>>exception of HIARCS) has a working set of < 256k.
>
>Code and data all fit in 256 KB? Impressive. I rarely see that even in programs
>an order of magnitude less complex.
>
>No hash tables?

No, don't be stupid. A program's "working set" is the code/data that it
accesses the vast majority of the time. Of course the program accesses
code/data outside of its working set, but infrequently enough that it doesn't
impact performance.

If you run a chess program on a 1.9GHz Pentium 4 and it runs 26% faster than it
does on a 1.5GHz P4, which is the case for most chess programs, you know that
the program's working set is less than 256k, because the CPU core and its 256k
L2 cache are the only things that scale linearly with the CPU's clock speed
(1.9/1.5 is about 1.27, so a ~26% speedup is essentially linear with clock).
And if you're already getting a linear speedup, adding 6MB of L3 cache won't
improve on that. (A sketch of this scaling check also follows after this post.)

>I read it a while back. The article may have been discussing die size increases;
>I really don't remember. 4M transistors is on the same order as the 3M figure
>you posted, so it's reasonable. A much larger part of the die is the cache,

But I didn't post 3M. I said 1M for MMX, and that's including however many
transistors were necessary to double the Pentium's L1 caches.

-Tom
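Below is a minimal C sketch (not from either post) of the if-conversion point
quoted above: a compiler targeting predication, or a conditional-move
instruction, can turn a small branch into a branch-free select, so no branch
instruction is fetched and nothing can be mispredicted. The function names and
values are made up for illustration.

    #include <stdio.h>

    /* Branchy form: "if (in_check) score -= bonus;" compiles to a compare
       plus a conditional branch on most ISAs. */
    static int eval_branchy(int score, int bonus, int in_check)
    {
        if (in_check)
            score -= bonus;
        return score;
    }

    /* Branch-free form: a predicating compiler (or one emitting cmov) can
       turn this into a compare plus one conditionally executed subtract. */
    static int eval_predicated(int score, int bonus, int in_check)
    {
        return score - (in_check ? bonus : 0);
    }

    int main(void)
    {
        printf("%d %d\n", eval_branchy(100, 30, 1), eval_predicated(100, 30, 1));
        return 0;
    }

Both functions compute the same result; the predicated version is the one that
avoids the branch, which is the trade-off the two posts are arguing about.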
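And here is a small, self-contained sketch of the clock-scaling check described
in the working-set paragraph above. The nodes-per-second figures are
hypothetical placeholders; the only point is comparing the observed speedup to
the 1.9/1.5 clock ratio.

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical nodes-per-second numbers; only the ratios matter. */
        double nps_15ghz = 1.00e6;
        double nps_19ghz = 1.26e6;

        double clock_ratio = 1.9 / 1.5;              /* ~1.27 */
        double speedup     = nps_19ghz / nps_15ghz;  /* ~1.26 */

        printf("clock ratio %.3f, observed speedup %.3f\n", clock_ratio, speedup);

        if (speedup >= 0.97 * clock_ratio)
            printf("~linear with clock: working set fits in the core-speed caches\n");
        else
            printf("sub-linear: the program is waiting on memory beyond those caches\n");
        return 0;
    }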