Author: Robert Hyatt
Date: 10:38:25 12/14/02
On December 14, 2002 at 02:01:20, Matt Taylor wrote:

>On December 13, 2002 at 22:51:35, Robert Hyatt wrote:
>
>>On December 13, 2002 at 21:55:09, Matt Taylor wrote:
>>
>>><snip>
>>>>>the problem is you lose time to the ECC and registered features of the
>>>>>memory you need for the dual. of course that's the case for all duals.
>>>>>both K7 MP and Xeon suffer from that regrettably.
>>>>
>>>>That is not true. The duals do _not_ have to have ECC RAM. And it doesn't
>>>>appear to be any slower than non-ECC RAM, although I will be able to test
>>>>that before long as we have some non-ECC machines coming in.
>>>
>>>Actually he is correct about the registered RAM. The "registered" feature
>>>is that it delays longer than unregistered RAM. This is important for
>>>stability. It doesn't affect bandwidth, but it does affect latency.
>>>
>>><snip>
>>>>>With some luck, by the time they release a 3.06GHz Xeon they will have
>>>>>improved the SMT another bit.
>>>>>
>>>>>Seems to me they have been working for years to slowly get that SMT/HT
>>>>>working better.
>>>>
>>>>Not "for years". It was announced as a coming thing a couple of years
>>>>ago, and several vendors have been discussing the idea. And they are
>>>>going to increase the ratio of logical to physical CPUs before long
>>>>also...
>>>
>>>I don't think so. HT won't scale terribly well. I made another post about
>>>that, and I won't reiterate what I said there.
>>>
>>>-Matt
>>
>>I don't see why the two of you make such sweeping generalizations. What is
>>to prevent modifying the L1 cache to spit out 256 bits of data at once?
>>There is nothing internally that can't be improved over time, and the idea
>>of a 4-way hyper-threaded CPU should eventually be just as effective as
>>four completely separate CPUs, although the price should be substantially
>>lower.
>
>It's not really a "sweeping generalization." I was quite specific. I said
>HT won't scale very well.

And that _is_ a sweeping generalization. HT can potentially scale just as
well as adding additional CPUs scales, and probably better, because the
logical CPUs are so tightly coupled that there is a savings in hardware
space/cost that makes them very attractive.

>Currently, cost will prohibit the scaling of HT. You can argue that R&D
>will eventually make it cost-effective, but that doesn't mean it will
>happen tomorrow. Also, the complexity of sharing the cache among threads,
>the size of the cache, and the width of the cache all contribute to cost
>and overall chip complexity. Internally, IA-32 processors are only 64 bits
>wide. This is easy to verify: throughput for single-precision
>floating-point SSE operations is only twice the throughput of the SISD FPU.

The cache isn't really shared among threads. So far as the cache is
concerned there are no threads, just data to feed to the CPU. Also, nothing
says that IA-32 will remain on a 64-bit internal datapath. Just look at the
evolution from the first 32-bit processor to today; those changes are
transparent to the outside world, so Intel can make them whenever it
chooses without affecting existing code. Ditto for making the L1 trace
cache datapath wider to dump more micro-ops at once.

>My comment, however, was based on the fact that HT shares the execution
>units among threads. Neglecting memory accesses briefly, the worst-case
>performance on a P4 is 33% of capacity (100% capacity = 3 u-ops/cycle).
>This means that at most 3 concurrent HT threads will max out the CPU.
>Of course, HT should actually scale a little higher since the threads must
>all pause to access memory occasionally. The question becomes, "How often
>do threads access memory?" A single thread accesses memory quite often,
>but most accesses are cache hits.

Your math is _way_ wrong there. If I do a memory read in one thread, the
CPU is going to sit idle for hundreds of clocks. There is _plenty_ of room
to interlace other threads, particularly when most threads do memory reads
here and there. Your statement is equivalent to saying that a modern
processor should never be asked to run more than one process at a time
because it has only one CPU to use. Yet 99.9% of those processes don't need
the CPU for long periods of time...
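To put a number on "plenty of room": a stall is fully hidden once the other
threads have enough work between them to cover it. A minimal sketch in
Python, with made-up illustrative figures (50 cycles of useful work between
reads, a 300-cycle stall), not measurements:

    # How many threads does it take to hide a memory stall?
    # compute_cycles: useful work a thread does between memory reads
    # stall_cycles:   cycles the thread then waits on the read
    def threads_to_hide_stall(compute_cycles, stall_cycles):
        # while one thread waits, the others can cover the gap
        return 1 + stall_cycles / compute_cycles

    print(threads_to_hide_stall(compute_cycles=50, stall_cycles=300))  # 7.0

The worse the latency gets relative to the compute bursts, the more threads
the core can profitably interleave.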
>Here are some equations for computing the optimum number of HT CPUs:
>
>effective_ipc = ave_ins_between_mem / (ave_ins_between_mem / ave_ipc + mem_latency)
>num_ht_cpus = max_ipc / effective_ipc
>
>Now, some P4/P4 Xeon numbers:
>max_ipc = 3 u-ops/cycle
>ave_ipc = 1 u-op/cycle (assuming worst case)
>ave_ins_between_mem = 3 u-ops (also lower than actual)
>mem_latency = 6 cycles (assuming all accesses hit the L2 cache)

L1 cache isn't _the_ issue. Real memory is. And it is getting worse as 1GB
is the normal memory size now; that is a 2048-to-1 mapping of memory onto a
512KB L2 cache, and far worse onto the L1. Since real programs access
memory frequently, HT will _always_ work, just as multiprogramming works on
a single-CPU machine, because most threads are doing something that takes a
long time to complete (disk I/O or the like for a process under an O/S, a
memory read for a thread inside the CPU). I don't see how this point is
getting overlooked. If SMT won't work, then an O/S that runs more processes
than CPUs won't work. No, it might not help if _all_ processes are
screaming along in L1 cache, but that isn't realistic for real-world
applications. They are going to have a non-trivial memory bandwidth
requirement, and SMT lets other threads fill in the gaps while memory reads
are hanging.

>effective_ipc = 3 / (3 / 1 + 6) = 0.33 ipc
>num_ht_cpus = 3 / 0.33 = 9
>
>Making unrealistic assumptions that favor more HT CPUs only shows HT
>scaling to 9 CPUs before the threads start trampling over each other. Note
>that I also unrealistically assumed that the CPU can execute all u-ops
>regardless of the circumstances. I did assume that all cache accesses hit
>the L2, but accesses to main memory are infrequent for general
>applications. This is a fair assumption since Intel isn't marketing the P4
>to people interested in chess; they're marketing the P4 to many different
>people with many different applications.

And you are making the assumption that none of the threads are waiting on
memory. If you run 9 threads, I'd bet that at any point in time fewer than
half of them are ready to actually fire micro-ops into the CPU core. Most
are going to be hung up waiting on the memory read unit to supply needed
data into a register.
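For what it's worth, your equations are easy to plug in and play with. A
quick Python sketch, using the exact numbers you posted, plus one extra
case that swaps in a real-memory latency (the ~360-clock figure I work out
below) instead of an L2 hit:

    def effective_ipc(ave_ins_between_mem, ave_ipc, mem_latency):
        return ave_ins_between_mem / (ave_ins_between_mem / ave_ipc + mem_latency)

    def num_ht_cpus(max_ipc, eff_ipc):
        return max_ipc / eff_ipc

    eff = effective_ipc(3, 1, 6)       # 0.33 ipc, as posted
    print(num_ht_cpus(3, eff))         # -> 9.0 threads

    eff = effective_ipc(3, 3, 6)       # the ave_ipc = 3.0 variant below
    print(num_ht_cpus(3, eff))         # -> 7.0 threads

    eff = effective_ipc(3, 1, 360)     # same model, main-memory latency
    print(num_ht_cpus(3, eff))         # -> 363 threads

The only number that really matters in that model is mem_latency; move it
from an L2 hit to a real memory read and the "optimum" thread count
explodes.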
>I might also point out that the number of instructions between memory
>accesses is -usually- a good bit higher than 3. Some routines can go 20+
>instructions between memory accesses. It all depends on what you're
>computing.

Maybe. But then again, once you do that memory access, you sit for several
hundred clock cycles, giving _plenty_ of spare time for other threads to
run. Which is the point, as I said.

>Also, most threads get more than 1 u-op/cycle in the execution unit. This
>starves other threads. If you recompute the above, making the assumption
>that a thread can fully utilize the CPU with an ave_ipc of 3.0, it makes
>HT less useful (ideal number = 7.0 instead of 9.0). This is no surprise,
>since HT is designed around the central idea that an application cannot
>fully utilize 3 u-ops/cycle of execution bandwidth.

I won't disagree there, because that basic assumption is perfectly valid.
Running nine small compute-bound tasks is not very realistic. Running 20
large applications banging on memory makes much more sense... and there HT
will smoke.

>As a final disclaimer, I know that this model used a lot of flawed
>numbers. However, all but one favored proliferation of HT. The only figure
>that would allow HT to scale efficiently to higher numbers is the latency
>of a memory access. If this were significant at all for general
>applications, turning off the cache wouldn't make such a huge difference,
>since general applications would already be missing cache often.

I don't understand that statement. Latency is _the_ issue today, and it
will be 2x worse when CPUs go 2x faster tomorrow. A single memory read
today takes on the order of 120ns absolute best case, so a 3GHz processor
will wait 360 clocks for it. And since the two integer units are
double-clocked, we miss the opportunity to execute at _least_ 720 micro-ops
per integer unit. That leaves a _lot_ of time to factor in other threads to
eat up those idle clock cycles...
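To make the multiprogramming analogy concrete, here is a toy Python
simulation, with made-up illustrative numbers (40 cycles of useful work
between reads, then a 360-cycle stall per the arithmetic above), of how
idle cycles get soaked up as threads are added to one core:

    # Each thread alternates `compute` busy cycles with a `stall`-cycle
    # memory wait; the core runs whichever thread is ready.
    def core_utilization(n_threads, compute=40, stall=360, horizon=200_000):
        ready_at = [0] * n_threads   # cycle when each thread's data arrives
        busy, t = 0, 0
        while t < horizon:
            runnable = [i for i in range(n_threads) if ready_at[i] <= t]
            if runnable:
                i = runnable[0]
                busy += compute              # run until the next memory read
                t += compute
                ready_at[i] = t + stall      # then wait for the read
            else:
                t = min(ready_at)            # everyone stalled: core idles
        return busy / t

    for n in (1, 2, 4, 8, 10):
        print(n, round(core_utilization(n), 2))
    # -> 1 thread keeps the core ~10% busy; ~10 threads saturate it

A fixed compute/stall pattern like this is obviously a caricature of a real
workload, but it shows why "more logical CPUs than execution bandwidth" is
not absurd at all.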
>>There is _nothing_ inherently wrong with taking the process scheduler
>>completely out of the O/S and dropping it into the CPU. It makes perfect
>>sense: the operating system's context-switching time is so long that we
>>can't afford to block processes for memory accesses, and have to block
>>only for things that take far longer, like I/O operations. Inside the CPU
>>this context-switching time becomes nearly zero, and the gains are
>>significant.
>>
>>It might take a while to get there, but you will see 16-way SMT one day.
>>Just as surely as you will one day see 4 CPUs on a single chip...
>
>HT isn't about context-switching. It's about parallelism and concurrency.
>Threads in HT are concurrent. HT has as much to do with process scheduling
>as SMP does -- just about nil.

Absolutely incorrect. On a single-CPU machine, the operating system has to
switch back and forth between two compute-bound processes. On a
hyper-threaded CPU this does _not_ happen: both processes get scheduled
onto the CPU at once, and the CPU interlaces them much more efficiently.
That ought to be obvious if you have any O/S experience...

>You keep touting that the processor can do things while a thread blocks on
>memory, but this is extremely rare. HT is really to take advantage of
>execution units not normally utilized. While the memory aspect is
>appealing for a guy designing a chess engine, Intel didn't add HT to
>appeal to a niche crowd. In the general case, HT isn't going to do a lot
>for you.

The _first_ implementation is not doing badly, and the _second_ generation
will be better. This is not a "new topic"; it has been discussed in
computer architecture circles for 10 years now. Only recently has chip
density reached the point where it became possible to actually do it. It
won't be too long before it is just as standard as super-scalar
architectures are today. 10 years ago many were saying super-scalar
wouldn't work; it _does_ work, and it works well as the kinks are ironed
out. HT/SMT is no different, and it will be just as normal on all
processors in a few years as super-scalar is today.

>
>-Matt