Author: Matt Taylor
Date: 02:53:37 12/15/02
On December 14, 2002 at 14:10:14, Robert Hyatt wrote:

>>>>HT will not scale to large numbers of tasks. The IA-32 register set has 8 32-bit general registers, 8 80-bit FPU registers, 2 16-bit FPU control registers, and 8 128-bit SSE registers. This means each logical CPU requires 244 bytes of application register file alone. For simplicity, I did not include the 3 groups of system registers, the MTRRs, or the MSRs. There are additional caches which would not allow HT to scale unless they were duplicated. Intel is not about to put MBs of fast-access register file on an IA-32 processor. It would make your 128-cpu HT Pentium 5 cost more than a cluster of Itaniums, with negligible performance gain over a dual- or quad-Xeon system.
>>>
>>>Want to bet? 10 years ago they could not put cache on-chip due to space limitations. That limitation is now gone. With up to 2 MB of L2 cache on chip today, they really can do whatever they want in the future. And there are not really duplicate sets of registers, just duplicate rename tables, which are much smaller since they only store small pointers to real registers rather than 32-bit values.
>>
>>If HT does not have 2 physical sets of registers, what do the remap tables point to? The Intel docs actually state that each logical CPU has a duplicated set of registers.
>
>No, they don't. They have one large set of rename registers and _two_ rename tables. This has been discussed here in the past and can be found in the Intel docs. If you had two separate sets of registers, how would the execution units deal with that? The answer: they couldn't. But by using one large set of registers with a separate rename table for each processor, they simply take care that two concurrent threads do not rename eax to the same rename register. Then there is no confusion, and the execution units don't care which thread produced the micro-ops being executed.
>It makes the design _way_ simpler, and it is definitely elegant...

If you read what you just said, you're arguing the same thing I said. I wasn't specific with the semantics, but for 2 logical CPUs you need at least twice the basic number of physical registers. For 8 logical CPUs, you need 8 times that number. You can't have 32,768 logical CPUs without 32,768 sets' worth of physical registers. It doesn't matter whether they get lumped into one pool; you still need the space.

>>Also, it is important to differentiate between registers and L2 cache, which can take from 6-300 clocks to access. Even the 8 KB L1 cache on the P4 takes 2 cycles to access. If it were possible to build a chip with such vast amounts of fast cache memory, there would be no such thing as an on-chip cache hierarchy; there would be an L1 cache on-chip and -maybe- an L2 cache on the motherboard.
>
>It isn't the "size" that counts. It is the physical space that counts. Making it larger limits where you can put it on the chip, and the farther away it is, the longer it takes to access the data. Small caches can be tucked into holes here and there. But not a 2 MB cache that takes 10-12M transistors just for the flip-flops for the bits, not counting the addressing and so forth.

8 times the number of registers = 8 times the size... When you include the system registers, I would expect much closer to 1 KB or so of register space per logical CPU, and perhaps even more. I also forgot to include the segment registers, each a bit bulkier than 8 bytes. You said that size limits the distance. If so, then how does a large amount of register space -not- limit the distance?

>>The other argument worth making is that HT will hit diminishing returns very quickly. It -may- not even be worth going to quad-HT. The main reason HT gets any performance gain is that two threads don't fully utilize the CPU's execution capacity.
>>It is convenient when a cache miss occurs, because one thread can then utilize the full capacity of the processor, but across -most- applications that is rare. Additionally, the processor has the ability to speculatively fetch data and code. Cache misses are rare.
>
>They aren't rare. And you aren't thinking the math through. Intel hopes for 95% hits. That leaves 5% misses, and for each miss we sit for 400 clocks or so waiting. I.e., one out of every 20 accesses misses, and the slowdown is tremendous. HT fills in those long idle periods pretty well, and going to 4 or 8 threads should not hit any asymptote, assuming they can feed micro-ops to the core quickly enough...

They can feed up to 3 u-ops/cycle. Splitting 3 u-ops/cycle among 4-8 threads... That also neglects the fact that you would then have 4-8 threads competing for main memory, so it may take 4-8 times longer to fill a request.

>>One of my machines has a BIOS option to disable the on-chip caches. When I disable them, my 1.2 GHz Thunderbird runs extremely slowly, because every memory access effectively becomes a cache miss. If you have a machine with this option, you can try it and see. If cache misses happened often enough to have a real impact on HT, you wouldn't see such a big difference.
>
>Again, you are not thinking the math through. Turn L1/L2 off and _everything_ takes 400 clocks to read. Turn it on and 95% of the stuff doesn't, but if you know anything about Amdahl's law, that last 5% is a killer: even if you drop the CPU clock cycle time to zero picoseconds, the CPU could not run 20x faster, because that last 5% is bound by memory speed, not processor clock cycle time.

Amdahl's Law applies to parallel computing. Memory accesses aren't really a parallel computation problem... Actually, it doesn't take 400 clocks to read memory; it takes 105 clocks for my AthlonMP 1600 to read a random segment of memory (the multiplier times 10 bus clocks). That assumes you have to re-latch RAS.
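To put numbers on this, here is a quick back-of-the-envelope sketch (Python purely for the arithmetic; the 95% hit rate and 400-clock penalty are the figures quoted above, the 105-clock penalty is my Athlon measurement, and the 1-clock hit cost is an assumption for simplicity):

```python
# Rough model of average memory-access cost for a given hit rate.
# Assumed figures: 1-clock cache hit, 95% hit rate, and either a
# ~400-clock miss penalty (the number argued above) or my measured
# 105-clock AthlonMP penalty when RAS must be re-latched.

def avg_access_clocks(hit_rate, hit_cost, miss_cost):
    """Expected clocks per access for a given hit rate."""
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost

# With a 400-clock miss penalty, the average access is ~21 clocks:
print(avg_access_clocks(0.95, 1, 400))

# The Amdahl-style bound: even with an infinitely fast core, the 5%
# of accesses that go to memory cap the overall speedup at 1/0.05 = 20x.
print(1.0 / (1.0 - 0.95))

# With my 105-clock penalty instead, the average drops to ~6 clocks:
print(avg_access_clocks(0.95, 1, 105))
```

Note how sensitive the average is to the miss penalty: the whole disagreement above is really about which penalty number is realistic.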
If my processor doesn't need to re-latch RAS, it takes only about 27 clocks. Most memory accesses are stack accesses, and stack accesses are almost -always- cache hits.

>>>>HT is merely a way to make the existing hardware more efficient. If it were anything more, it would add -additional- hardware registers so the OS could control the scheduling algorithm and specify the location of the ready queue. It would also add instructions that allow the processor to switch tasks.
>>>
>>>The processor _already_ is doing this, but for processes that are ready to run rather than for processes that are long-term blocked on I/O, etc.
>>
>>Yes, but the scheduler's job is to pick who runs, when they run, and how long they run. HT only affects the first, by allowing the scheduler to pick two tasks to run instead of just one. HT isn't replacing the scheduler; it only complicates it.
>
>No. It simplifies the task when I can say "here, run both of these" rather than having to bounce back and forth as each quantum expires. It will get better when HT goes to 4-way and beyond...

Again, as far as the OS is concerned, HT = SMP. Are you trying to tell me that SMP schedulers are simpler than single-CPU schedulers? That synchronizing ready queues and picking multiple processes out is easier than not synchronizing and picking the one off the top?

-Matt
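For reference, the register-space arithmetic from earlier in the thread can be checked the same way (a sketch only; the per-register sizes are the ones quoted above, and the system, MTRR, MSR, and segment registers are deliberately left out, matching the original 244-byte figure):

```python
# Per-logical-CPU IA-32 application register state, as counted earlier
# in this thread (system registers, MTRRs, MSRs, and segment registers
# deliberately excluded, matching the 244-byte figure).
regs_bytes = {
    "general (8 x 32-bit)":     8 * 4,
    "FPU data (8 x 80-bit)":    8 * 10,
    "FPU control (2 x 16-bit)": 2 * 2,
    "SSE (8 x 128-bit)":        8 * 16,
}
per_cpu = sum(regs_bytes.values())
print(per_cpu)  # 244 bytes

# That state has to exist somewhere, however it is pooled into one
# physical register file, so it scales linearly with logical CPUs:
for n in (2, 8, 128):
    print(n, "logical CPUs:", n * per_cpu, "bytes")
```

Even lumped into a single renamed pool, the 128-way case needs over 30 KB of architectural state alone, which is the scaling point being argued.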