Computer Chess Club Archives



Subject: Re: final note I presume

Author: Robert Hyatt

Date: 19:01:10 12/15/02


On December 15, 2002 at 05:53:37, Matt Taylor wrote:

>On December 14, 2002 at 14:10:14, Robert Hyatt wrote:
>
>>>>>HT will not scale to large numbers of tasks. The IA-32 register set has 8 32-bit
>>>>>general registers, 8 80-bit FPU registers, 2 16-bit FPU control registers, and 8
>>>>>128-bit SSE registers. This means each logical CPU requires 244 bytes of
>>>>>application register file alone. For simplicity, I did not include the 3 groups
>>>>>of system registers, the MTRRs, or the MSRs. There are additional caches which
>>>>>would not allow HT to scale unless they were duplicated. Intel is not about to
>>>>>put MBs of fast-access register file on an IA-32 processor. It would make your
>>>>>128-cpu HT Pentium 5 cost more than a cluster of Itaniums with negligible
>>>>>performance gain over a dual- or quad-Xeon system.
>>>>
>>>>Want to bet?  10 years ago they could not put cache on-chip due to space
>>>>limitations.  That is now gone.  With up to 2mb of L2 cache on chip today,
>>>>they really can do whatever they want in the future.  And there are not
>>>>really duplicate sets of registers.  Just duplicate rename tables which are
>>>>much smaller since they only store small pointers to real registers, rather
>>>>than 32 bit values.
>>>
>>>If HT does not have 2 physical sets of registers, what do the remap tables point
>>>to? Intel docs actually state that the logical CPU has a duplicated set of
>>>registers.
>>
>>No they don't.  They have one large set of rename registers, and _two_ rename
>>tables.  This has been discussed here in the past and can be found in the
>>Intel docs.  If you had two separate sets of registers, how would the
>>execution units deal with that?  The answer: they couldn't.  But by using one
>>large set of registers, with a separate rename table for each processor, they
>>simply take care to be sure that two concurrent threads do not rename eax to
>>the same rename register, and then there is no confusion and the execution
>>units don't care which thread produced the micro-ops being executed.  It
>>makes the design _way_ simpler and it is definitely elegant...
>
>If you read what you just said, you're arguing the same thing I said. I wasn't
>specific with semantics, but for 2 logical CPUs you need at least twice the
>basic number of physical registers. For 8 CPUs, you will need 8 times that
>number. You can't have 32,768 CPUs without 32,768 physical sets of registers. It
>doesn't matter if they get lumped together. You still need space.

yes...  but there is a big difference between two separate register files and
one larger register file.    And if you are having problems executing
instructions for all threads, you certainly aren't using all the rename
registers that are available...

But the interesting thing is that to go to 4-way SMT it is very likely that you
don't have to go to 4x the rename register file size.  You can tune for the
typical case and just allow it to run out in rare cases, and stall the decoder
until some instructions retire and free up more registers.
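A toy model of the arrangement described above: one shared physical register pool and one rename table per logical CPU, with a stall when the pool runs dry.  The pool size, register names, and the reclaim-on-overwrite shortcut here are all illustrative, not Intel's:

```python
# Toy model: one shared physical register pool, one rename table per
# logical CPU.  Sizes and names are illustrative only.
PHYS_REGS = 6                         # deliberately tiny, to show a stall
free_list = list(range(PHYS_REGS))
rename_table = [dict(), dict()]       # one map per logical CPU

def write_reg(thread, arch_reg):
    """Rename arch_reg for one thread; None models a decoder stall
    (waiting for retirement to free a physical register)."""
    if not free_list:
        return None
    phys = free_list.pop()
    old = rename_table[thread].get(arch_reg)
    if old is not None:
        free_list.append(old)         # crude stand-in for retirement
    rename_table[thread][arch_reg] = phys
    return phys

# Both threads rename "eax", but never to the same physical register:
p0 = write_reg(0, "eax")
p1 = write_reg(1, "eax")
print(p0, p1, p0 != p1)

# Enough live mappings exhausts the shared pool -- the "stall" case:
for r in ("ebx", "ecx", "edx"):
    write_reg(0, r)
write_reg(1, "ebx")
print(write_reg(1, "ecx"))            # None: out of physical registers
```

The execution units never see the rename tables at all, which is the point being made above: micro-ops carry physical register numbers, so they don't care which thread produced them.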


>
>>>Also, it is important to differentiate between registers and L2 cache which can
>>>take from 6-300 clocks to access. Even the 8 KB L1 cache that the P4 has takes 2
>>>cycles to access. If it were possible to build a chip with such vast amounts of
>>>fast cache memory, there would be no such thing as an on-chip cache hierarchy;
>>>there would be an L1 cache on-chip and -maybe- an L2 cache on the motherboard.
>>
>>
>>It isn't the "size" that counts.  It is the physical space that counts.  To
>>make it larger, you limit where you can put it on the chip.  The farther
>>away, the longer it takes to access the data.  Small caches can be tucked
>>into holes here and there.  But not a 2mb cache that takes 10-12M transistors
>>just for the flip-flops for the bits, not counting the addressing and so
>>forth.
>
>8 times the number of registers = 8 times the size...
>When you include system registers, I would expect much closer to 1 KB or so of
>register space, and perhaps even more. I forgot to also include the segment
>registers, each a bit bulkier than 8 bytes. You said that size limits the
>distance. If so, then how does a large amount of register space -not- limit the
>distance?

You don't need _that_ large an amount.  IE 16kb of register space is huge.  And
it is certainly doable on-chip and close by.  Even 1K is probably more than
enough to get thru at least 4-way SMT, that is 256 registers which is a bunch.
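As a sanity check, the 244-byte figure quoted earlier in the thread follows directly from the register counts given there, and even quadrupling it for 4-way SMT stays under the 1K mentioned above:

```python
# Checking the byte counts quoted in this thread for one logical CPU's
# architectural registers (counts taken from the post above):
gp  = 8 * 4      # eight 32-bit general registers
x87 = 8 * 10     # eight 80-bit FPU registers
ctl = 2 * 2      # two 16-bit FPU control registers
sse = 8 * 16     # eight 128-bit SSE registers
per_thread = gp + x87 + ctl + sse
print(per_thread)        # 244 bytes, the figure quoted above
print(4 * per_thread)    # 976 bytes: 4-way SMT still fits in ~1K
```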

>
>>>The other argument worth making is that HT will hit diminishing returns very
>>>quickly. It -may- not even be worth going to quad-HT. The main reason why HT
>>>gets any performance gains is because two threads don't fully utilize the CPU's
>>>execution capacity. It is convenient when a cache miss occurs because one thread
>>>can utilize the full capacity of the processor, but across -most- applications
>>>that is rare. Additionally, the processor has the ability to speculatively fetch
>>>data and code. Cache misses are rare.
>>
>>They aren't rare.  And you aren't thinking the math thru.  Intel hopes for
>>95% hits.  That leaves 5% misses.  And for each miss, we sit for 400 clocks
>>or so waiting.  IE one out of every 20 accesses misses, and the slowdown is
>>tremendous.  HT fills in those long idle periods pretty well and going to 4
>>or 8 should not hit any asymptote assuming they can feed micro-ops to the
>>core quickly enough...
>
>They can feed up to 3 u-ops/cycle. Splitting 3 u-ops/cycle among 4-8 threads...
>That also neglects the fact that now you have 4-8 threads competing for main
>memory, so it may take 4-8 times longer to fill a request.

Of course it might.  But that is _the_ point.  For those cases, SMT won't do
much, any more than running two chess engines on a single cpu is efficient.
But for the right mix the gain is significant.  IE I can run one
compute-bound process and one heavily I/O-bound process, and if they each take
one hour to run by themselves, they will not take much more than one hour to
run together.  Run two I/O jobs together and they take two hours.  Ditto for
two compute-bound processes.  But for reasonable "mixes" the gain is
significant, and for the optimal mixes, the gain is tremendous...
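The job-mix arithmetic above can be sketched with an idealized overlap model (the function and its perfect-overlap assumption are mine, for illustration):

```python
def wall_time(jobs):
    """jobs: list of (cpu_hours, io_hours) pairs.  Idealized model:
    jobs overlap perfectly, so wall time is set by the busiest resource."""
    cpu = sum(c for c, _ in jobs)
    io  = sum(i for _, i in jobs)
    return max(cpu, io)

print(wall_time([(1, 0), (0, 1)]))   # compute + I/O mix: 1 hour
print(wall_time([(0, 1), (0, 1)]))   # two I/O jobs: 2 hours
print(wall_time([(1, 0), (1, 0)]))   # two compute jobs: 2 hours
```

Real overlap is never perfect, but the shape of the result is the same: jobs that contend for one resource serialize, jobs on different resources run nearly for free.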

SMT is no different.  It won't speed up all kinds of applications.  But it will
speed up many.  And in particular, it is very effective for threaded apps that
share everything thru the L1/L2 caches anyway...  making cache-thrashing less of
a problem.




>
>>>One of my machines has a BIOS option to disable the on-chip caches. When I
>>>disable it, my 1.2 GHz Thunderbird runs extremely slow. Every memory access
>>>effectively becomes a cache miss. If you have a machine with this option, you
>>>can try it and see. If cache misses happened often enough to make a viable
>>>impact on HT, you wouldn't see a big difference.
>>
>>Again, you are not thinking thru on the math.  Turn L1/L2 off and
>>_everything_ takes 400 clocks to read.  Turn it on and 95% of the stuff
>>doesn't, but if you know anything about Amdahl's law, that last 5% is a
>>killer, meaning that even if you drop the cpu clock cycle time to zero
>>picoseconds, the cpu could not run 20x faster due to that last 5% being
>>bound by memory speed, not processor clock cycle time.
>
>Amdahl's Law applies to parallel computing. Memory accesses aren't really a
>parallel computation problem...

They are in this context, because you can do anything you want to that 95%
of the code that runs in cache, but that last 5% is the killer and that is the
part that makes SMT work...
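The 95%/5% argument above, worked out numerically.  The 2-clock hit cost is an assumption (borrowed from the L1 latency quoted earlier in the thread); the 400-clock miss penalty is the figure used in the posts above:

```python
hit_rate  = 0.95
hit_cost  = 2        # clocks (assumed, from the L1 figure quoted earlier)
miss_cost = 400      # clocks (figure used in the posts above)

amat = hit_rate * hit_cost + (1 - hit_rate) * miss_cost
print(round(amat, 2))              # about 21.9 clocks average per access

# Amdahl-style bound: make everything except the misses infinitely
# fast, and the miss stalls alone cap the speedup.
miss_share  = (1 - hit_rate) * miss_cost / amat
max_speedup = 1 / miss_share
print(round(miss_share, 3), round(max_speedup, 3))
```

With these numbers the miss stalls are over 90% of the average access time, so even an infinitely fast core gains barely 10%.  That stall time is exactly the idle capacity SMT fills.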



>
>Actually, it doesn't take 400 clocks to read memory; it takes 105 clocks for my
>AthlonMP 1600 to read a random segment of memory for me (multiplier * 10 bus
>clocks). That assumes that you have to re-latch RAS. If my processor doesn't
>need to re-latch RAS, then it takes only about 27 clocks. Most memory accesses
>are stack accesses, and stack accesses are almost -always- cache hits.

It only takes 10 clocks on a 100mhz machine.  But I am talking top-end which is
in the 3,000mhz range.  And there it is 300+ clocks, period...  You keep saying
that same thing about the stack.  And it is true for _some_ programs.  It is
_not_ true for many.  Crafty is one prime example that doesn't use much stack
space between procedure calls.  I have several other similar programs here
such as a couple that do molecular modeling, some simulation codes, and
similar things that use very large static arrays of data.
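The clocks-per-miss numbers traded above are the same physical latency counted at different clock speeds.  Assuming a round 100 ns DRAM round-trip for illustration:

```python
latency_ns = 100                        # assumed DRAM round-trip, for illustration
for mhz in (100, 3000):
    clocks = latency_ns * mhz // 1000   # ns times cycles-per-ns
    print(mhz, "MHz:", clocks, "clocks per miss")
# 10 clocks at 100 MHz, 300 at 3000 MHz -- the figures argued above
```

The DRAM doesn't get slower as the core gets faster; the core just spends more cycles waiting, which is why the miss penalty keeps growing in clock terms.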


>
>>>>>HT is merely a way to make the existing hardware more efficient. If it were
>>>>>anything more, it would add -additional- hardware registers so the OS could
>>>>>control the scheduling algorithm and specify the location of the ready queue. It
>>>>>would also add instructions that would allow the processor to switch tasks.
>>>>
>>>>The processor _already_ is doing this.  But for processes that are ready to
>>>>run rather than for processes that are long-term-blocked for I/O, etc.
>>>
>>>Yes, but the scheduler's job is to pick who runs, when they run, and how long
>>>they run. HT only affects the first by allowing the scheduler to pick two tasks
>>>to run instead of just one. HT isn't replacing the scheduler; it only
>>>complicates it.
>>
>>No.  It simplifies the task when I can say "here, run both of these", rather
>>than having to bounce back and forth as a quantum expires.  It will get
>>better when HT goes to 4-way and beyond...
>
>Again, as far as the OS is concerned, HT = SMP.

Certainly...  I hope I never said otherwise. But price-wise, HT != SMP
by a huge margin.





>
>Are you trying to explain to me that SMP schedulers are simpler than single-CPU
>schedulers? Synchronizing ready queues and picking multiple processes out is
>easier than not synchronizing and picking the one off the top?
>
>-Matt


Let's define simpler.  Do you mean "less source code"?  If so, no.  If you mean
"simpler mechanism to choose between two compute-bound processes" then a
resounding _yes_.  Let's offload the process scheduling, the quantum nonsense,
and everything else related to that into the CPU hardware and right out of
something the O/S is spending time doing.  Bouncing between two tasks at the
O/S level is a _painful_ thing to do in terms of overhead cost.  Doing it at the
SMT level is essentially free...





Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.