Computer Chess Club Archives


Subject: Re: final note I presume

Author: Robert Hyatt

Date: 20:22:32 12/16/02

On December 16, 2002 at 17:47:49, Matt Taylor wrote:

>On December 15, 2002 at 22:01:10, Robert Hyatt wrote:
>
>>On December 15, 2002 at 05:53:37, Matt Taylor wrote:
>>
>>>On December 14, 2002 at 14:10:14, Robert Hyatt wrote:
>>>
>>>>>>>HT will not scale to large numbers of tasks. The IA-32 register set has 8 32-bit
>>>>>>>general registers, 8 80-bit FPU registers, 2 16-bit FPU control registers, and 8
>>>>>>>128-bit SSE registers. This means each logical CPU requires 244 bytes of
>>>>>>>application register file alone. For simplicity, I did not include the 3 groups
>>>>>>>of system registers, the MTRRs, or the MSRs. There are additional caches which
>>>>>>>would not allow HT to scale unless they were duplicated. Intel is not about to
>>>>>>>put MBs of fast-access register file on an IA-32 processor. It would make your
>>>>>>>128-cpu HT Pentium 5 cost more than a cluster of Itaniums with negligible
>>>>>>>performance gain over a dual- or quad-Xeon system.
>>>>>>
>>>>>>Want to bet?  10 years ago they could not put cache on-chip due to space
>>>>>>limitations.  That is now gone.  With up to 2mb of L2 cache on chip today,
>>>>>>they really can do whatever they want in the future.  And there are not
>>>>>>really duplicate sets of registers.  Just duplicate rename tables which are
>>>>>>much smaller since they only store small pointers to real registers, rather
>>>>>>than 32 bit values.
>>>>>
>>>>>If HT does not have 2 physical sets of registers, what do the remap tables point
>>>>>to? Intel docs actually state that the logical CPU has a duplicated set of
>>>>>registers.
>>>>
>>>>No they don't.  They have one large set of rename registers, and _two_ rename
>>>>tables.  This has been discussed here in the past and can be found in the
>>>>intel docs.  If you had two separate sets of registers, how would the
>>>>execution units deal with that?  The answer:  they couldn't.  But by using one
>>>>large set of registers, with a separate rename table for each processor, they
>>>>simply take care to be sure that two concurrent threads do not rename eax to
>>>>the same rename register, and then there is no confusion and the execution
>>>>units don't care which thread produced the micro-ops being executed.  It makes
>>>>the design _way_ simpler and it is definitely elegant...
>>>
>>>If you read what you just said, you're arguing the same thing I said. I wasn't
>>>specific with semantics, but for 2 logical CPUs you need at least twice the
>>>basic number of physical registers. For 8 CPUs, you will need 8 times that
>>>number. You can't have 32,768 CPUs without 32,768 physical sets of registers. It
>>>doesn't matter if they get lumped together. You still need space.
>>
>>yes...  but there is a big difference between two separate register files and
>>one larger register file.    And if you are having problems executing
>>instructions for all threads, you certainly aren't using all the rename
>>registers that are available...
>>
>>But the interesting thing is that to go to 4-way SMT it is very likely that you
>>don't have to go to 4x the rename register file size.  You can tune for the
>>typical case and just allow it to run out in rare cases, and stall the decoder
>>until some instructions retire and free up more registers.
>
>You would still need somewhere to store the registers. They have to be
>persistent. Due to the architecture, the OS has to supply a task state segment.
>It would be possible to dump the registers to the segment, but it would be even
>more costly (incurring extra memory writes), and it would only work for
>integer/segment registers.

Not quite.  The OS doesn't get access to the rename register space, just the
normal EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP, etc.  Even O/S instructions go
through the "rename space" automatically as they are decoded into micro-ops.

There's no reason to try to dump the rename register space, as it is completely
hidden from the O/S.  The OS only needs to save/restore the normal set of
registers above...
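
To make that distinction concrete, here is a minimal sketch (my own C, not
anything out of the Intel manuals) of roughly what the OS does have to save per
task.  The physical rename register file never appears in it; the CPU rebuilds
its rename mappings on the fly as the new task's instructions are decoded:

/*
 * Hypothetical per-task architectural state an IA-32 OS saves on a context
 * switch (roughly what a TSS or kernel stack frame holds).  Nothing here
 * refers to the rename register file, which is invisible to software.
 * FP/SSE state would be saved separately (e.g. with FXSAVE).
 */
#include <stdio.h>

struct ia32_task_state {
    unsigned int   eax, ebx, ecx, edx;        /* general registers        */
    unsigned int   esi, edi, ebp, esp;        /* index/stack registers    */
    unsigned int   eip, eflags;               /* instruction ptr, flags   */
    unsigned short cs, ds, es, fs, gs, ss;    /* segment selectors        */
    unsigned int   cr3;                       /* page-directory base      */
};

int main(void)
{
    printf("architectural state per task: about %u bytes (plus padding)\n",
           (unsigned)sizeof(struct ia32_task_state));
    return 0;
}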



>
>>>>>Also, it is important to differentiate between registers and L2 cache which can
>>>>>take from 6-300 clocks to access. Even the 8 KB L1 cache that the P4 has takes 2
>>>>>cycles to access. If it were possible to build a chip with such vast amounts of
>>>>>fast cache memory, there would be no such thing as an on-chip cache hierarchy;
>>>>>there would be an L1 cache on-chip and -maybe- an L2 cache on the motherboard.
>>>>
>>>>
>>>>It isn't the "size" that counts.  It is the physical space that counts.  To
>>>>make it larger, you limit where you can put it on the chip.  The farther away,
>>>>the longer it takes to access the data.  Small caches can be tucked into holes
>>>>here and there.  But not a 2mb cache that takes 10-12M transistors just for
>>>>the flip-flops for the bits, not counting the addressing and so forth.
>>>
>>>8 times the number of registers = 8 times the size...
>>>When you include system registers, I would expect much closer to 1 KB or so of
>>>register space, and perhaps even more. I forgot to also include the segment
>>>registers, each a bit bulkier than 8 bytes. You said that size limits the
>>>distance. If so, then how does a large amount of register space -not- limit the
>>>distance?
>>
>>You don't need _that_ large an amount.  IE 16kb of register space is huge.  And
>>it is certainly doable on-chip and close by.  Even 1K is probably more than
>>enough to get thru at least 4-way SMT, that is 256 registers which is a bunch.
>
>8 GPRs * 4 bytes = 32 bytes
>1 flags * 4 bytes = 4 bytes
>8 FPU regs * 10 bytes = 80 bytes
>6 segregs * 8 bytes (actually more, CPU unpacks the data) = 48 bytes
>8 SSE regs * 16 bytes = 128 bytes

Renaming doesn't apply to FP/SSE so far as I know.



>4 control regs * 4 bytes = 16 bytes
>6 debug regs * 4 bytes = 24 bytes
>1 TR * 4 bytes (probably more with cache...) = 4 bytes
>1 GDTR * 6 bytes = 6 bytes
>1 IDTR * 6 bytes = 6 bytes
>1 LDTR * 6 bytes = 6 bytes
>
>I lost count of the MSRs around 128, and I scrolled through 4 or 5 pages more. I
>would estimate at least 160. Not all are writable, and not all are 64-bits wide.
>However, it would be safe to assume (I think) that you have at least 100*8 bytes
>= 800 bytes of r/w MSRs. (Most of the space is consumed by MSRs related to
>performance counters and machine check.) You have an additional 354 bytes of
>register space from the above register set, most of which can't be cached in
>main memory.
>

I don't think any of those are renamable.  Only the 8 primary registers play
into that, plus the flags, so far as I recall...  but that is a recollection
and I did not go back and verify it in the old Intel papers I have around.




>I did exclude the test registers, by the way, because even though they are r/w,
>they're used to latch commands to the TLB for testing.

Again, none of those are in the rename space.


>
>The segregs, GPRs, and LDTR all get loaded from the TSS. They can be discarded
>at will (for a hefty price in most cases). However, the bulk of the register set
>can't be.
>
>Also consider that AMD is pushing their 64-bit extensions to IA-32, and it seems
>probable that Intel will actually follow AMD's lead for the second time in the
>history of x86. There are rumors (which Intel will neither confirm nor deny) that
>Intel is working on a chip called Yamhill which is compatible with AMD x86-64.
>This doubles the number of SSE registers, doubles the number of GPRs, and
>doubles the width of the GPRs. That's 96+128=224 bytes more that you'll need.


Are you sure you understand the register renaming idea?  You don't duplicate
the entire set of registers.  You simply have a set of registers that the
rename table can point to.  And most registers are not renamed since they
are not often the subject of something like:

      mov  eax, something
      inc  eax
      mov  something, eax
      mov  eax, something_else
      dec  eax
      mov  something_else, eax

Register renaming turns that apparent set of operand conflicts (eax) into
two completely independent sets of three instructions that use a different
register for each set...

There would be no point in doing that for MSRs, and the like...  they aren't
used like the above.
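
For illustration only, here is a toy sketch in C of that idea; the table sizes
and the allocation policy are my own assumptions, not Intel's actual design.
Each write to eax is simply handed a fresh physical register, so the two
three-instruction groups above stop depending on each other:

/*
 * Toy register-renaming sketch, purely illustrative.  A rename table maps
 * each architectural register to whichever physical register currently
 * holds its value; every new write gets a fresh physical register.
 */
#include <stdio.h>

enum arch_reg { EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP, NUM_ARCH };

static int rename_map[NUM_ARCH];      /* architectural -> physical reg   */
static int next_phys = NUM_ARCH;      /* trivial allocator, never reuses */

static int rename_dest(enum arch_reg r)   /* a write gets a new phys reg */
{
    rename_map[r] = next_phys++;
    return rename_map[r];
}

static int rename_src(enum arch_reg r)    /* a read uses the current one */
{
    return rename_map[r];
}

int main(void)
{
    /* First group:  mov eax, something / inc eax / mov something, eax */
    rename_dest(EAX);
    printf("first group's eax lives in physical reg %d\n", rename_src(EAX));

    /* Second group: mov eax, something_else / dec eax / mov something_else, eax */
    rename_dest(EAX);
    printf("second group's eax lives in physical reg %d\n", rename_src(EAX));

    return 0;
}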







>
>>>>>The other argument worth making is that HT will hit diminishing returns very
>>>>>quickly. It -may- not even be worth going to quad-HT. The main reason why HT
>>>>>gets any performance gains is because two threads don't fully utilize the CPU's
>>>>>execution capacity. It is convenient when a cache miss occurs because one thread
>>>>>can utilize the full capacity of the processor, but across -most- applications
>>>>>that is rare. Additionally, the processor has the ability to speculatively fetch
>>>>>data and code. Cache misses are rare.
>>>>
>>>>They aren't rare.  And you aren't thinking the math thru.  Intel hopes for 95%
>>>>hits.  That leaves 5% misses.  And for each miss, we sit for 400 clocks or so
>>>>waiting.  IE one out of every 20 accesses misses, and the slowdown is
>>>>tremendous.  HT fills in those long idle periods pretty well and going to 4 or
>>>>8 should not hit any asymptote assuming they can feed micro-ops to the core
>>>>quickly enough...
>>>
>>>They can feed up to 3 u-ops/cycle. Splitting 3 u-ops/cycle among 4-8 threads...
>>>That also neglects the fact that now you have 4-8 threads competing for main
>>>memory, so it may take 4-8 times longer to fill a request.
>>
>>Of course it might.  But that is _the_ point.  For those cases, SMT won't do
>>much.  Any more than running two chess engines on a single cpu is very
>>efficient.  But for the right mix the gain is significant.  IE I can run one
>>compute-bound process and one heavily I/O-bound process and if they each take
>>one hour to run by themselves, they will not take much more than one hour to
>>run together.  Run two I/O jobs together and they take two hours.  Ditto for
>>two compute-bound processes.  But for reasonable "mixes" the gain is
>>significant, and for the optimal mixes, the gain is tremendous...
>>
>>SMT is no different.  It won't speed up all kinds of applications.  But it will
>>speed up many.  And in particular, it is very effective for threaded apps that
>>share everything thru the L1/L2 caches anyway...  making cache-thrashing less of
>>a problem.
>
>Cache thrashing isn't a problem if your scheduler knows well enough to schedule
>your other thread instead of John Doe's other thread on the physical CPU you're
>sharing.

_if_ it is ready to run.  That is, if it is ready to run at the time the second
logical processor becomes idle.  Etc...


>
>>>>>One of my machines has a BIOS option to disable the on-chip caches. When I
>>>>>disable it, my 1.2 GHz Thunderbird runs extremely slow. Every memory access
>>>>>effectively becomes a cache miss. If you have a machine with this option, you
>>>>>can try it and see. If cache misses happened often enough to make a viable
>>>>>impact on HT, you wouldn't see a big difference.
>>>>
>>>>Again, you are not thinking thru on the math.  Turn L1/L2 off and _everything_
>>>>takes 400 clocks to read.  Turn it on and 95% of the stuff doesn't, but if you
>>>>know anything about Amdahl's law, that last 5% is a killer, meaning that even
>>>>if you drop the cpu clock cycle time to zero picoseconds, the cpu could not
>>>>run 20x faster due to that last 5% being bound by memory speed, not processor
>>>>clock cycle time.
>>>
>>>Amdahl's Law applies to parallel computing. Memory accesses aren't really a
>>>parallel computation problem...
>>
>>They are in this context, because you can do anything you want to that 95%
>>of the code that runs in cache, but that last 5% is the killer and that is the
>>part that makes SMT work...
>>
>>
>>
>>>
>>>Actually, it doesn't take 400 clocks to read memory; it takes 105 clocks for my
>>>AthlonMP 1600 to read a random segment of memory for me (multiplier * 10 bus
>>>clocks). That assumes that you have to re-latch RAS. If my processor doesn't
>>>need to re-latch RAS, then it takes only about 27 clocks. Most memory accesses
>>>are stack accesses, and stack accesses are almost -always- cache hits.
>>
>>It only takes 10 clocks on a 100mhz machine.  But I am talking top-end which is
>>in the 3,000mhz range.  And there it is 300+ clocks, period...  You keep saying
>>that same thing about the stack.  And it is true for _some_ programs.  It is
>>_not_ true for many.  Crafty is one prime example that doesn't use much stack
>>space between procedure calls.  I have several other similar programs here
>>such as a couple that do molecular modeling, some simulation codes, and
>>similar things that use very large static arrays of data.
>
>A 3.06 GHz Pentium 4 has a multiplier of 23.0. On pc2100 DDR SDRAM, this is 230
>clocks worst-case, 43-56 clocks when RAS latency isn't incurred.

I will say this again, carefully.  I _run_ on these machines.  I haven't found
a single machine in a year that can produce that kind of low latency.  Just do
random accesses, spread out at _least_ as long as a cache line, and run it for
a good while.  I get 100-130ns as the average latency.  At 3.06GHz, that turns
into 300+ clocks...

Someone in r.g.c.c. argued the same point.  But when he ran the above test,
he came back and reported "You are correct, I get 135ns latency, not the
60 I was expecting."
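
For anyone who wants to try it, here is a rough sketch of that kind of test (my
own code; the 64-byte line size, 64MB working set, and iteration count are
assumptions).  It builds a random chain of pointers, one per cache line, and
then chases it, so every load depends on the one before it:

/*
 * Pointer-chasing latency sketch: random accesses spread at least a cache
 * line apart, timed over many iterations.  Sizes below are assumptions;
 * use a working set several times larger than L2 so most loads miss.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE   64                              /* assumed cache-line size  */
#define NLINES (64u * 1024 * 1024 / LINE)      /* 64 MB working set        */
#define STEPS  10000000L

int main(void)
{
    size_t stride = LINE / sizeof(void *);     /* pointers per cache line  */
    void **chain  = malloc((size_t)NLINES * LINE);
    size_t *order = malloc((size_t)NLINES * sizeof *order);
    if (!chain || !order)
        return 1;

    /* Build a random permutation of the cache lines (Fisher-Yates). */
    for (size_t i = 0; i < NLINES; i++)
        order[i] = i;
    srand(1);
    for (size_t i = NLINES - 1; i > 0; i--) {
        /* combine two rand() calls so the range covers a large array */
        size_t j = (((size_t)rand() << 15) ^ (size_t)rand()) % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }

    /* Link line i to line i+1 in permuted order, one pointer per line. */
    for (size_t i = 0; i < NLINES; i++)
        chain[order[i] * stride] = &chain[order[(i + 1) % NLINES] * stride];

    /* Chase the chain: each load depends on the previous one. */
    void **p = &chain[order[0] * stride];
    clock_t t0 = clock();
    for (long i = 0; i < STEPS; i++)
        p = (void **)*p;
    clock_t t1 = clock();

    double ns = (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)STEPS;
    printf("average dependent-load latency: %.1f ns  (end=%p)\n", ns, (void *)p);
    free(chain);
    free(order);
    return 0;
}

This dependent random walk is the case being described above: no two
consecutive loads share a line, so the measured number is dominated by main
memory latency rather than by the cache or the theoretical RAS/CAS figures.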


>
>AMD was, I believe, the first to introduce explicit prefetch capabilities to
>their K6-2. Intel added prefetch capabilities in the P3, and every CPU since has
>had it. It doesn't benefit all applications, but applications that can operate
>on "very large static arrays of data" benefit quite nicely, particularly when
>manipulations take more than a few cycles of time. I was reading a nice document
>from Intel the other day published ca. '99 about computation of prefetch
>scheduling distance.

Notice I am talking about random access.  No prefetching in the world will
help that.  And that is the worst-case latency example.  And it is going to
be well over 100ns no matter what all the prefetching examples suggest...

Prefetching will actually hurt, in that case.




>
>>>>>>>HT is merely a way to make the existing hardware more efficient. If it were
>>>>>>>anything more, it would add -additional- hardware registers so the OS could
>>>>>>>control the scheduling algorithm and specify the location of the ready queue. It
>>>>>>>would also add instructions that would allow the processor to switch tasks.
>>>>>>
>>>>>>The processor _already_ is doing this.  But for processes that are ready to
>>>>>>run rather than for processes that are long-term-blocked for I/O, etc.
>>>>>
>>>>>Yes, but the scheduler's job is to pick who runs, when they run, and how long
>>>>>they run. HT only affects the first by allowing the scheduler to pick two tasks
>>>>>to run instead of just one. HT isn't replacing the scheduler; it only
>>>>>complicates it.
>>>>
>>>>No.  It simplifies the task when I can say "here, run both of these" rather
>>>>than having to bounce back and forth as a quantum expires.  It will get better
>>>>when HT goes to 4-way and beyond...
>>>
>>>Again, as far as the OS is concerned, HT = SMP.
>>
>>Certainly...  I hope I never said otherwise. But price-wise, HT != SMP
>>by a huge margin.
>
>Depends on how you look at it. My AthlonMP box cost about $1,100 for the
>-entire- system when it was brand new, and I opted for a nice case and 1 GB of
>ram. Intel is charging $700 for an HT-enabled P4, more than what my board with
>both chips cost. :-)
>
>Undoubtedly though it isn't costing much extra to add HT capabilities to the
>chip.


This is a different issue.  Intel is Intel.  AMD is AMD.  Why does a Lincoln
sell for more than a Mercury?  Quality is not that different.  Name recognition
is, however.




>
>>>Are you trying to explain to me that SMP schedulers are simpler than single-CPU
>>>schedulers? Synchronizing ready queues and picking multiple processes out is
>>>easier than not synchronizing and picking the one off the top?
>>>
>>>-Matt
>>
>>
>>Let's define simpler.  Do you mean "less source code"?  If so, no.  If you mean
>>"simpler mechanism to choose between two compute-bound processes" then a
>>resounding _yes_.  Let's offload the process scheduling, the quantum nonsense,
>>and everything else related to that into the CPU hardware and right out of
>>something the O/S is spending time doing.  Bouncing between two tasks at the
>>O/S level is a _painful_ thing to do in terms of overhead cost.  Doing it at the
>>SMT level is essentially free...
>
>I would still say no to the "simpler mechanism to choose between two
>compute-bound processes." The quantum nonsense still exists, and process
>scheduling still occurs. It occurs twice as often.

With two compute-bound processes it _never_ occurs, which is the point:
there are two processors and two runnable processes.  There is no need
to have quantum runout and run the process queue when you _know_ there is
nothing else waiting to run.

I can't vouch for all systems, but the old Xerox MP system I did O/S work on
certainly behaved like this...
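
A toy sketch of that point (my own C, not modeled on any particular kernel):
when the number of runnable tasks is no larger than the number of logical
processors, the quantum-expiry path has nothing to decide:

/*
 * Toy model of the scheduling point above.  With two logical processors and
 * exactly two runnable, compute-bound tasks, the timer tick never needs to
 * scan the run queue or switch tasks.
 */
#include <stdio.h>

#define NR_LOGICAL_CPUS 2        /* e.g. one HT-enabled physical chip */

static int nr_runnable = 2;      /* two compute-bound processes       */
static int reschedules = 0;

static void timer_tick(void)
{
    /* Everyone who can run is already running: let the current tasks keep
     * their processors and skip the run-queue scan entirely. */
    if (nr_runnable <= NR_LOGICAL_CPUS)
        return;

    reschedules++;               /* would scan the queue and switch here */
}

int main(void)
{
    for (int tick = 0; tick < 1000; tick++)
        timer_tick();
    printf("reschedules over 1000 ticks: %d\n", reschedules);
    return 0;
}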



>
>Obviously the OS scheduler -can't- schedule with such a fine granularity because
>of scheduling overhead. HT would simplify a scheduler that tried to, but since
>none can, it's more of an extension to existing schedulers. It doesn't simplify
>anything; it just makes it possible to schedule at sub-nanosecond intervals.
>
>-Matt


It lets me pick the only two runnable processes, get them going, and then go
away until an interrupt occurs.  That is more efficient.


