Computer Chess Club Archives


Subject: Re: DIEP NUMA SMP at P4 3.06Ghz with Hyperthreading

Author: Matt Taylor

Date: 03:28:38 12/17/02


On December 15, 2002 at 09:39:17, Robert Hyatt wrote:

>On December 15, 2002 at 02:18:44, Matt Taylor wrote:
>
>>On December 14, 2002 at 13:38:25, Robert Hyatt wrote:
>>
>>>On December 14, 2002 at 02:01:20, Matt Taylor wrote:
>>>
>>>>On December 13, 2002 at 22:51:35, Robert Hyatt wrote:
>>>>
>>>>>On December 13, 2002 at 21:55:09, Matt Taylor wrote:
>>>>>
>>>>>><snip>
>>>>>>>>the problem is you lose time to the ECC and registered features of the
>>>>>>>>memory you need for the dual. of course that's the case for all duals.
>>>>>>>>both K7 MP and Xeon suffer from that regrettably.
>>>>>>>
>>>>>>>That is not true.  The duals do _not_ have to have ECC ram.  And it doesn't
>>>>>>>appear to be any slower than non-ECC ram although I will be able to test that
>>>>>>>before long as we have some non-ECC machines coming in.
>>>>>>
>>>>>>Actually he is correct about the registered ram. The "registered" feature is
>>>>>>that it delays longer than unregistered ram. This is important for stability. It
>>>>>>doesn't affect bandwidth, but it does affect latency.
>>>>>>
>>>>>><snip>
>>>>>>>>With some luck by the time they release a 3.06Ghz Xeon they have improved
>>>>>>>>the SMT another bit.
>>>>>>>>
>>>>>>>>Seems to me they working for years to get that SMT/HT slowly better working.
>>>>>>>
>>>>>>>Not "for years".  It was announced as a coming thing a couple of years ago and
>>>>>>>several
>>>>>>>vendors have been discussing the idea.  And they are going to increase the ratio
>>>>>>>of physical
>>>>>>>to logical cpus before long also...
>>>>>>
>>>>>>I don't think so. HT won't scale terribly well. I made another post about that,
>>>>>>and I won't reiterate what I said there.
>>>>>>
>>>>>>-Matt
>>>>>
>>>>>
>>>>>I don't see why the two of you make such sweeping generalizations.  What is
>>>>>to prevent modifying the L1 cache to spit out 256 bits of data at once?  There
>>>>>is nothing internally that can't be improved over time, and the idea of a
>>>>>4-way hyper-threaded cpu should eventually be just as effective as four
>>>>>completely separate cpus, although the price should be substantially lower.
>>>>
>>>>It's not really a "sweeping generalization." I was quite specific. I said HT
>>>>won't scale very well.
>>>
>>>And that _is_ a sweeping generalization.  HT can potentially scale just as well
>>>as adding additional CPUs can scale, and probably better because the logical
>>>cpus are so tightly coupled that there is a savings in hardware space/cost that
>>>makes them very attractive.
>>>
>>>
>>>>
>>>>Currently, cost will prohibit the scaling of HT. You can argue that R&D will
>>>>eventually make it cost-effective, but that doesn't mean it will happen
>>>>tomorrow. Also, complexity of sharing the cache among threads, the size of the
>>>>cache, and the width of the cache all contribute to cost and overall chip
>>>>complexity. Internally, IA-32 processors are only 64-bits wide. This is easy to
>>>>verify. Throughput for single-precision floating-point SSE operations is only
>>>>twice the throughput for the SISD FPU.
>>>
>>>The cache isn't really shared among threads.  So far as the cache is concerned
>>>there are no threads, just data to feed to the cpu...  Also, nothing says that
>>>IA32 will remain on a 64 bit internal processor bus.  Just look at the evolution
>>>from the first 32 bit processor to today.  And those changes are transparent to
>>>the outside world so that they can be done by Intel when they choose without
>>>affecting existing code.  Ditto for making the L1 trace datapath wider to dump
>>>more micro-ops at once.
>>
>>thread 1 accesses memory at 1 MB
>>thread 2 accesses memory (concurrently) at 2 MB
>>
>>There are now 2 cache lines, and each thread has only made one access.
>>How is this not sharing the cache?
>
>We were talking about the "trace cache".  Stuff there is intermingled between
>the two threads but it doesn't look like two threads at all, just one big thread
>with a lot of micro-ops.
>
>The L1 data and L2 unified caches certainly contain stuff for both...
>
>>
>>Also, it is possible to widen the bus, but it is not practical. A wider internal
>>bus means more silicon means more complexity means more difficult to build in
>>small space. I will not say it is impossible, but at this time it is
>>impractical.
>
>If you do two caches you have the _same_ problem.  You have to have a data
>path from each...
>
>>
>>Consider: if a 128-bit bus were practical at this point in time, why would Intel
>>push SSE so hard when the latest greatest Pentium 4 only has a 64-bit internal
>>bus? It means that, at best, SSE instructions are only twice as fast as existing
>>scalar instructions. It also means that 3DNow! is just as fast as SSE.
>
>It is already practical.  As is a 256 bit bus.  See alpha.

Practical for Alpha because it was designed with that in mind. Not practical for
Intel because die real-estate is already costly, and at this point it would also
require significant modification of the Pentium 4 core itself.

According to Eugene, the point of HT is that they can extract more performance
without impeding profit margins.

I think it would be a good idea, but since Intel hasn't already done it, I
wouldn't expect them to do it in the future.

>>>>My comment, however, was based on the fact that HT shares the execution units
>>>>among threads. Neglecting memory accesses briefly, the worst-case performance on
>>>>a P4 is 33% of capacity (100% capacity = 3 u-ops/cycle). This means that at most
>>>>3 concurrent HT threads will max out the CPU. Of course, HT should actually
>>>>scale a little higher since the threads must all pause to access memory
>>>>occasionally. The question becomes, "How often do threads access memory?" A
>>>>single thread accesses memory quite often, but most accesses are cache hits.
>>>>
>>>
>>>Your math is _way_ wrong there.  If I do a memory read in one thread, the CPU
>>>is going to sit idle for hundreds of clocks.  There is _plenty_ of room to
>>>interlace other threads, particularly when most threads do memory reads here
>>>and there.
>>>
>>>Your statement is identical with saying that a modern processor should never be
>>>asked to run more than one process at a time because it has only one cpu to
>>>use.  Yet 99.9% of those processes don't need the CPU for long periods of
>>>time...
>>
>>I'm not claiming it should never run more than one process. I'm just saying that
>>HT is going to hit diminishing returns very quickly. The first HT CPU only gets
>>30-40% added performance off of a CPU without HT. Adding a third logical CPU
>>will only be able to consume cycles when both logical HT CPUs are waiting on
>>memory. Adding a fourth only consumes the cycles when all 3 HT CPUs are waiting.
>>Etc.
>
>
>Yes, but with memory speeds, things are waiting _all_ the time.  Just like
>processes trying to smoke a disk spend 99.99% of their time waiting.
>
>>Also, I mentioned in another post that it does not take hundreds of cycles to
>>cache miss. It depends on the multiplier, bus latency of the RAS and CAS, and
>>locality of reference (whether or not RAS latency must be incurred). For my
>>Thunderbird 1.2 GHz (multiplier of 9), this is 90 clocks at most. For my
>>AthlonMP 1600 CPUs, neglecting the SMP factor, this is 105 clocks at most.
>
>All you have to do is run a couple of programs, and use the MSR stuff to
>time things.  You will see that most every cache miss _does_ take 300-400
>clock cycles, because of the way the addressing is done.  You have a very small
>window of time to pick up on that attempt at streaming data out of a RDRAM-type
>chip.  If you miss it, you start over.  It works fine for programs stomping
>serially thru memory.  But that is not a high-probability event.

If you use RDRAM, your latency will obviously be higher. The window isn't quite
as bad since RDRAM runs 400-533 MHz, though.

>>The one factor I didn't even consider with HT is that two processes may actually
>>impede performance if not scheduled carefully. If process X and Y are doing 2
>>different things, they will cache thrash and cause significantly more memory
>>accesses to incur the full penalty.
>
>Certainly possible.  However, with a threaded application this is not a
>problem, and that is where this seems to be at its best.  But it also works
>pretty well for separate applications of the right mix.
>
>Of course, running 4 compute-bound processes is not exactly efficient on today's
>machines anyway, due to context switching and cache invalidation...  but it
>still works efficiently enough to make everyone do it.

For two, yes...but if four were vying for the small number of shared resources, I
would expect more cache thrashing. Eight would split that 8K L1 data cache among
eight threads. Sure, some of it will be shared data, but a -lot- of the data in L1
is stack. If you start thrashing the stack, your threads are going to be severely
penalized.
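
To put rough numbers on the split (this assumes the P4's 8K L1 data cache with
64-byte lines; thread counts beyond two are hypothetical, since only 2-way HT
exists today):

/* l1_share.c - back-of-envelope: how little L1 each logical CPU would get.
 * Assumes an 8 KB L1 data cache with 64-byte lines; the higher thread
 * counts are hypothetical. */
#include <stdio.h>

int main(void)
{
    const int l1_bytes  = 8 * 1024;
    const int line_size = 64;
    int threads;

    for (threads = 1; threads <= 8; threads *= 2)
        printf("%d thread(s): %4d bytes, %3d cache lines each\n",
               threads, l1_bytes / threads, l1_bytes / threads / line_size);
    return 0;
}

At eight threads that is about a kilobyte, sixteen lines, per thread. That isn't
much room for a stack frame plus whatever data the thread is actually working on.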

>>>>Here are some equations for computation of optimum number of HT CPUs:
>>>>effective_ipc = ave_ins_between_mem / (ave_ins_between_mem / ave_ipc +
>>>>mem_latency)
>>>>num_ht_cpus = max_ipc / effective_ipc
>>>>
>>>>Now, some P4/P4 Xeon numbers:
>>>>max_ipc = 3 u-ops/cycle
>>>>ave_ipc = 1 u-op/cycle (assuming worst-case)
>>>>ave_ins_between_mem = 3 u-ops/cycle (also lower than actual)
>>>>mem_latency = 2 cycles (for L1 cache)
>>>
>>>
>>>L1 cache isn't _the_ issue.  Real memory is.  And it is getting worse as 1gb is
>>>the normal memory size now.  That is a 2048 to 1 mapping to L2 cache, and far
>>>worse to L1.  Since real programs access memory frequently, HT will _always_
>>>work just as multiprogramming works on a single-cpu machine, because most
>>>threads are doing something that takes a long time to complete (I/O to disk or
>>>whatever for processes in an O/S, memory reads to a thread in the CPU).
>>
>>Real programs access the stack frequently. Only a small number of programs cache
>>thrash. Again, Intel isn't adding HT because of niche markets; they're adding HT
>>for the general market.
>
>_your_ "real program" might access the stack a lot.  The "real programs" I
>run do not.  So we are in different application worlds, which is ok.

Every function call you make is a stack access. Every local variable you use is
a stack access. Some of it gets kept in registers, but once you need more than
about 3 registers, the rest spills back to the stack. Inevitably, the
register-starved x86 architecture is always going back to the stack.
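
A minimal sketch of what I mean (a hypothetical function, and the exact spills
depend on the compiler; the point is that with x86's handful of usable registers,
the arguments and temporaries typically end up in the stack frame):

/* stack_demo.c - illustrative only.  Compile it with a typical x86 compiler
 * and look at the assembly: the call pushes a return address, the arguments
 * are pushed on the stack (cdecl), and with more live values than spare
 * registers the temporaries get spilled to the frame as well. */
#include <stdio.h>

static int dot3(int ax, int ay, int az, int bx, int by, int bz)
{
    int x = ax * bx;   /* each argument use is typically a stack load */
    int y = ay * by;
    int z = az * bz;
    return x + y + z;  /* temporaries beyond the free registers get spilled */
}

int main(void)
{
    printf("%d\n", dot3(1, 2, 3, 4, 5, 6));   /* 1*4 + 2*5 + 3*6 = 32 */
    return 0;
}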

>>Also, I might add that the I/O burst cycle in the OS is much more skewed than
>>memory accesses at the assembly level. A single I/O instruction can take
>>thousands of clocks or more. Additional OS overhead makes it worse. Computation
>>in between each I/O isn't usually very long time-wise in comparison to
>>computation between memory accesses.
>
>They are within an order of magnitude.  Good disks today have 3.5ms delays.
>Good controllers with a big ram cache drop this by 50%.  But the idea is the
>same.  While waiting on one thread, do something for another if possible.  If
>it isn't possible, too bad.  If it is possible, there is a gain.
>
>Nobody says this is a "golden bullet" that fixes everything.  But it does
>make things go faster for the normal cases, just like a multiprogramming O/S
>does.  And just like the multiprogramming O/S, there are situations that will
>make it do _worse_ than if it just ran one process at a time, period.  But
>those are the _exception_ and not the rule, which is the point for HT.
>
>>>I don't see how this point is getting overlooked.  If SMT won't work, then an
>>>O/S that runs more processes than CPUs won't work.  No, it might not help if
>>>_all_ processes are screaming along in L1 cache, but that isn't realistic for
>>>real-world applications.  They are going to have a non-trivial memory bandwidth
>>>requirement and SMT lets other threads fill in the gaps when memory reads are
>>>hanging.
>>
>>Please show me where I said HT "won't work." I said HT "won't scale." That's a
>>completely different issue. Shared-bus multiprocessing works, but it doesn't
>>scale. I'm saying that HT is subject to the same limitations.
>
>
>We will just have to agree to disagree.  A couple of years will answer
>the question as to who is on track here...

I suppose. Perhaps within a couple of years, AMD will implement it as well.

>>>>effective_ipc = 3 / (3 / 1 + 6) = 0.33 ipc
>>>>num_ht_cpus = 3 / 0.33 = 9
>>>>
>>>>Making unrealistic assumptions that favor more HT CPUs only shows HT scaling to
>>>>9 CPUs before the threads start trampling over each other. Note that I also
>>>>unrealistically assumed that the CPU can execute all u-ops regardless of the
>>>>circumstances. I did assume that all cache accesses hit the L2, but accesses to
>>>>main memory are infrequent for general applications. This is a fair assumption
>>>>since Intel isn't marketting the P4 to people interested in chess; they're
>>>>marketting the P4 to many different people with many different applications.
>>>
>>>And you are making the assumption that all the threads are not requiring
>>>memory.  If you run 9 threads, I'd bet that at any point in time less than 1/2
>>>of them are ready to actually fire off micro-ops into the CPU core.  Most are
>>>going to be hung up waiting on the memory read unit to supply needed data into
>>>a register.
>>
>>That is most often not the case. If that were so, why doesn't HT see sweeping
>>gains instead of 30-40%? If the threads spent even half of their lives blocked
>>on memory, HT would see at least a 100% performance increase. I say "at least
>>100%" because HT also allows threads to concurrently execute to utilize more
>>execution bandwidth.
>>
>
>All you have to do is think about the question.  HT runs _two_ threads.  If
>_both_ are waiting on memory, nothing good will happen.  That is common.  On
>a multiprogramming O/S, if both threads do nothing but I/O there is no
>benefit.

If HT runs 2 threads and both stall on memory frequently, adding more threads
means waiting on memory will take even longer, and the additional threads won't
give a higher overall IPC.

If, in the other extreme case, threads stall on memory rarely, slightly higher
IPC can be extracted up to a limit (because not 100% of execution bandwidth can
be utilized per cycle by a single thread).

In either case, I don't see how what I've said (that HT will scale poorly) is
inaccurate.

Obviously the actual behavior lies somewhere between these extremes, skewed but
near the middle, since Intel estimates a 30-40% speed-up. Since neither of the
cases above holds completely true, HT might scale higher than 2 logical CPUs.
Maybe to 4. My opinion is that they won't hit 8 or beyond. I don't think Intel
will see enough potential gain from 4 to implement it. Four would be more awkward
for the Pentium 4, anyway; it can only execute up to 3 u-ops/cycle.
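
For anyone who wants to play with the numbers, here is the back-of-envelope model
I posted earlier as a small C program. The inputs are the same assumptions as
before, not measurements; with the 6-cycle memory latency used in the arithmetic
above it reproduces the 0.33 effective IPC and 9 logical CPU figures, and plugging
in a full DRAM miss penalty instead shows exactly where our estimates diverge.

/* ht_scaling.c - the back-of-envelope model quoted earlier in the thread.
 * All inputs are assumptions; with mem_latency = 6 this reproduces the
 * 0.33 effective IPC and ~9 logical CPU figures.  Substitute a realistic
 * DRAM miss penalty (hundreds of cycles) and the count explodes. */
#include <stdio.h>

int main(void)
{
    double max_ipc             = 3.0;  /* P4 peak issue rate, u-ops/cycle       */
    double ave_ipc             = 1.0;  /* assumed sustained rate per thread     */
    double ave_ins_between_mem = 3.0;  /* assumed u-ops between memory accesses */
    double mem_latency         = 6.0;  /* assumed cycles per memory access      */

    double effective_ipc = ave_ins_between_mem /
                           (ave_ins_between_mem / ave_ipc + mem_latency);
    double num_ht_cpus   = max_ipc / effective_ipc;

    printf("effective IPC = %.2f, logical CPUs to saturate = %.1f\n",
           effective_ipc, num_ht_cpus);
    return 0;
}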

>>>>I might also point out that the number of instructions between memory accesses
>>>>is -usually- a good bit higher than 3. Some routines can go 20+ instructions in
>>>>between accessing memory. It all depends on what you're computing.
>>>
>>>Maybe.  But then again, once you do that memory access you sit for several
>>>hundred clock cycles.  Giving _plenty_ of spare time for other threads to run.
>>>Which is the point, as I said.
>>
>>Around 100 for full penalty on both of my machines. Pentium 4 on RDRAM suffers
>>much worse, partly because the multiplier is enormous. Most cache misses won't
>>even incur the full penalty on well-written software.
>>
>
>
>You are not going to get memory data back in 100 cycles on a 3ghz machine.
>YOu are not going to get it back in 300 cycles...
>
>This from actual benchmark data, not from vendors claims...

Something like 80-270 cycles on a 3 GHz machine, extrapolating from data I
measured earlier tonight, which correlates closely with the timing figure you
posted (135 ns). The 270 cycles assumes the worst case. If you don't change rows,
you incur only about 80 cycles, which is much lower. Not great, but not as
terrible as 300.

>>However, if threads are so often blocked, why does HT only see 30-40% speed
>>increase? As I recall, your figures showed only about 16% speed increase.
>>Granted you did not have the pause instruction, but you also say that you don't
>>lock very often.
>
>I answered that already.  Same problem with two I/O bound processes.
>
>>>>Also, most threads get more than 1 u-op/cycle in the execution unit. This
>>>>starves other threads. If you recompute the above making the assumption that a
>>>>thread can fully utilize the CPU with an ave_ipc of 3.0, it makes HT less useful
>>>>(ideal number = 7.0 instead of 9.0). This is no suprise since HT is designed
>>>>around the central idea that an application cannot fully utilize 3 u-ops/cycle
>>>>execution bandwidth.
>>>>
>>>
>>>
>>>I won't disagree there, because their basic assumption is perfectly valid.
>>>Running nine small compute-bound tasks is not very realistic.  Running 20 large
>>>applications banging on memory makes much more sense...
>>>
>>>And there HT will smoke.
>>>
>>>>As a final disclaimer, I know that this model used a lot of flawed numbers.
>>>>However, all but one favored proliferation of HT. The only figure that would
>>>>allow HT to scale efficiently to higher numbers is the latency of a memory
>>>>access. If this were significant at all for general applications, turning off
>>>>the cache wouldn't make such a huge difference since general applications would
>>>>be cache missing often already.
>>>
>>>I don't understand the statement.  Latency is _the_ issue today.  It will be 2x
>>>worse when cpus go 2x faster tomorrow.  On a single memory read today, which
>>>takes on the order of 120ns absolute best case, a 3ghz processor will wait for
>>>360 clocks.  And since the 2x integer units are double-clocked, we miss the
>>>opportunity to execute at _least_ 720 micro-ops per integer unit.  That leaves
>>>a _lot_ of time to factor in other threads to eat up those idle clock cycles...
>>
>>120ns is worst case.
>
>No it is the _typical_ case.  Just run a real random-access memory pattern.
>You are not going to beat 120ns with _any_ DRAM-based memory system.

Random access for me was 138 ns. Serial access was 37 ns.
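
For what it's worth, numbers like those come from a pointer-chasing loop. Below
is a rough sketch of that kind of test, not the exact program I ran; the array
size, the stride, and the gettimeofday() timing are all illustrative choices.

/* memlat.c - rough sketch of a pointer-chasing latency test.  The chain is
 * built with a large stride so each load misses the caches; a serial chain
 * (chain[i] = (i + 1) % ENTRIES) measures the streaming case instead.
 * Timing with gettimeofday() is coarse but fine over millions of loads. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define ENTRIES (8 * 1024 * 1024)    /* tens of MB of pointers, far bigger than L2 */
#define ITERS   (10 * 1000 * 1000)

int main(void)
{
    size_t *chain = malloc(ENTRIES * sizeof *chain);
    size_t i, p;
    struct timeval t0, t1;
    double ns;

    if (!chain)
        return 1;

    /* Odd stride is coprime with the power-of-two size, so the chain visits
     * every slot once before repeating, and the jumps are far too large for
     * the hardware to prefetch. */
    for (i = 0; i < ENTRIES; i++)
        chain[i] = (i + 12345677) % ENTRIES;

    gettimeofday(&t0, NULL);
    p = 0;
    for (i = 0; i < ITERS; i++)
        p = chain[p];                /* each load depends on the previous one */
    gettimeofday(&t1, NULL);

    ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_usec - t0.tv_usec) * 1e3) / ITERS;
    printf("average load-to-load latency: %.1f ns (ignore: %lu)\n",
           ns, (unsigned long)p);
    free(chain);
    return 0;
}

Swap in the serial chain and the per-access time drops to roughly the streaming
figure, which is the difference between the two numbers above.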

>>My admission was that if applications incur memory latency -often-, it will make
>>HT scale well. My point was that this doesn't happen with general applications.
>
>My point is that it happens in _all_ applications.  Which hurts the scaling,
>since there is no way to speed up those parallel memory references since memory
>is a serial device.

For one, vector manipulations of any sort take time, fit snugly in a P4 cache
line, and are usually accessed serially, allowing prefetching. Video/audio data
can almost -always- be accessed serially. Databases are practically RDRAM's
flagship. Most compute-bound applications I can think of off the top of my head
don't suffer from repeated random memory probes. Serial memory probes aren't
nearly as slow since you don't incur the full penalty, you can prefetch, and you
get the benefit of the cache.

-Matt


