Computer Chess Club Archives


Subject: Re: DIEP NUMA SMP at P4 3.06Ghz with Hyperthreading

Author: Eugene Nalimov

Date: 11:39:16 12/15/02

I have seen results from special tools that analyzed memory accesses on a
reasonably large (32-CPU) NUMA system running some database application. Memory
latency for "non-local" accesses is, according to the specifications, somewhere
around 200 cycles.

In the real-world case, with a lot of bus traffic, special ("very rare") cases
not mentioned in the documentation, etc., the worst-case latency was slightly
more than 700 cycles.

And of course the CPU frequency was much less than 3GHz.
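
(For what it is worth, a pointer-chasing loop like the sketch below is the
usual way to see latencies of that order on a real box. It is only an
illustration I am adding here, not the tool mentioned above, and the buffer
size, iteration count, and shuffle are arbitrary choices.)

/*
 * Rough pointer-chasing sketch: walk a randomly permuted cycle of pointers so
 * each load depends on the previous one and the prefetcher can't help.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (64 * 1024 * 1024 / sizeof(void *))  /* 64 MB worth of pointers */
#define ITERS (16 * 1024 * 1024)

int main(void)
{
    void **buf = malloc(N * sizeof(void *));
    size_t *perm = malloc(N * sizeof(size_t));
    size_t i;

    /* Build a random cyclic permutation of the buffer slots. */
    for (i = 0; i < N; i++)
        perm[i] = i;
    for (i = N - 1; i > 0; i--) {
        size_t j = (((size_t)rand() << 15) ^ (size_t)rand()) % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (i = 0; i < N; i++)
        buf[perm[i]] = &buf[perm[(i + 1) % N]];

    /* Chase the pointers; every load must wait for the one before it. */
    struct timespec t0, t1;
    void **p = &buf[perm[0]];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%p: ~%.1f ns per dependent load\n", (void *)p, ns / ITERS);

    free(perm);
    free(buf);
    return 0;
}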

Thanks,
Eugene

On December 15, 2002 at 09:39:17, Robert Hyatt wrote:

>On December 15, 2002 at 02:18:44, Matt Taylor wrote:
>
>>On December 14, 2002 at 13:38:25, Robert Hyatt wrote:
>>
>>>On December 14, 2002 at 02:01:20, Matt Taylor wrote:
>>>
>>>>On December 13, 2002 at 22:51:35, Robert Hyatt wrote:
>>>>
>>>>>On December 13, 2002 at 21:55:09, Matt Taylor wrote:
>>>>>
>>>>>><snip>
>>>>>>>>The problem is you lose time to the ECC and registered features of the
>>>>>>>>memory you need for the dual. Of course that's the case for all duals.
>>>>>>>>Both K7 MP and Xeon suffer from that, regrettably.
>>>>>>>
>>>>>>>That is not true.  The duals do _not_ have to have ECC ram.  And it doesn't
>>>>>>>appear to be any slower than non-ECC ram although I will be able to test
>>>>>>>that before long as we have some non-ECC machines coming in.
>>>>>>
>>>>>>Actually he is correct about the registered ram. The "registered" feature is
>>>>>>that it delays longer than unregistered ram. This is important for stability. It
>>>>>>doesn't affect bandwidth, but it does affect latency.
>>>>>>
>>>>>><snip>
>>>>>>>>With some luck, by the time they release a 3.06Ghz Xeon they will have
>>>>>>>>improved the SMT another bit.
>>>>>>>>
>>>>>>>>Seems to me they have been working for years to get that SMT/HT slowly
>>>>>>>>working better.
>>>>>>>
>>>>>>>Not "for years".  It was announced as a coming thing a couple of years ago and
>>>>>>>several
>>>>>>>vendors have been discussing the idea.  And they are going to increase the ratio
>>>>>>>of physical
>>>>>>>to logical cpus before long also...
>>>>>>
>>>>>>I don't think so. HT won't scale terribly well. I made another post about that,
>>>>>>and I won't reiterate what I said there.
>>>>>>
>>>>>>-Matt
>>>>>
>>>>>
>>>>>I don't see why the two of you make such sweeping generalizations.  What is
>>>>>to prevent modifying the L1 cache to spit out 256 bits of data at once?  There
>>>>>is nothing internally that can't be improved over time, and the idea of a
>>>>>4-way hyper-threaded cpu should eventually be just as effective as four
>>>>>completely separate cpus, although the price should be substantially lower.
>>>>
>>>>It's not really a "sweeping generalization." I was quite specific. I said HT
>>>>won't scale very well.
>>>
>>>And that _is_ a sweeping generalization.  HT can potentially scale just as
>>>well as adding additional CPUs can scale, and probably better because the
>>>logical cpus are so tightly coupled that there is a savings in hardware
>>>space/cost that makes them very attractive.
>>>
>>>
>>>>
>>>>Currently, cost will prohibit the scaling of HT. You can argue that R&D will
>>>>eventually make it cost-effective, but that doesn't mean it will happen
>>>>tomorrow. Also, complexity of sharing the cache among threads, the size of the
>>>>cache, and the width of the cache all contribute to cost and overall chip
>>>>complexity. Internally, IA-32 processors are only 64 bits wide. This is easy to
>>>>verify. Throughput for single-precision floating-point SSE operations is only
>>>>twice the throughput for the SISD FPU.
>>>
>>>The cache isn't really shared among threads.  So far as the cache is
>>>concerned there are no threads, just data to feed to the cpu...  Also,
>>>nothing says that IA32 will remain on a 64 bit internal processor bus.
>>>Just look at the evolution from the first 32 bit processor to today.  And
>>>those changes are transparent to the outside world so that they can be done
>>>by Intel when they choose without affecting existing code.  Ditto for making
>>>the L1 trace datapath wider to dump more micro-ops at once.
>>
>>thread 1 accesses memory at 1 MB
>>thread 2 accesses memory (concurrently) at 2 MB
>>
>>There are now 2 cache lines, and each thread has only made one access.
>>How is this not sharing the cache?
>
>We were talking about the "trace cache".  Stuff there is intermingled between
>the two threads but it doesn't look like two threads at all, just one big thread
>with a lot of micro-ops.
>
>The L1 data and L2 unified caches certainly contain stuff for both...
>
>>
>>Also, it is possible to widen the bus, but it is not practical. A wider internal
>>bus means more silicon means more complexity means more difficult to build in
>>small space. I will not say it is impossible, but at this time it is
>>impractical.
>
>If you do two caches you have the _same_ problem.  You have to have a data
>path from each...
>
>>
>>Consider: if a 128-bit bus were practical at this point in time, why would Intel
>>push SSE so hard when the latest greatest Pentium 4 only has a 64-bit internal
>>bus? It means that, at best, SSE instructions are only twice as fast as existing
>>scalar instructions. It also means that 3DNow! is just as fast as SSE.
>
>
>
>It is already practical.  As is a 256 bit bus.  See Alpha.
>
>
>
>
>>
>>>>My comment, however, was based on the fact that HT shares the execution units
>>>>among threads. Neglecting memory accesses briefly, the worst-case performance on
>>>>a P4 is 33% of capacity (100% capacity = 3 u-ops/cycle). This means that at most
>>>>3 concurrent HT threads will max out the CPU. Of course, HT should actually
>>>>scale a little higher since the threads must all pause to access memory
>>>>occasionally. The question becomes, "How often do threads access memory?" A
>>>>single thread accesses memory quite often, but most accesses are cache hits.
>>>>
>>>
>>>Your math is _way_ wrong there.  If I do a memory read in one thread, the CPU
>>>is going to sit idle for hundreds of clocks.  There is _plenty_ of room to
>>>interlace other threads, particularly when most threads do memory reads here
>>>and there.
>>>
>>>Your statement is identical to saying that a modern processor should never be
>>>asked to run more than one process at a time because it has only one cpu to
>>>use.  Yet 99.9% of those processes don't need the CPU for long periods of
>>>time...
>>
>>I'm not claiming it should never run more than one process. I'm just saying that
>>HT is going to hit diminishing returns very quickly. The first HT CPU only gets
>>30-40% added performance off of a CPU without HT. Adding a third logical CPU
>>will only be able to consume cycles when both logical HT CPUs are waiting on
>>memory. Adding a fourth only consumes the cycles when all 3 HT CPUs are waiting.
>>Etc.
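
(To put rough numbers on that diminishing-returns argument -- this is my own
back-of-the-envelope model, not anything from either poster or from Intel: if
each logical cpu independently spends a fraction p of its time stalled on
memory, the core is only idle when all n of them are stalled at once, so
utilization grows roughly like 1 - p^n and flattens out very quickly as n
grows.)

#include <math.h>
#include <stdio.h>

int main(void)
{
    double p = 0.5;   /* assumed fraction of time each logical cpu is stalled */
    int n;

    /* Utilization ~ 1 - p^n: 0.50, 0.75, 0.88, 0.94, ... diminishing returns */
    for (n = 1; n <= 8; n++)
        printf("n = %d logical cpus -> utilization ~ %.2f\n", n, 1.0 - pow(p, n));
    return 0;
}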
>
>
>Yes, but with memory speeds, things are waiting _all_ the time.  Just like
>processes trying to smoke a disk spend 99.99% of their time waiting.
>
>
>
>
>>
>>Also, I mentioned in another post that it does not take hundreds of cycles to
>>cache miss. It depends on the multiplier, bus latency of the RAS and CAS, and
>>locality of reference (whether or not RAS latency must be incurred). For my
>>Thunderbird 1.2 GHz (multiplier of 9), this is 90 clocks at most. For my
>>AthlonMP 1600 CPUs, neglecting the SMP factor, this is 105 clocks at most.
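
(Working backward from those "at most" figures, they are consistent with about
ten front-side-bus clocks of DRAM latency multiplied by the CPU multiplier.
The sketch below just redoes that arithmetic; the ten-bus-clock DRAM figure and
the 10.5x multiplier for the Athlon MP 1600+ are my inferences, not numbers
stated in the post.)

#include <stdio.h>

int main(void)
{
    double dram_bus_clocks = 10.0;  /* assumed worst-case DRAM latency in 133 MHz FSB clocks */
    double mult_tbird      = 9.0;   /* Thunderbird 1.2 GHz = 9.0 x 133 MHz                   */
    double mult_mp1600     = 10.5;  /* Athlon MP 1600+ (1.4 GHz) = 10.5 x 133 MHz            */

    printf("Thunderbird: ~%.0f core clocks\n", mult_tbird  * dram_bus_clocks);  /* ~90  */
    printf("Athlon MP:   ~%.0f core clocks\n", mult_mp1600 * dram_bus_clocks);  /* ~105 */
    return 0;
}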
>
>All you have to do is run a couple of programs, and use the MSR stuff to
>time things.  You will see that most every cache miss _does_ take 300-400
>clock cycles, because of the way the addressing is done.  You have a very small
>window of time to pick up on that attempt at streaming data out of a RDRAM-type
>chip.  If you miss it, you start over.  It works fine for programs stomping
>serially thru memory.  But that is not a high-probability event.
>
>
>
>
>>
>>The one factor I didn't even consider with HT is that two processes may actually
>>impede performance if not scheduled carefully. If process X and Y are doing 2
>>different things, they will cache thrash and cause significantly more memory
>>accesses to incur the full penalty.
>
>Certainly possible.  However, with a threaded application this is not a
>problem, and that is where this seems to be at its best.  But it also works
>pretty well for separate applications of the right mix.
>
>Of course, running 4 compute-bound processes is not exactly efficient on today's
>machines anyway, due to context switching and cache invalidation...  but it
>still works efficiently enough to make everyone do it.
>
>>
>>>>Here are some equations for computing the optimum number of HT CPUs:
>>>>effective_ipc = ave_ins_between_mem / (ave_ins_between_mem / ave_ipc + mem_latency)
>>>>num_ht_cpus = max_ipc / effective_ipc
>>>>
>>>>Now, some P4/P4 Xeon numbers:
>>>>max_ipc = 3 u-ops/cycle
>>>>ave_ipc = 1 u-op/cycle (assuming worst-case)
>>>>ave_ins_between_mem = 3 u-ops (also lower than actual)
>>>>mem_latency = 6 cycles (for the L2 cache)
>>>
>>>
>>>L1 cache isn't _the_ issue.  Real memory is.  And it is getting worse as 1gb
>>>is the normal memory size now.  That is a 2048 to 1 mapping to L2 cache, and
>>>far worse to L1.  Since real programs access memory frequently, HT will
>>>_always_ work just as multiprogramming works on a single-cpu machine, because
>>>most threads are doing something that takes a long time to complete (I/O to
>>>disk or whatever for processes in an O/S, memory reads to a thread in the
>>>CPU).
>>
>>Real programs access the stack frequently. Only a small number of programs cache
>>thrash. Again, Intel isn't adding HT because of niche markets; they're adding HT
>>for the general market.
>
>_your_ "real program" might access the stack a lot.  The "real programs" I
>run do not.  So we are in different application worlds, which is ok.
>
>
>
>
>
>>
>>Also, I might add that the I/O burst cycle in the OS is much more skewed than
>>memory accesses at the assembly level. A single I/O instruction can take
>>thousands of clocks or more. Additional OS overhead makes it worse. Computation
>>in between each I/O isn't usually very long time-wise in comparison to
>>computation between memory accesses.
>
>They are within an order of magnitude.  Good disks today have 3.5ms delays.
>Good controllers with a big ram cache drop this by 50%.  But the idea is the
>same.  While waiting on one thread, do something for another if possible.  If
>it isn't possible, too bad.  If it is possible, there is a gain.
>
>Nobody says this is a "golden bullet" that fixes everything.  But it does
>make things go faster for the normal cases, just like a multiprogramming O/S
>does.  And just like the multiprogramming O/S, there are situations that will
>make it do _worse_ than if it just ran one process at a time, period.  But
>those are the _exception_ and not the rule, which is the point for HT.
>
>
>
>
>
>
>>
>>>I don't see how this point is getting overlooked.  If SMT won't work, then an
>>>O/S that runs more processes than CPUs won't work.  No, it might not help if
>>>_all_ processes are screaming along in L1 cache, but that isn't realistic for
>>>real-world applications.  They are going to have a non-trivial memory
>>>bandwidth requirement and SMT lets other threads fill in the gaps when memory
>>>reads are hanging.
>>
>>Please show me where I said HT "won't work." I said HT "won't scale." That's a
>>completely different issue. Shared-bus multiprocessing works, but it doesn't
>>scale. I'm saying that HT is subject to the same limitations.
>
>
>We will just have to agree to disagree.  A couple of years will answer
>the question as to who is on track here...
>
>
>
>
>>
>>>>effective_ipc = 3 / (3 / 1 + 6) = 0.33 ipc
>>>>num_ht_cpus = 3 / 0.33 = 9
>>>>
>>>>Making unrealistic assumptions that favor more HT CPUs only shows HT scaling to
>>>>9 CPUs before the threads start trampling over each other. Note that I also
>>>>unrealistically assumed that the CPU can execute all u-ops regardless of the
>>>>circumstances. I did assume that all cache accesses hit the L2, but accesses to
>>>>main memory are infrequent for general applications. This is a fair assumption
>>>>since Intel isn't marketing the P4 to people interested in chess; they're
>>>>marketing the P4 to many different people with many different applications.
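
(As an aside, here is that formula wrapped in a tiny C program, using the
numbers plugged in above -- a 6-cycle L2 hit and so on. The C wrapper is just
illustration; the variable names are the ones from the equations.)

#include <stdio.h>

int main(void)
{
    double max_ipc = 3.0;              /* peak u-ops/cycle on a P4              */
    double ave_ipc = 1.0;              /* assumed worst-case per thread         */
    double ave_ins_between_mem = 3.0;  /* u-ops between memory accesses         */
    double mem_latency = 6.0;          /* cycles, assuming every access hits L2 */

    double effective_ipc = ave_ins_between_mem /
                           (ave_ins_between_mem / ave_ipc + mem_latency);
    double num_ht_cpus = max_ipc / effective_ipc;

    printf("effective_ipc = %.2f\n", effective_ipc);  /* ~0.33 */
    printf("num_ht_cpus   = %.1f\n", num_ht_cpus);    /* ~9    */
    return 0;
}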
>>>
>>>And you are making the assumption that all the threads are not requiring
>>>memory.  If you run 9 threads, I'd bet that at any point in time less than
>>>1/2 of them are ready to actually fire off micro-ops into the CPU core.
>>>Most are going to be hung up waiting on the memory read unit to supply
>>>needed data into a register.
>>
>>That is most often not the case. If that were so, why doesn't HT see sweeping
>>gains instead of 30-40%? If the threads spent even half of their lives blocked
>>on memory, HT would see at least a 100% performance increase. I say "at least
>>100%" because HT also allows threads to concurrently execute to utilize more
>>execution bandwidth.
>>
>
>All you have to do is think about the question.  HT runs _two_ threads.  If
>_both_ are waiting on memory, nothing good will happen.  That is common.  On
>a multiprogramming O/S, if both threads do nothing but I/O there is no
>benefit.
>
>
>
>
>>>>I might also point out that the number of instructions between memory accesses
>>>>is -usually- a good bit higher than 3. Some routines can go 20+ instructions in
>>>>between accessing memory. It all depends on what you're computing.
>>>
>>>Maybe.  But then again, once you do that memory access you sit for several
>>>hundred clock cycles.  Giving _plenty_ of spare time for other threads to
>>>run.  Which is the point, as I said.
>>
>>Around 100 for full penalty on both of my machines. Pentium 4 on RDRAM suffers
>>much worse, partly because the multiplier is enormous. Most cache misses won't
>>even incur the full penalty on well-written software.
>>
>
>
>You are not going to get memory data back in 100 cycles on a 3ghz machine.
>You are not going to get it back in 300 cycles...
>
>This is from actual benchmark data, not from vendors' claims...
>
>
>>However, if threads are so often blocked, why does HT only see 30-40% speed
>>increase? As I recall, your figures showed only about 16% speed increase.
>>Granted you did not have the pause instruction, but you also say that you don't
>>lock very often.
>
>
>I answered that already.  Same problem with two I/O bound processes.
>
>>
>>>>Also, most threads get more than 1 u-op/cycle in the execution unit. This
>>>>starves other threads. If you recompute the above making the assumption that a
>>>>thread can fully utilize the CPU with an ave_ipc of 3.0, it makes HT less useful
>>>>(ideal number = 7.0 instead of 9.0). This is no surprise since HT is designed
>>>>around the central idea that an application cannot fully utilize 3 u-ops/cycle
>>>>execution bandwidth.
>>>>
>>>
>>>
>>>I won't disagree there, because their basic assumption is perfectly valid.
>>>Running nine small compute-bound tasks is not very realistic.  Running 20
>>>large applications banging on memory makes much more sense...
>>>
>>>And there HT will smoke.
>>>
>>>>As a final disclaimer, I know that this model used a lot of flawed numbers.
>>>>However, all but one favored proliferation of HT. The only figure that would
>>>>allow HT to scale efficiently to higher numbers is the latency of a memory
>>>>access. If this were significant at all for general applications, turning off
>>>>the cache wouldn't make such a huge difference since general applications would
>>>>be cache missing often already.
>>>
>>>I don't understand the statement.  Latency is _the_ issue today.  It will be
>>>2x worse when cpus go 2x faster tomorrow.  On a single memory read today,
>>>which takes on the order of 120ns absolute best case, a 3ghz processor will
>>>wait for 360 clocks.  And since the 2x integer units are double-clocked, we
>>>miss the opportunity to execute at _least_ 720 micro-ops per integer unit.
>>>That leaves a _lot_ of time to factor in other threads to eat up those idle
>>>clock cycles...
>>
>>120ns is worst case.
>
>No it is the _typical_ case.  Just run a real random-access memory pattern.
>You are not going to beat 120ns with _any_ DRAM-based memory system.
>
>
>>
>>My admission was that if applications incur memory latency -often-, it will make
>>HT scale well. My point was that this doesn't happen with general applications.
>
>
>
>My point is that it happens in _all_ applications.  Which hurts the scaling,
>since there is no way to speed up those parallel memory references; memory
>is a serial device.
>
>
>
>>
>>>>>There is _nothing_ inherently wrong with taking the process scheduler completely
>>>>>out of the O/S and dropping it into the CPU.  It makes perfect sense as the
>>>>>operating system context-switching time is so long that we can't afford to
>>>>>block processes for memory accesses, and have to block only for things that
>>>>>take far longer like I/O operations.  Inside the cpu this context switching
>>>>>time becomes nearly zero, and the gains are significant.
>>>>>
>>>>>It might take a while to get there, but you will see 16-way SMT one day.  Just
>>>>>as surely as you will one day see 4 cpus on a single chip...
>>>>
>>>>HT isn't about context-switching. It's about parallelism and concurrency.
>>>>Threads in HT are concurrent. HT has as much to do with process scheduling as
>>>>SMP does -- just about nil.
>>>
>>>
>>>Absolutely incorrect.  On a single cpu machine, the operating system gets to
>>>switch back and forth between two compute bound processes.  On a
>>>hyper-threaded cpu this does _not_ happen.  The cpu schedules _both_
>>>processes and interlaces them much more efficiently.
>>>
>>>That ought to be obvious if you have any O/S experience...
>>
>>In either case, the OS has a ready queue. On a single-CPU machine, it dequeues a
>>process and executes it. At the end of the timeslice, if the process is still
>>compute-bound, it requeues. It then dequeues the next, executes, and repeats the
>>cycle.
>>
>>In the case of HT, the OS again has its ready queue. It's going to dequeue 2
>>processes and schedule both. That sounds more complex if you ask me. It gets
>>worse since the scheduler now has to take into account (1) Locks to protect the
>>ready queue and (2) the HT factor since you don't want both to be compute-bound.
>>
>>In my implementation, I've avoided locks at all costs and used separate ready
>>queues for each CPU. I do load balancing after x quanta, where x is an
>>adjustable constant. This helps simplify my system. Still, I don't see how this
>>is easier than the case of a single-CPU scheduler which does not have to lock
>>and only has to pick 1 process for each quantum.
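
(A minimal sketch of that per-cpu ready-queue scheme, as I read the description
above: one private queue per logical cpu, round-robin within a queue, and a
rebalance pass every so many quanta. The names and constants are mine, not from
his implementation, and the cross-cpu handoff is only hinted at in a comment.)

#include <stddef.h>

#define NCPUS          2
#define BALANCE_QUANTA 64            /* the adjustable "x quanta" constant */

struct task {
    struct task *next;
    int          id;
};

struct cpu {
    struct task *head, *tail;        /* private ready queue for this logical cpu */
    int          nready;
    int          ticks;
};

static struct cpu cpus[NCPUS];

/* Put a task at the tail of a cpu's private queue. */
static void enqueue(struct cpu *c, struct task *t)
{
    t->next = NULL;
    if (c->tail)
        c->tail->next = t;
    else
        c->head = t;
    c->tail = t;
    c->nready++;
}

/* Take the task at the head of the queue, or NULL if it is empty. */
static struct task *dequeue(struct cpu *c)
{
    struct task *t = c->head;
    if (!t)
        return NULL;
    c->head = t->next;
    if (!c->head)
        c->tail = NULL;
    c->nready--;
    return t;
}

/* One timeslice on one logical cpu: pick the next task, requeue it (still
 * compute-bound), and every BALANCE_QUANTA ticks shed one task to a neighbour
 * with a shorter queue.  In a real kernel that cross-cpu handoff needs some
 * synchronization; the post does not say how it is done, so it is omitted. */
struct task *schedule_tick(int cpu_id)
{
    struct cpu *c = &cpus[cpu_id];
    struct task *t = dequeue(c);

    if (t)
        enqueue(c, t);               /* round-robin: back of the line */

    if (++c->ticks % BALANCE_QUANTA == 0) {
        struct cpu *victim = &cpus[(cpu_id + 1) % NCPUS];
        if (c->nready > victim->nready + 1) {
            struct task *moved = dequeue(c);
            if (moved)
                enqueue(victim, moved);
        }
    }
    return t;                        /* the task to run for this quantum */
}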
>>
>>Now, what is obvious to me is that without HT, the scheduler picks 1 process for
>>each quantum. With HT, just as with MP, the scheduler picks 2 processes for each
>>quantum. I really must disagree and say that HT makes the scheduler more
>>complex. I've never heard of an MP system being simpler than a non-MP system.
>>
>>>>You keep touting that the processor can do things while a thread blocks on
>>>>memory, but this is extremely rare. HT is really to take advantage of execution
>>>>units not normally utilized. While the memory aspect is appealing for a guy
>>>>designing a chess engine, Intel didn't add HT to appeal to a niche crowd. In the
>>>>general case, HT isn't going to do a lot for you.
>>>
>>>
>>>The _first_ implementation is not doing badly.  The _second_ generation will
>>>be better.  This is not a "new topic"; it has been discussed in computer
>>>architecture forums for 10 years now.  Only recently has the chip density
>>>reached the point where it became possible to actually do this.  It won't be
>>>too long before it is just as standard as the super-scalar architectures are
>>>today.  10 years ago many were saying "this won't work (super-scalar)."  It
>>>_does_ work, and it works well as the kinks are ironed out.  HT/SMT is
>>>nothing different.  And it will be just as normal in a few years on all
>>>processors as super-scalar is today.
>>
>>Yes, it was some 3 years ago when Intel began discussing implementation. I
>>thought it was a good idea then, and I still think it is a good idea. However, I
>>don't see HT scaling very far past 2 CPUs. You can add lots of logical CPUs to a
>>single core, but you won't gain anything over a smaller, simpler design with a
>>small number of logical CPUs. I can't see HT gaining anything significant
>>between 4 and 8 logical CPUs.
>>
>>-Matt


