Computer Chess Club Archives



Subject: Re: DIEP NUMA SMP at P4 3.06Ghz with Hyperthreading

Author: Robert Hyatt

Date: 07:00:23 12/17/02



On December 17, 2002 at 06:28:38, Matt Taylor wrote:

>On December 15, 2002 at 09:39:17, Robert Hyatt wrote:
>
>>On December 15, 2002 at 02:18:44, Matt Taylor wrote:
>>
>>>On December 14, 2002 at 13:38:25, Robert Hyatt wrote:
>>>
>>>>On December 14, 2002 at 02:01:20, Matt Taylor wrote:
>>>>
>>>>>On December 13, 2002 at 22:51:35, Robert Hyatt wrote:
>>>>>
>>>>>>On December 13, 2002 at 21:55:09, Matt Taylor wrote:
>>>>>>
>>>>>>><snip>
>>>>>>>>>the problem is you lose time to the ECC and registered features of the
>>>>>>>>>memory you need for the dual. of course that's the case for all duals.
>>>>>>>>>both K7 MP and Xeon suffer from that regrettably.
>>>>>>>>
>>>>>>>>That is not true.  The duals do _not_ have to have ECC ram.  And it doesn't
>>>>>>>>appear to be any slower than non-ECC ram although I will be able to test
>>>>>>>>that before long as we have some non-ECC machines coming in.
>>>>>>>
>>>>>>>Actually he is correct about the registered ram. Registered modules buffer
>>>>>>>the address and control lines through a register, which adds a cycle of
>>>>>>>delay compared to unregistered ram. This is important for stability. It
>>>>>>>doesn't affect bandwidth, but it does affect latency.
>>>>>>>
>>>>>>><snip>
>>>>>>>>>With some luck by the time they release a 3.06Ghz Xeon they have improved
>>>>>>>>>the SMT another bit.
>>>>>>>>>
>>>>>>>>>Seems to me they have been working for years to slowly get that SMT/HT
>>>>>>>>>working better.
>>>>>>>>
>>>>>>>>Not "for years".  It was announced as a coming thing a couple of years ago
>>>>>>>>and several vendors have been discussing the idea.  And they are going to
>>>>>>>>increase the ratio of logical to physical cpus before long also...
>>>>>>>
>>>>>>>I don't think so. HT won't scale terribly well. I made another post about that,
>>>>>>>and I won't reiterate what I said there.
>>>>>>>
>>>>>>>-Matt
>>>>>>
>>>>>>
>>>>>>I don't see why the two of you make such sweeping generalizations.  What is
>>>>>>to prevent modifying the L1 cache to spit out 256 bits of data at once?  There
>>>>>>is nothing internally that can't be improved over time, and the idea of a
>>>>>>4-way hyper-threaded cpu should eventually be just as effective as four
>>>>>>completely separate cpus, although the price should be substantially lower.
>>>>>
>>>>>It's not really a "sweeping generalization." I was quite specific. I said HT
>>>>>won't scale very well.
>>>>
>>>>And that _is_ a sweeping generalization.  HT can potentially scale just as
>>>>well as adding additional CPUs can scale, and probably better because the
>>>>logical cpus are so tightly coupled that there is a savings in hardware
>>>>space/cost that makes them very attractive.
>>>>
>>>>
>>>>>
>>>>>Currently, cost will prohibit the scaling of HT. You can argue that R&D will
>>>>>eventually make it cost-effective, but that doesn't mean it will happen
>>>>>tomorrow. Also, complexity of sharing the cache among threads, the size of the
>>>>>cache, and the width of the cache all contribute to cost and overall chip
>>>>>complexity. Internally, IA-32 processors are only 64-bits wide. This is easy to
>>>>>verify. Throughput for single-precision floating-point SSE operations is only
>>>>>twice the throughput for the SISD FPU.
>>>>
>>>>The cache isn't really shared among threads.  So far as the cache is
>>>>concerned there are no threads, just data to feed to the cpu...  Also,
>>>>nothing says that IA32 will remain on a 64 bit internal processor bus.  Just
>>>>look at the evolution from the first 32 bit processor to today.  And those
>>>>changes are transparent to the outside world so that they can be done by
>>>>Intel when they choose without affecting existing code.  Ditto for making
>>>>the L1 trace datapath wider to dump more micro-ops at once.
>>>
>>>thread 1 accesses memory at 1 MB
>>>thread 2 accesses memory (concurrently) at 2 MB
>>>
>>>There are now 2 cache lines, and each thread has only made one access.
>>>How is this not sharing the cache?
>>
>>We were talking about the "trace cache".  Stuff there is intermingled between
>>the two threads but it doesn't look like two threads at all, just one big thread
>>with a lot of micro-ops.
>>
>>The L1 data and L2 unified caches certainly contain stuff for both...
>>
>>>
>>>Also, it is possible to widen the bus, but it is not practical. A wider internal
>>>bus means more silicon means more complexity means more difficult to build in
>>>small space. I will not say it is impossible, but at this time it is
>>>impractical.
>>
>>If you do two caches you have the _same_ problem.  You have to have a data
>>path from each...
>>
>>>
>>>Consider: if a 128-bit bus were practical at this point in time, why would Intel
>>>push SSE so hard when the latest greatest Pentium 4 only has a 64-bit internal
>>>bus? It means that, at best, SSE instructions are only twice as fast as existing
>>>scalar instructions. It also means that 3DNow! is just as fast as SSE.
>>
>>It is already practical.  As is a 256 bit bus.  See alpha.
>
>Practical for Alpha because it was designed with that in mind. Not practical for
>Intel because die real-estate is already costly, and at this point it would also
>require significant modification of the Pentium 4 core itself.
>
>According to Eugene, the point of HT is that they can extract more performance
>without impeding profit margins.
>
>I think it would be a good idea, but since Intel hasn't already done it, I
>wouldn't expect them to do it in the future.

I'd almost guarantee they will do it in the future.  Fortunately they use
a silicon compiler to lay this stuff out and let it deal with the details of
what goes where to minimize crossings, etc.  They won't likely ignore a
bottleneck that opens up even more performance without having to completely
re-design the CPU core...  Widening an interstate is _far_ cheaper than doing
something different like a light rail system or a subway.


>
>>>>>My comment, however, was based on the fact that HT shares the execution units
>>>>>among threads. Neglecting memory accesses briefly, the worst-case performance on
>>>>>a P4 is 33% of capacity (100% capacity = 3 u-ops/cycle). This means that at most
>>>>>3 concurrent HT threads will max out the CPU. Of course, HT should actually
>>>>>scale a little higher since the threads must all pause to access memory
>>>>>occasionally. The question becomes, "How often do threads access memory?" A
>>>>>single thread accesses memory quite often, but most accesses are cache hits.
>>>>>
>>>>
>>>>Your math is _way_ wrong there.  If I do a memory read in one thread, the CPU
>>>>is going to sit idle for hundreds of clocks.  There is _plenty_ of room to
>>>>interlace other threads, particularly when most threads do memory reads here
>>>>and there.
>>>>
>>>>Your statement is identical with saying that a modern processor should never
>>>>be asked to run more than one process at a time because it has only one cpu
>>>>to use.  Yet 99.9% of those processes don't need the CPU for long periods of
>>>>time...
>>>
>>>I'm not claiming it should never run more than one process. I'm just saying that
>>>HT is going to hit diminishing returns very quickly. The first HT CPU only gets
>>>30-40% added performance off of a CPU without HT. Adding a third logical CPU
>>>will only be able to consume cycles when both logical HT CPUs are waiting on
>>>memory. Adding a fourth only consumes the cycles when all 3 HT CPUs are waiting.
>>>Etc.
>>
>>
>>Yes, but with memory speeds, things are waiting _all_ the time.  Just like
>>processes trying to smoke a disk spend 99.99% of their time waiting.
>>
>>>Also, I mentioned in another post that it does not take hundreds of cycles to
>>>cache miss. It depends on the multiplier, bus latency of the RAS and CAS, and
>>>locality of reference (whether or not RAS latency must be incurred). For my
>>>Thunderbird 1.2 GHz (multiplier of 9), this is 90 clocks at most. For my
>>>AthlonMP 1600 CPUs, neglecting the SMP factor, this is 105 clocks at most.
>>
>>All you have to do is run a couple of programs, and use the MSR stuff to
>>time things.  You will see that most every cache miss _does_ take 300-400
>>clock cycles, because of the way the addressing is done.  You have a very small
>>window of time to pick up on that attempt at streaming data out of a RDRAM-type
>>chip.  If you miss it, you start over.  It works fine for programs stomping
>>serially thru memory.  But that is not a high-probability event.
>
>If you use RDRAM, your latency will obviously be higher. The window isn't quite
>as bad since RDRAM runs 400-533 MHz, though.

I've never bought a machine with RDRAM.  I tested a couple, but except for the
huge-array codes with a big memory bandwidth requirement, everything else
looked somewhat worse.


>
>>>The one factor I didn't even consider with HT is that two processes may actually
>>>impede performance if not scheduled carefully. If process X and Y are doing 2
>>>different things, they will cache thrash and cause significantly more memory
>>>accesses to incur the full penalty.
>>
>>Certainly possible.  However, with a threaded application this is not a
>>problem, and that is where this seems to be at its best.  But it also works
>>pretty well for separate applications of the right mix.
>>
>>Of course, running 4 compute-bound processes is not exactly efficient on today's
>>machines anyway, due to context switching and cache invalidation...  but it
>>still works efficiently enough to make everyone do it.
>
>For two...but if four were vying for the small number of shared resources, I
>would expect more cache thrashing. Eight will split that 8K L1 data cache among
>eight threads. Sure, some will be shared data, but a -lot- of the data in L1 is
>stack. If you start thrashing stack, your threads are going to be severely
>penalized.

Assuming you use a lot of stack space, of course.  Plenty of programs don't.

And L2 certainly helps minimize the effect of the small L1 data cache.


>
>>>>>Here are some equations for computation of optimum number of HT CPUs:
>>>>>effective_ipc = ave_ins_between_mem / (ave_ins_between_mem / ave_ipc +
>>>>>mem_latency)
>>>>>num_ht_cpus = max_ipc / effective_ipc
>>>>>
>>>>>Now, some P4/P4 Xeon numbers:
>>>>>max_ipc = 3 u-ops/cycle
>>>>>ave_ipc = 1 u-op/cycle (assuming worst-case)
>>>>>ave_ins_between_mem = 3 u-ops (also lower than actual)
>>>>>mem_latency = 2 cycles (for L1 cache)
>>>>
>>>>
>>>>L1 cache isn't _the_ issue.  Real memory is.  And it is getting worse as 1gb
>>>>is the normal memory size now.  That is a 2048 to 1 mapping to L2 cache, and
>>>>far worse to L1.  Since real programs access memory frequently, HT will
>>>>_always_ work just as multiprogramming works on a single-cpu machine, because
>>>>most threads are doing something that takes a long time to complete (I/O to
>>>>disk or whatever for processes in an O/S, memory reads for a thread in the
>>>>CPU).
>>>
>>>Real programs access the stack frequently. Only a small number of programs cache
>>>thrash. Again, Intel isn't adding HT because of niche markets; they're adding HT
>>>for the general market.
>>
>>_your_ "real program" might access the stack a lot.  The "real programs" I
>>run do not.  So we are in different application worlds, which is ok.
>
>Every function call you make is a stack access. Every local variable you use is
>a stack access. Some gets stuck in registers, but to use more than 3 registers,
>you incur more stack accesses. Inevitably, the register-starved x86 architecture
>is always going back to the stack.

What if your programs don't have much local data, just a couple of loop
counters?  This is pretty typical for heavy-duty applications, as you can't
put 1000 x 1000 x 50 arrays on the stack.  Nor would you want to, generally.
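
Something like this is what I have in mind; a rough sketch only, using the
1000 x 1000 x 50 figure from above, with made-up names and values:

  #include <stdio.h>
  #include <stdlib.h>

  /* Illustrative only: the big array lives on the heap (it could just as well
     be static/global); the only locals are a loop counter and an accumulator,
     so the per-call stack footprint is tiny. */
  #define NX 1000
  #define NY 1000
  #define NZ 50

  int main(void)
  {
      double *grid = malloc((size_t)NX * NY * NZ * sizeof(double));
      long i;
      double sum = 0.0;

      if (grid == NULL)
          return 1;
      for (i = 0; i < (long)NX * NY * NZ; i++)
          grid[i] = 1.0;
      for (i = 0; i < (long)NX * NY * NZ; i++)
          sum += grid[i];
      printf("sum = %.0f\n", sum);
      free(grid);
      return 0;
  }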



>
>>>Also, I might add that the I/O burst cycle in the OS is much more skewed than
>>>memory accesses at the assembly level. A single I/O instruction can take
>>>thousands of clocks or more. Additional OS overhead makes it worse. Computation
>>>in between each I/O isn't usually very long time-wise in comparison to
>>>computation between memory accesses.
>>
>>They are within an order of magnitude.  Good disks today have 3.5ms delays.
>>Good controllers with a big ram cache drop this by 50%.  But the idea is the
>>same.  While waiting on one thread, do something for another if possible.  If
>>it isn't possible, too bad.  If it is possible, there is a gain.
>>
>>Nobody says this is a "golden bullet" that fixes everything.  But it does
>>make things go faster for the normal cases, just like a multiprogramming O/S
>>does.  And just like the multiprogramming O/S, there are situations that will
>>make it do _worse_ than if it just ran one process at a time, period.  But
>>those are the _exception_ and not the rule, which is the point for HT.
>>
>>>>I don't see how this point is getting overlooked.  If SMT won't work, then an
>>>>O/S that runs more processes than CPUs won't work.  No, it might not help if
>>>>_all_ processes are screaming along in L1 cache, but that isn't realistic for
>>>>real-world applications.  They are going to have a non-trivial memory
>>>>bandwidth requirement and SMT lets other threads fill in the gaps when memory
>>>>reads are hanging.
>>>
>>>Please show me where I said HT "won't work." I said HT "won't scale." That's a
>>>completely different issue. Shared-bus multiprocessing works, but it doesn't
>>>scale. I'm saying that HT is subject to the same limitations.
>>
>>
>>We will just have to agree to disagree.  A couple of years will answer
>>the question as to who is on track here...
>
>I suppose. Perhaps within a couple of years, AMD will implement it as well.

I'd bet on it, myself...


>
>>>>>effective_ipc = 3 / (3 / 1 + 6) = 0.33 ipc
>>>>>num_ht_cpus = 3 / 0.33 = 9
>>>>>
>>>>>Making unrealistic assumptions that favor more HT CPUs only shows HT scaling to
>>>>>9 CPUs before the threads start trampling over each other. Note that I also
>>>>>unrealistically assumed that the CPU can execute all u-ops regardless of the
>>>>>circumstances. I did assume that all cache accesses hit the L2, but accesses to
>>>>>main memory are infrequent for general applications. This is a fair assumption
>>>>>since Intel isn't marketing the P4 to people interested in chess; they're
>>>>>marketing the P4 to many different people with many different applications.
>>>>
>>>>And you are making the assumption that all the threads are not requiring
>>>>memory.  If you run 9 threads, I'd bet that at any point in time less than
>>>>1/2 of them are ready to actually fire off micro-ops into the CPU core.  Most
>>>>are going to be hung up waiting on the memory read unit to supply needed data
>>>>into a register.
>>>
>>>That is most often not the case. If that were so, why doesn't HT see sweeping
>>>gains instead of 30-40%? If the threads spent even half of their lives blocked
>>>on memory, HT would see at least a 100% performance increase. I say "at least
>>>100%" because HT also allows threads to concurrently execute to utilize more
>>>execution bandwidth.
>>>
>>
>>All you have to do is think about the question.  HT runs _two_ threads.  If
>>_both_ are waiting on memory, nothing good will happen.  That is common.  On
>>a multiprogramming O/S, if both threads do nothing but I/O there is no
>>benefit.
>
>If HT runs 2 threads and both stall on memory frequently, adding more threads
>means waiting on memory will take even longer, and the additional threads won't
>give a higher overall IPC.

That is correct.  Just as running 10 I/O bound processes is not efficient on
any O/S around either.  To make multiprogramming O/S systems work, you need a
_mix_ of I/O and compute-bound processes.  To make SMT work you need a mix of
computation and memory accessing...

>
>If, in the other extreme case, threads stall on memory rarely, slightly higher
>IPC can be extracted up to a limit (because not 100% of execution bandwidth can
>be utilized per cycle by a single thread).

I don't disagree...

As I have reported, the first test I ran showed Crafty running 30+ percent
faster using HT.  Eugene reported that two copies of the EGTB compressor
program ran _two_ times faster.  They probably have the optimal mix of a
significant number of micro-ops between memory references...
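
For what it's worth, here is the little model quoted above coded up directly;
a sketch only, with variable names of my own choosing, and with the inputs set
to the values that reproduce the 0.33 ipc and 9-thread figures:

  #include <stdio.h>

  int main(void)
  {
      double max_ipc             = 3.0;  /* peak micro-ops per cycle           */
      double ave_ipc             = 1.0;  /* what one thread actually sustains  */
      double ave_ins_between_mem = 3.0;  /* micro-ops between memory accesses  */
      double mem_latency         = 6.0;  /* stall cycles per access, per above */

      double effective_ipc = ave_ins_between_mem /
                             (ave_ins_between_mem / ave_ipc + mem_latency);
      double num_ht_cpus   = max_ipc / effective_ipc;

      printf("effective ipc per thread: %.2f\n", effective_ipc);   /* 0.33 */
      printf("threads to fill the core: %.1f\n", num_ht_cpus);     /* 9.0  */
      return 0;
  }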


>
>In either case, I don't see how what I've said (that HT will scale poorly) is
>inaccurate.

Simple question:  does an operating system do significantly worse with four
processes than with two?  If not, then why will HT be different?  Yes, there
are _obviously_ cases where more than one thread is not going to be more
efficient, just as it is true in a modern O/S.  But for the _typical_ case
going to 4, 8 or 16 will offer some gains.  Will 8 be 2x better than 4?
Doubtful since 8 cpus are generally not 2x better than 4.  But the gain is
there, and the cost is minimal, which is what makes it work.




>
>Obviously the actual trend is skewed but near the middle since Intel estimates
>30-40% speed-up. Since neither of the cases above hold completely true, HT might
>scale higher than 2 logical CPUs. Maybe to 4. My opinion is that they won't hit
>8 or beyond. I don't think Intel will see enough potential gain from 4 to
>implement it. Four would be more awkward for the Pentium 4, anyway; it can only
>execute up to 3 u-ops/cycle.
>

I think that over time it will scale higher, because that only requires a wider
path from cache to supply more micro-ops per cycle, and more memory (cache)
reads per cycle.  And then more functional units (pipes) to eat those
micro-ops.

And probably eventually we will end up with multiple true CPUs on a single chip
from them as well, as at some point that will be simpler than scaling SMT to the
next level.  The tendency is that more stuff in parallel causes more synch
problems and more complex logic, and at some point you have to back off of that
a bit...




>>>>>I might also point out that the number of instructions between memory accesses
>>>>>is -usually- a good bit higher than 3. Some routines can go 20+ instructions in
>>>>>between accessing memory. It all depends on what you're computing.
>>>>
>>>>Maybe.  But then again, once you do that memory access you sit for several
>>>>hundred clock cycles.  Giving _plenty_ of spare time for other threads to
>>>>run.  Which is the point, as I said.
>>>
>>>Around 100 for full penalty on both of my machines. Pentium 4 on RDRAM suffers
>>>much worse, partly because the multiplier is enormous. Most cache misses won't
>>>even incur the full penalty on well-written software.
>>>
>>
>>
>>You are not going to get memory data back in 100 cycles on a 3ghz machine.
>>You are not going to get it back in 300 cycles...
>>
>>This is from actual benchmark data, not from vendors' claims...
>
>Something like 80-270 cycles on a 3 GHz machine according to extrapolation from
>data I measured earlier tonight which correlates closely to the timing figure
>you posted (135 ns). The 270 cycles is assuming the worst case. If you don't
>change row, you incur 80 cycles, much lower. Not great, but not as terrible as
>300.

All I can say is that I have _never_ gotten latency below 100ns.  Not clocks,
nanoseconds.  Every machine I have tested for the past 20+ years (based on DRAM
memory of course, not machines with bi-polar or SRAM type memories like older
Crays) has dropped data into the cpu with 100-130ns delays.  This goes from
the alpha, to IBM SPs to Intel, to the latest Cray supercomputers, right on down
the line.  Regardless of what vendors say, when I run a random-access memory
test, the _same_ number comes up whether it is a $2,000 micro or a $60,000,000
supercomputer.  DRAM has a horrible latency penalty caused by trying to dump
electrical charges that have to overcome resistance, capacitance and inductance
to get to the final destination.
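
For anyone who wants to try it, a bare-bones sketch of the kind of
random-access test I mean is below; it chases a shuffled pointer chain through
a buffer much bigger than cache, so every step pays the full latency.  The
buffer size, the use of clock(), and the shuffle details are only illustrative:

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N     (64 * 1024 * 1024 / sizeof(size_t))   /* ~64 MB of pointers */
  #define STEPS 10000000

  int main(void)
  {
      size_t *next = malloc(N * sizeof(size_t));
      size_t i, j, tmp, p;

      if (next == NULL)
          return 1;
      for (i = 0; i < N; i++)
          next[i] = i;
      /* Sattolo's shuffle: produces one big cycle, so the chase below cannot
         get stuck in a short loop that fits in cache.  rand() is assumed to
         have enough range here; good enough for a rough measurement. */
      for (i = N - 1; i > 0; i--) {
          j = (((size_t)rand() << 15) | (size_t)rand()) % i;
          tmp = next[i];
          next[i] = next[j];
          next[j] = tmp;
      }

      clock_t t0 = clock();
      for (p = 0, i = 0; i < STEPS; i++)
          p = next[p];              /* each load depends on the previous one */
      clock_t t1 = clock();

      printf("%.1f ns per random access (p=%zu)\n",
             1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)STEPS, p);
      free(next);
      return 0;
  }

This is the sort of measurement behind the 100-130ns numbers above.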

I've given numbers before...

Back to 1975:

  Cray-1    12.5ns clock     8 clocks to read a random memory address = 100ns
  Cray XMP   8.1ns clock    12 clocks
  Cray YMP   6.1ns clock    16 clocks
  C90        4.1ns clock    25 clocks
  T90        2ns clock      50 clocks
  Cray 2     4ns clock      40 clocks
  Cray 3     1ns clock     100 clocks

That 100ns barrier using DRAM has been around since the first DRAM chip rolled
out, and it hasn't dropped a bit that I can detect.  Memory seems faster, yes:
cache, burst-mode transfers, pre-fetch within the memory system, wider data
paths.  But those only help sequential reads...

>
>>>However, if threads are so often blocked, why does HT only see 30-40% speed
>>>increase? As I recall, your figures showed only about 16% speed increase.
>>>Granted you did not have the pause instruction, but you also say that you don't
>>>lock very often.
>>
>>I answered that already.  Same problem with two I/O bound processes.
>>
>>>>>Also, most threads get more than 1 u-op/cycle in the execution unit. This
>>>>>starves other threads. If you recompute the above making the assumption that a
>>>>>thread can fully utilize the CPU with an ave_ipc of 3.0, it makes HT less useful
>>>>>(ideal number = 7.0 instead of 9.0). This is no surprise since HT is designed
>>>>>around the central idea that an application cannot fully utilize 3 u-ops/cycle
>>>>>execution bandwidth.
>>>>>
>>>>
>>>>
>>>>I won't disagree there, because their basic assumption is perfectly valid.
>>>>Running nine small compute-bound tasks is not very realistic.  Running 20
>>>>large applications banging on memory makes much more sense...
>>>>
>>>>And there HT will smoke.
>>>>
>>>>>As a final disclaimer, I know that this model used a lot of flawed numbers.
>>>>>However, all but one favored proliferation of HT. The only figure that would
>>>>>allow HT to scale efficiently to higher numbers is the latency of a memory
>>>>>access. If this were significant at all for general applications, turning off
>>>>>the cache wouldn't make such a huge difference since general applications would
>>>>>be cache missing often already.
>>>>
>>>>I don't understand the statement.  Latency is _the_ issue today.  It will be
>>>>2x worse when cpus go 2x faster tomorrow.  On a single memory read today,
>>>>which takes on the order of 120ns absolute best case, a 3ghz processor will
>>>>wait for 360 clocks.  And since the 2x integer units are double-clocked, we
>>>>miss the opportunity to execute at _least_ 720 micro-ops per integer unit.
>>>>That leaves a _lot_ of time to factor in other threads to eat up those idle
>>>>clock cycles...
>>>
>>>120ns is worst case.
>>
>>No it is the _typical_ case.  Just run a real random-access memory pattern.
>>You are not going to beat 120ns with _any_ DRAM-based memory system.
>
>Random-access for me was 138 ns. Serial access is 37 ns.
>
>>>My admission was that if applications incur memory latency -often-, it will make
>>>HT scale well. My point was that this doesn't happen with general applications.
>>
>>My point is that it happens in _all_ applications.  Which hurts the scaling,
>>since there is no way to speed up those parallel memory references since memory
>>is a serial device.
>
>For one, any sort of vector manipulations take time, fit snugly in a P4 cache
>line, and are usually accessed serially allowing prefetching. Video/audio data
>can almost -always- be accessed serially. Databases are practically RDRAM's
>flagship. Most compute-bound applications I can think of off the top of my head
>don't suffer from repeated random memory probes. Serial memory probes aren't
>nearly as slow since you don't incur the full penalty, can prefetch, and get the
>benefit of the cache.
>
>-Matt


I don't disagree.  There are _always_ examples of algorithms that do that.
And for algorithms that use too much memory at once, there are "blocked
versions" that work on small parts of the array repetitively before moving
on to the next section, to take advantage of cache re-use.  And those are
the algorithms that _smoke_ on a super-computer, as a Cray can read four
words and write two words (64 bit words) _every_ single clock cycle, for
memory reference sequences that have some sort of pattern to them.
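
As a concrete (if generic) sketch of the blocking idea, here is a tiled matrix
multiply; the sizes and the block factor are made up, and this is not code from
Crafty or any other real program:

  #include <stdio.h>
  #include <stdlib.h>

  #define N 1024   /* matrix dimension, illustrative                      */
  #define B 64     /* tile edge, chosen so a few B x B tiles fit in cache */

  /* c = a * b, done one B x B tile at a time so each tile is re-used from
     cache many times before moving on to the next section. */
  static void matmul_blocked(const double *a, const double *b, double *c)
  {
      int i0, j0, k0, i, j, k;

      for (i = 0; i < N * N; i++)
          c[i] = 0.0;
      for (i0 = 0; i0 < N; i0 += B)
          for (k0 = 0; k0 < N; k0 += B)
              for (j0 = 0; j0 < N; j0 += B)
                  for (i = i0; i < i0 + B; i++)
                      for (k = k0; k < k0 + B; k++) {
                          double aik = a[i * N + k];
                          for (j = j0; j < j0 + B; j++)
                              c[i * N + j] += aik * b[k * N + j];
                      }
  }

  int main(void)
  {
      double *a = malloc(N * N * sizeof(double));
      double *b = malloc(N * N * sizeof(double));
      double *c = malloc(N * N * sizeof(double));
      int i;

      if (!a || !b || !c)
          return 1;
      for (i = 0; i < N * N; i++) {
          a[i] = 1.0;
          b[i] = 2.0;
      }
      matmul_blocked(a, b, c);
      printf("c[0] = %.1f\n", c[0]);   /* 1.0 * 2.0 * 1024 = 2048.0 */
      free(a); free(b); free(c);
      return 0;
  }

The only point is that each tile gets used many times before it is evicted,
which is exactly the cache re-use the blocked algorithms above depend on.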

Back to chess, of course, and this goes away.  Harry Nelson and I spent years
developing algorithms for chess that would vectorize.  They are neither obvious
nor simple, in many cases.  But they would actually be somewhat beneficial on
today's micros due to the way architects are using smoke and mirrors to hide the
true memory latency problems...



