Computer Chess Club Archives



Subject: Re: DIEP NUMA SMP at P4 3.06Ghz with Hyperthreading

Author: Robert Hyatt

Date: 10:38:25 12/14/02



On December 14, 2002 at 02:01:20, Matt Taylor wrote:

>On December 13, 2002 at 22:51:35, Robert Hyatt wrote:
>
>>On December 13, 2002 at 21:55:09, Matt Taylor wrote:
>>
>>><snip>
>>>>>the problem is you lose time to the ECC and registered features of the
>>>>>memory you need for the dual. of course that's the case for all duals.
>>>>>both K7 MP and Xeon suffer from that regrettably.
>>>>
>>>>That is not true.  The duals do _not_ have to have ECC ram.  And it doesn't
>>>>appear to be any slower than non-ECC ram, although I will be able to test
>>>>that before long as we have some non-ECC machines coming in.
>>>
>>>Actually he is correct about the registered ram. The "registered" feature is
>>>that it delays longer than unregistered ram. This is important for stability. It
>>>doesn't affect bandwidth, but it does affect latency.
>>>
>>><snip>
>>>>>With some luck by the time they release a 3.06Ghz Xeon they have improved
>>>>>the SMT another bit.
>>>>>
>>>>>Seems to me they have been working for years to get that SMT/HT slowly working better.
>>>>
>>>>Not "for years".  It was announced as a coming thing a couple of years ago and
>>>>several
>>>>vendors have been discussing the idea.  And they are going to increase the ratio
>>>>of physical
>>>>to logical cpus before long also...
>>>
>>>I don't think so. HT won't scale terribly well. I made another post about that,
>>>and I won't reiterate what I said there.
>>>
>>>-Matt
>>
>>
>>I don't see why the two of you make such sweeping generalizations.  What is
>>to prevent modifying the L1 cache to spit out 256 bits of data at once?  There
>>is nothing internally that can't be improved over time, and the idea of a
>>4-way hyper-threaded cpu should eventually be just as effective as four
>>completely separate cpus, although the price should be substantially lower.
>
>It's not really a "sweeping generalization." I was quite specific. I said HT
>won't scale very well.

And that _is_ a sweeping generalization.  HT can potentially scale just as well
as adding additional CPUs can scale, and probably better, because the logical
cpus are so tightly coupled that there is a savings in hardware space/cost that
makes them very attractive.


>
>Currently, cost will prohibit the scaling of HT. You can argue that R&D will
>eventually make it cost-effective, but that doesn't mean it will happen
>tomorrow. Also, complexity of sharing the cache among threads, the size of the
>cache, and the width of the cache all contribute to cost and overall chip
>complexity. Internally, IA-32 processors are only 64-bits wide. This is easy to
>verify. Throughput for single-precision floating-point SSE operations is only
>twice the throughput for the SISD FPU.

The cache isn't really shared among threads.  So far as the cache is concerned
there are no threads, just data to feed to the cpu...  Also, nothing says that
IA32 will remain on a 64 bit internal processor bus.  Just look at the
evolution from the first 32 bit processor to today.  And those changes are
transparent to the outside world, so they can be made by Intel whenever it
chooses without affecting existing code.  Ditto for making the L1 trace-cache
datapath wider to dump more micro-ops at once.





>
>My comment, however, was based on the fact that HT shares the execution units
>among threads. Neglecting memory accesses briefly, the worst-case performance on
>a P4 is 33% of capacity (100% capacity = 3 u-ops/cycle). This means that at most
>3 concurrent HT threads will max out the CPU. Of course, HT should actually
>scale a little higher since the threads must all pause to access memory
>occasionally. The question becomes, "How often do threads access memory?" A
>single thread accesses memory quite often, but most accesses are cache hits.
>

Your math is _way_ wrong there.  If I do a memory read in one thread, the CPU
is going to sit idle for hundreds of clocks.  There is _plenty_ of room to
interlace other threads, particularly when most threads do memory reads here
and there.

Your statement is identical to saying that a modern processor should never be
asked to run more than one process at a time because it has only one cpu to
use.  Yet 99.9% of those processes don't need the CPU for long periods of
time...
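
To put rough numbers on that interlacing argument, here is a quick
back-of-the-envelope sketch in C.  The figures (100 instructions between
misses, 1 IPC, a 300-cycle stall) are assumed round numbers for illustration,
not measured P4 data:

  /* How busy is one thread that stalls on memory, and how many interleaved
   * threads would it take to cover the stall?  All numbers are assumptions. */
  #include <stdio.h>

  int main(void) {
      double ins_between_misses = 100.0;  /* assumed instructions per miss    */
      double ipc                = 1.0;    /* assumed issue rate while running */
      double stall_cycles       = 300.0;  /* assumed main-memory stall        */

      double busy   = ins_between_misses / ipc;
      double util   = busy / (busy + stall_cycles);   /* one thread alone   */
      double needed = (busy + stall_cycles) / busy;   /* to hide the stall  */

      printf("single-thread utilization: %.0f%%\n", util * 100.0);
      printf("threads needed to hide the stall: ~%.1f\n", needed);
      return 0;
  }

With those assumptions a lone thread keeps the core only 25% busy, and about
four interleaved threads would fill the gap.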






>Here are some equations for computation of optimum number of HT CPUs:
>effective_ipc = ave_ins_between_mem / (ave_ins_between_mem / ave_ipc + mem_latency)
>num_ht_cpus = max_ipc / effective_ipc
>
>Now, some P4/P4 Xeon numbers:
>max_ipc = 3 u-ops/cycle
>ave_ipc = 1 u-op/cycle (assuming worst-case)
>ave_ins_between_mem = 3 u-ops (also lower than actual)
>mem_latency = 6 cycles (for L2 cache)


L1 cache isn't _the_ issue.  Real memory is.  And it is getting worse as 1 GB
is the normal memory size now.  That is a 2048 to 1 mapping to L2 cache, and
far worse to L1.  Since real programs access memory frequently, HT will
_always_ work, just as multiprogramming works on a single-cpu machine, because
most threads are doing something that takes a long time to complete (I/O to
disk or whatever for processes in an O/S, memory reads for a thread in the
CPU).

I don't see how this point is getting overlooked.  If SMT won't work, then an
O/S that runs more processes than CPUs won't work.  No, it might not help if
_all_ processes are screaming along in L1 cache, but that isn't realistic for
real-world applications.  They are going to have a non-trivial memory
bandwidth requirement, and SMT lets other threads fill in the gaps when memory
reads are hanging.
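
For what it's worth, the 2048-to-1 figure follows directly if you assume the
512 KB L2 and 8 KB L1 data cache of the current 3.06 GHz P4 parts (those cache
sizes are assumptions here, not stated in the post):

  /* Quick arithmetic behind the "2048 to 1" figure. */
  #include <stdio.h>

  int main(void) {
      long long mem = 1LL << 30;   /* 1 GB main memory       */
      long long l2  = 512 * 1024;  /* assumed 512 KB L2      */
      long long l1  = 8 * 1024;    /* assumed 8 KB L1 data   */

      printf("memory : L2 = %lld : 1\n", mem / l2);   /* 2048 : 1   */
      printf("memory : L1 = %lld : 1\n", mem / l1);   /* 131072 : 1 */
      return 0;
  }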



>
>effective_ipc = 3 / (3 / 1 + 6) = 0.33 ipc
>num_ht_cpus = 3 / 0.33 = 9
>
>Making unrealistic assumptions that favor more HT CPUs only shows HT scaling to
>9 CPUs before the threads start trampling over each other. Note that I also
>unrealistically assumed that the CPU can execute all u-ops regardless of the
>circumstances. I did assume that all cache accesses hit the L2, but accesses to
>main memory are infrequent for general applications. This is a fair assumption
>since Intel isn't marketing the P4 to people interested in chess; they're
>marketing the P4 to many different people with many different applications.

And you are making the assumption that none of the threads are touching memory.
If you run 9 threads, I'd bet that at any point in time fewer than half of them
are ready to actually fire off micro-ops into the CPU core.  Most are going to
be hung up waiting on the memory read unit to supply needed data into a
register.
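
Plugging the quoted formula into a little C shows how sensitive it is to the
latency term.  The first case reproduces the L2-hit numbers from the post
above (0.33 IPC, 9 threads); the second uses an assumed 300-cycle main-memory
miss penalty, a round number chosen only for illustration:

  /* Sketch of the quoted scaling formula under two latency assumptions. */
  #include <stdio.h>

  static double effective_ipc(double ins_between_mem, double ave_ipc,
                              double mem_latency) {
      return ins_between_mem / (ins_between_mem / ave_ipc + mem_latency);
  }

  int main(void) {
      const double max_ipc = 3.0;                   /* P4 issue width, u-ops  */

      double l2  = effective_ipc(3.0, 1.0, 6.0);    /* L2 hit, as quoted      */
      double mem = effective_ipc(3.0, 1.0, 300.0);  /* assumed memory miss    */

      printf("L2-hit case:  effective IPC %.2f, threads %.1f\n", l2,  max_ipc / l2);
      printf("memory case:  effective IPC %.4f, threads %.0f\n", mem, max_ipc / mem);
      return 0;
  }

Once real memory latency enters the picture, the same formula supports far
more than 9 threads before they start fighting over execution slots.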




>
>I might also point out that the number of instructions between memory accesses
>is -usually- a good bit higher than 3. Some routines can go 20+ instructions in
>between accessing memory. It all depends on what you're computing.

Maybe.  But then again, once you do that memory access you sit for several
hundred clock cycles, giving _plenty_ of spare time for other threads to run.
Which is the point, as I said.





>
>Also, most threads get more than 1 u-op/cycle in the execution unit. This
>starves other threads. If you recompute the above making the assumption that a
>thread can fully utilize the CPU with an ave_ipc of 3.0, it makes HT less useful
>(ideal number = 7.0 instead of 9.0). This is no surprise since HT is designed
>around the central idea that an application cannot fully utilize 3 u-ops/cycle
>execution bandwidth.
>


I won't disagree there, because their basic assumption is perfectly valid.
Running nine small compute-bound tasks is not very realistic.  Running 20
large applications banging on memory makes much more sense...

And there HT will smoke.

>As a final disclaimer, I know that this model used a lot of flawed numbers.
>However, all but one favored proliferation of HT. The only figure that would
>allow HT to scale efficiently to higher numbers is the latency of a memory
>access. If this were significant at all for general applications, turning off
>the cache wouldn't make such a huge difference since general applications would
>be cache missing often already.

I don't understand the statement.  Latency is _the_ issue today.  It will be
2x worse when cpus go 2x faster tomorrow.  On a single memory read today,
which takes on the order of 120 ns absolute best case, a 3 GHz processor will
wait for 360 clocks.  And since the two integer units are double-clocked, we
miss the opportunity to execute at _least_ 720 micro-ops per integer unit.
That leaves a _lot_ of time to factor in other threads to eat up those idle
clock cycles...
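
Spelled out, using only the figures from the paragraph above (120 ns best-case
read, a 3 GHz clock, and double-clocked integer ALUs):

  /* 120 ns at 3 GHz is 360 core clocks; a double-clocked ALU could have
   * issued roughly twice that many simple u-ops in the same window. */
  #include <stdio.h>

  int main(void) {
      double latency_ns = 120.0;   /* best-case memory read, per the post */
      double clock_ghz  = 3.0;     /* ~3 GHz P4                           */

      double stall_clocks = latency_ns * clock_ghz;   /* 360 cycles        */
      double uops_per_alu = stall_clocks * 2.0;       /* double-clocked    */

      printf("clocks lost per miss: %.0f\n", stall_clocks);
      printf("lost u-op slots per integer ALU: %.0f\n", uops_per_alu);
      return 0;
  }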





>
>>There is _nothing_ inherently wrong with taking the process scheduler completely
>>out of the O/S and dropping it into the CPU.  It makes perfect sense as the
>>operating system context-switching time is so long that we can't afford to
>>block processes for memory accesses, and have to block only for things that
>>take far longer like I/O operations.  Inside the cpu this context switching
>>time becomes nearly zero, and the gains are significant.
>>
>>It might take a while to get there, but you will see 16-way SMT one day.  Just
>>as surely as you will one day see 4 cpus on a single chip...
>
>HT isn't about context-switching. It's about parallelism and concurrency.
>Threads in HT are concurrent. HT has as much to do with process scheduling as
>SMP does -- just about nil.


Absolutely incorrect.  On a single cpu machine, the operating system gets to
switch back and forth between two compute bound processes.  On a
hyper-threaded cpu this does _not_ happen.  The CPU schedules _both_ processes
itself and interlaces them much more efficiently.

That ought to be obvious if you have any O/S experience...





>
>You keep touting that the processor can do things while a thread blocks on
>memory, but this is extremely rare. HT is really to take advantage of execution
>units not normally utilized. While the memory aspect is appealing for a guy
>designing a chess engine, Intel didn't add HT to appeal to a niche crowd. In the
>general case, HT isn't going to do a lot for you.


The _first_ implementation is not doing badly.  The _second_ generation will
be better.  This is not a "new topic"; it has been discussed in computer
architecture forums for 10 years now.  Only recently has the chip density
reached the point where it became possible to actually do this.  It won't be
too long before it is just as standard as the super-scalar architectures are
today.  10 years ago many were saying "this won't work (super-scalar)."  It
_does_ work, and it works well as the kinks are ironed out.  HT/SMT is nothing
different.  And it will be just as normal in a few years on all processors as
super-scalar is today.





>
>-Matt


