Computer Chess Club Archives


Subject: Re: DIEP NUMA SMP at P4 3.06Ghz with Hyperthreading

Author: Matt Taylor

Date: 23:01:20 12/13/02



On December 13, 2002 at 22:51:35, Robert Hyatt wrote:

>On December 13, 2002 at 21:55:09, Matt Taylor wrote:
>
>><snip>
>>>>the problem is you lose time to the ECC and registered features of the
>>>>memory you need for the dual. of course that's the case for all duals.
>>>>both K7 MP and Xeon suffer from that regrettably.
>>>
>>>That is not true.  The duals do _not_ have to have ECC ram.  And it doesn't
>>>appear to be any slower than non-ECC ram although I will be able to test
>>>that before long as we have some non-ECC machines coming in.
>>
>>Actually he is correct about the registered ram. The "registered" feature is
>>that it delays longer than unregistered ram. This is important for stability. It
>>doesn't affect bandwidth, but it does affect latency.
>>
>><snip>
>>>>With some luck by the time they release a 3.06Ghz Xeon they have improved
>>>>the SMT another bit.
>>>>
>>>>Seems to me they working for years to get that SMT/HT slowly better working.
>>>
>>>Not "for years".  It was announced as a coming thing a couple of years ago and
>>>several
>>>vendors have been discussing the idea.  And they are going to increase the ratio
>>>of physical
>>>to logical cpus before long also...
>>
>>I don't think so. HT won't scale terribly well. I made another post about that,
>>and I won't reiterate what I said there.
>>
>>-Matt
>
>
>I don't see why the two of you make such sweeping generalizations.  What is
>to prevent modifying the L1 cache to spit out 256 bits of data at once?  There
>is nothing internally that can't be improved over time, and the idea of a
>4-way hyper-threaded cpu should eventually be just as effective as four
>completely separate cpus, although the price should be substantially lower.

It's not really a "sweeping generalization." I was quite specific. I said HT
won't scale very well.

Currently, cost will prohibit the scaling of HT. You can argue that R&D will
eventually make it cost-effective, but that doesn't mean it will happen
tomorrow. The complexity of sharing the cache among threads, along with the
size and width of the cache, also adds to cost and overall chip complexity.
Internally, IA-32 processors are only 64 bits wide. This is easy to verify:
throughput for single-precision floating-point SSE operations is only twice the
throughput of the SISD FPU.

My comment, however, was based on the fact that HT shares the execution units
among threads. Neglecting memory accesses briefly, the worst-case performance on
a P4 is 33% of capacity (100% capacity = 3 u-ops/cycle). This means that at most
3 concurrent HT threads will max out the CPU. Of course, HT should actually
scale a little higher since the threads must all pause to access memory
occasionally. The question becomes, "How often do threads access memory?" A
single thread accesses memory quite often, but most accesses are cache hits.

Here are some equations for computing the optimum number of HT CPUs:
effective_ipc = ave_ins_between_mem / (ave_ins_between_mem / ave_ipc + mem_latency)
num_ht_cpus = max_ipc / effective_ipc

Now, some P4/P4 Xeon numbers:
max_ipc = 3 u-ops/cycle
ave_ipc = 1 u-op/cycle (assuming worst-case)
ave_ins_between_mem = 3 u-ops (also lower than actual)
mem_latency = 6 cycles (for an L2 cache hit)

effective_ipc = 3 / (3 / 1 + 6) = 0.33 ipc
num_ht_cpus = 3 / 0.33 = 9
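
To make the arithmetic concrete, here is a small C sketch of the model above.
The function name and structure are my own illustration, not anything from
Intel's documentation; it just plugs the numbers into the two equations:

    #include <stdio.h>

    /* effective_ipc: u-ops per cycle one thread sustains once its memory
     * stall is amortized over a burst of ave_ins_between_mem u-ops.
     * num_ht_cpus: how many such threads it takes to saturate the core. */
    static double effective_ipc(double ave_ins_between_mem, double ave_ipc,
                                double mem_latency)
    {
        return ave_ins_between_mem /
               (ave_ins_between_mem / ave_ipc + mem_latency);
    }

    int main(void)
    {
        const double max_ipc = 3.0;                  /* 3 u-ops/cycle on the P4   */
        double ipc = effective_ipc(3.0, 1.0, 6.0);   /* worst-case thread, L2 hit */

        printf("effective_ipc = %.2f\n", ipc);           /* prints 0.33 */
        printf("num_ht_cpus   = %.1f\n", max_ipc / ipc); /* prints 9.0  */
        return 0;
    }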

Making unrealistic assumptions that favor more HT CPUs only shows HT scaling to
9 CPUs before the threads start trampling over each other. Note that I also
unrealistically assumed that the CPU can execute all u-ops regardless of the
circumstances. I did assume that all memory accesses hit the L2 cache, but
accesses to main memory are infrequent for general applications. This is a fair
assumption since Intel isn't marketing the P4 to people interested in chess;
they're marketing the P4 to many different people with many different
applications.

I might also point out that the number of instructions between memory accesses
is -usually- a good bit higher than 3. Some routines can go 20+ instructions in
between accessing memory. It all depends on what you're computing.

Also, most threads get more than 1 u-op/cycle in the execution unit. This
starves other threads. If you recompute the above under the assumption that a
thread can fully utilize the CPU with an ave_ipc of 3.0, it makes HT less useful
(ideal number = 7.0 instead of 9.0). This is no surprise since HT is designed
around the central idea that an application cannot fully utilize 3 u-ops/cycle
of execution bandwidth.
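
Plugging ave_ipc = 3.0 into the same sketch (again my own illustration, not an
official figure) gives the 7.0 number:

    double ipc = effective_ipc(3.0, 3.0, 6.0);   /* 3 / (1 + 6) = ~0.43 */
    printf("num_ht_cpus = %.1f\n", 3.0 / ipc);   /* prints 7.0 */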

As a final disclaimer, I know that this model used a lot of flawed numbers.
However, all but one favored proliferation of HT. The only figure that would
allow HT to scale efficiently to higher numbers is the latency of a memory
access. If long memory latencies were at all significant for general
applications, turning off the cache wouldn't make such a huge difference, since
those applications would already be missing cache often.

>There is _nothing_ inherently wrong with taking the process scheduler completely
>out of the O/S and dropping it into the CPU.  It makes perfect sense as the
>operating system context-switching time is so long that we can't afford to
>block processes for memory accesses, and have to block only for things that
>take far longer like I/O operations.  Inside the cpu this context switching
>time becomes nearly zero, and the gains are significant.
>
>It might take a while to get there, but you will see 16-way SMT one day.  Just
>as surely as you will one day see 4 cpus on a single chip...

HT isn't about context-switching. It's about parallelism and concurrency.
Threads in HT are concurrent. HT has as much to do with process scheduling as
SMP does -- just about nil.

You keep touting that the processor can do things while a thread blocks on
memory, but this is extremely rare. HT is really there to take advantage of
execution units that normally go unused. While the memory aspect is appealing
for a guy designing a chess engine, Intel didn't add HT to appeal to a niche
crowd. In the general case, HT isn't going to do a lot for you.

-Matt


