Computer Chess Club Archives


Subject: Re: final note I presume

Author: Matt Taylor

Date: 02:53:37 12/15/02



On December 14, 2002 at 14:10:14, Robert Hyatt wrote:

>>>>HT will not scale to large numbers of tasks. The IA-32 register set has 8 32-bit
>>>>general registers, 8 80-bit FPU registers, 2 16-bit FPU control registers, and 8
>>>>128-bit SSE registers. This means each logical CPU requires 244 bytes of
>>>>application register file alone. For simplicity, I did not include the 3 groups
>>>>of system registers, the MTRRs, or the MSRs. There are additional caches which
>>>>would not allow HT to scale unless they were duplicated. Intel is not about to
>>>>put MBs of fast-access register file on an IA-32 processor. It would make your
>>>>128-cpu HT Pentium 5 cost more than a cluster of Itaniums with negligible
>>>>performance gain over a dual- or quad-Xeon system.
>>>
>>>Want to bet?  10 years ago they could not put cache on-chip due to space
>>>limitations.  That is now gone.  With up to 2mb of L2 cache on chip today,
>>>they really can do whatever they want in the future.  And there are not
>>>really duplicate sets of registers.  Just duplicate rename tables which are
>>>much smaller since they only store small pointers to real registers, rather
>>>than 32 bit values.
>>
>>If HT does not have 2 physical sets of registers, what do the remap tables point
>>to? Intel docs actually state that the logical CPU has a duplicated set of
>>registers.
>
>No they don't.  They have one large set of rename registers, and _two_ rename
>tables.  This has been discussed here in the past and can be found in the Intel
>docs.  If you had two separate sets of registers, how would the execution units
>deal with that?  The answer:  they couldn't.  But by using one large set of
>registers, with a separate rename table for each processor, they simply take
>care to be sure that two concurrent threads do not rename eax to the same
>rename register, and then there is no confusion and the execution units don't
>care which thread produced the micro-ops being executed.  It makes the design
>_way_ simpler and it is definitely elegant...

If you read what you just said, you're arguing the same thing I said. I wasn't
specific with semantics, but for 2 logical CPUs you need at least twice the
basic number of physical registers. For 8 CPUs, you will need 8 times that
number. You can't have 32,768 CPUs without 32,768 physical sets of registers. It
doesn't matter if they get lumped together. You still need space.
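To put rough numbers on that, here is a back-of-the-envelope sketch in C. The
244-byte figure is the application-visible tally from my earlier post; it
counts nothing but those registers, so the real per-thread cost is higher.

/* Back-of-the-envelope: application-visible register state that
 * must exist somewhere on-chip for each logical CPU. Only the
 * registers tallied earlier in this thread are counted. */
#include <stdio.h>

int main(void)
{
    int gpr  = 8 * 32 / 8;   /* 8 x 32-bit general registers =  32 bytes */
    int fpu  = 8 * 80 / 8;   /* 8 x 80-bit FPU registers     =  80 bytes */
    int fpcw = 2 * 16 / 8;   /* 2 x 16-bit FPU control regs  =   4 bytes */
    int sse  = 8 * 128 / 8;  /* 8 x 128-bit SSE registers    = 128 bytes */
    int app  = gpr + fpu + fpcw + sse;  /* = 244 bytes */

    for (int n = 2; n <= 32768; n *= 4)
        printf("%5d logical CPUs -> %8d bytes of register state\n",
               n, n * app);
    return 0;
}

At 32,768 logical CPUs that is already ~8 MB of register file, before counting
any system state at all, which is exactly the point.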

>>Also, it is important to differentiate between registers and L2 cache which can
>>take from 6-300 clocks to access. Even the 8 KB L1 cache that the P4 has takes 2
>>cycles to access. If it were possible to build a chip with such vast amounts of
>>fast cache memory, there would be no such thing as an on-chip cache hierarchy;
>>there would be an L1 cache on-chip and -maybe- an L2 cache on the motherboard.
>
>
>It isn't the "size" that counts.  It is the physical space that counts.  To
>make it larger, you limit where you can put it on the chip.  The farther away,
>the longer it takes to access the data.  Small caches can be tucked into holes
>here and there.  But not a 2mb cache that takes 10-12M transistors just for the
>flip-flops for the bits, not counting the addressing and so forth.

8 times the number of registers = 8 times the size...
When you include the system registers, I would expect something much closer to
1 KB of register space per logical CPU, perhaps more. I also forgot to include
the segment registers, each of which is a bit bulkier than 8 bytes once you
count the hidden descriptor state. You said that size limits the distance. If
so, then how does a large amount of register space -not- limit the distance?

>>The other argument worth making is that HT will hit diminishing returns very
>>quickly. It -may- not even be worth going to quad-HT. The main reason why HT
>>gets any performance gains is because two threads don't fully utilize the CPU's
>>execution capacity. It is convenient when a cache miss occurs because one thread
>>can utilize the full capacity of the processor, but across -most- applications
>>that is rare. Additionally, the processor has the ability to speculatively fetch
>>data and code. Cache misses are rare.
>
>They aren't rare.  And you aren't thinking the math thru.  Intel hopes for 95%
>hits.  That leaves 5% misses.  And for each miss, we sit for 400 clocks or so
>waiting.  IE one out of every 20 accesses misses, and the slowdown is
>tremendous.  HT fills in those long idle periods pretty well and going to 4 or
>8 should not hit any asymptote assuming they can feed micro-ops to the core
>quickly enough...

They can feed up to 3 u-ops/cycle. Splitting 3 u-ops/cycle among 4-8 threads...
That also neglects the fact that now you have 4-8 threads competing for main
memory, so it may take 4-8 times longer to fill a request.
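A toy model of both effects, using the figures from this thread (2-cycle L1
hit, 400-clock miss, 95% hit rate) rather than measurements:

/* Toy model: average memory access cost at a 5% miss rate, and
 * per-thread issue bandwidth when 3 u-ops/cycle are shared.
 * Hit/miss costs are the figures quoted in this thread. */
#include <stdio.h>

int main(void)
{
    double hit = 2.0, miss = 400.0, miss_rate = 0.05;
    printf("average access: %.1f clocks\n",
           (1.0 - miss_rate) * hit + miss_rate * miss);  /* 21.9 */

    for (int threads = 1; threads <= 8; threads *= 2)
        printf("%d threads -> %.2f u-ops/cycle each\n",
               threads, 3.0 / threads);
    return 0;
}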

>>One of my machines has a BIOS option to disable the on-chip caches. When I
>>disable it, my 1.2 GHz Thunderbird runs extremely slow. Every memory access
>>effectively becomes a cache miss. If you have a machine with this option, you
>>can try it and see. If cache misses happened often enough to make a viable
>>impact on HT, you wouldn't see a big difference.
>
>Again, you are not thinking thru on the math.  Turn L1/L2 off and _everything_
>takes 400 clocks to read.  Turn it on and 95% of the stuff doesn't, but if you
>know anything about Amdahl's law, that last 5% is a killer, meaning that even
>if you drop the cpu clock cycle time to zero picoseconds, the cpu could not run
>20x faster due to that last 5% being bound by memory speed, not processor clock
>cycle time.

Amdahl's Law applies to parallel computing. Memory accesses aren't really a
parallel computation problem...
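For reference, the 20x cap in the quoted paragraph is just that arithmetic with
the 5% memory-bound time treated as the serial fraction. A sketch of the claim,
not an endorsement of the analogy:

/* Amdahl-style bound: if 5% of execution time is memory-bound and
 * only the other 95% gets faster, overall speedup caps at
 * 1/0.05 = 20x no matter how fast the core gets. */
#include <stdio.h>

int main(void)
{
    double mem = 0.05;  /* fraction of time bound by memory */
    for (double s = 2.0; s <= 512.0; s *= 4.0)
        printf("core %4.0fx faster -> overall %5.2fx\n",
               s, 1.0 / ((1.0 - mem) / s + mem));
    return 0;
}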

Actually, it doesn't take 400 clocks to read memory; it takes 105 clocks for my
Athlon MP 1600+ to read a random piece of memory (the CPU multiplier times 10
bus clocks). That assumes you have to re-latch RAS. If the processor doesn't
need to re-latch RAS, it takes only about 27 clocks. Most memory accesses
are stack accesses, and stack accesses are almost -always- cache hits.
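Redo the 5%-miss arithmetic with these latencies instead of the 400-clock
figure and the picture changes considerably. A sketch, assuming the Athlon MP
1600+'s 10.5x core/bus clock ratio (1400/133 MHz):

/* Average access cost at a 5% miss rate under three miss
 * latencies: the 400-clock claim, a full RAS re-latch
 * (10.5 multiplier x 10 bus clocks = 105 core clocks), and an
 * open page (~27 clocks). A sketch of the argument, not a
 * benchmark. */
#include <stdio.h>

int main(void)
{
    double lat[]        = { 400.0, 10.5 * 10.0, 27.0 };
    const char *which[] = { "400-clock claim", "RAS re-latch", "open page" };
    for (int i = 0; i < 3; i++)
        printf("%-15s -> avg access %5.2f clocks\n",
               which[i], 0.95 * 2.0 + 0.05 * lat[i]);
    return 0;
}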

>>>>HT is merely a way to make the existing hardware more efficient. If it were
>>>>anything more, it would add -additional- hardware registers so the OS could
>>>>control the scheduling algorithm and specify the location of the ready queue. It
>>>>would also add instructions that would allow the processor to switch tasks.
>>>
>>>The processor _already_ is doing this.  But for processes that are ready to
>>>run rather than for processes that are long-term-blocked for I/O, etc.
>>
>>Yes, but the scheduler's job is to pick who runs, when they run, and how long
>>they run. HT only affects the first by allowing the scheduler to pick two tasks
>>to run instead of just one. HT isn't replacing the scheduler; it only
>>complicates it.
>
>No.  It simplifies the task when I can say "here run both of these".  Rather
>than having to bounce back and forth as a quantum expires.  It will get better
>when HT goes to 4-way and beyond...

Again, as far as the OS is concerned, HT = SMP.

Are you trying to explain to me that SMP schedulers are simpler than single-CPU
schedulers? Synchronizing ready queues and picking multiple processes out is
easier than not synchronizing and picking the one off the top?
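To make the comparison concrete, here is a minimal sketch (illustrative names,
not from any real kernel) of what "picking the one off the top" looks like with
and without that synchronization:

/* Minimal sketch: dequeuing the next task from a ready queue.
 * On a uniprocessor nothing else can touch the queue; under
 * SMP/HT another logical CPU may be picking at the same moment,
 * so the dequeue must be locked. Illustrative only; queue_lock
 * needs pthread_spin_init() before use. */
#include <pthread.h>
#include <stddef.h>

struct task { struct task *next; };

static struct task *ready_queue;      /* head = next task to run */
static pthread_spinlock_t queue_lock;

/* Uniprocessor scheduler: just take the head. */
struct task *pick_next_up(void)
{
    struct task *t = ready_queue;
    if (t)
        ready_queue = t->next;
    return t;
}

/* SMP/HT scheduler: same operation, but synchronized. */
struct task *pick_next_smp(void)
{
    pthread_spin_lock(&queue_lock);
    struct task *t = ready_queue;
    if (t)
        ready_queue = t->next;
    pthread_spin_unlock(&queue_lock);
    return t;
}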

-Matt


