Computer Chess Club Archives



Subject: Re: final note I presume

Author: Robert Hyatt

Date: 11:10:14 12/14/02


On December 14, 2002 at 00:53:04, Matt Taylor wrote:

>On December 13, 2002 at 23:03:22, Robert Hyatt wrote:
>
>>On December 13, 2002 at 21:36:57, Matt Taylor wrote:
>>
>>>On December 13, 2002 at 11:33:17, Robert Hyatt wrote:
>>>
>>><snip>
>>>>Regardless of hand-waving nay-sayers, it is a logical development: removing
>>>>one more time-critical piece of code from the Operating System into the
>>>>microprocessor, namely the task scheduler.
>>>
>>>I wouldn't say that. I opened task manager for a second on my NT box and
>>>discovered that I have 324 threads running.
>>
>>You don't have 324 threads running.  You have 324 threads in the system,
>>with probably 320 of them blocked long-term waiting on I/O thru a socket or
>>whatever.  It is the other _four_ that are interesting...  The threads that
>>are really computing.  Not the threads that are doing I/O or just sleeping
>>waiting on some event to happen.
>>
>>
>>> For HT to replace the NT task
>>>scheduler, it would need the capability to handle at least the 324 concurrent
>>>processes I have running now (technically it would need to handle *many* more).
>>>I quote the numbers for NT because Unix variants are usually not as prolific
>>>with scheduling units, because Unix threads/processes aren't very
>>>lightweight...
>>>
>>>
>>
>>
>>See above, that is simply not true.  First, unix supports both types of threads,
>>lightweight and heavyweight.  But none of that matters.  The only interesting
>>threads from the CPU's perspective are the threads that are not blocked waiting
>>on I/O and other stuff.  Just the threads that are ready to compute...
>>
>>And hyper-threading handles those perfectly...
>
>Yes, some variants handle light-weight threads. The traditional unit of
>scheduling in Unix has always been the process.

That's not been my experience, but the point really doesn't matter that much
to the discussion.  I've been using lightweight threads since at _least_ 1980.
Cray had them on the XMP.  Sun has had them.  In fact, I don't know of any unix
flavor that doesn't have them.  You can always do a heavyweight fork() to get a
totally separate process with its own virtual address space, if you want, of
course.  Or you can do lightweight processes (aka threads) if that suits you
better.  I've been using threads (lightweight processes) since my first
parallel search...  which dates back to 1978.
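
To make the distinction concrete, here is a minimal POSIX sketch (the calls
are standard fork()/pthreads; nothing here is specific to any one unix):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int counter = 0;   /* shared by threads, copied by fork() */

    /* lightweight: runs in the same virtual address space */
    static void *thread_work(void *arg) {
        counter++;            /* the parent sees this increment */
        return NULL;
    }

    int main(void) {
        /* heavyweight: fork() clones the whole address space */
        pid_t pid = fork();
        if (pid == 0) {
            counter++;        /* private copy; the parent never sees it */
            exit(0);
        }
        waitpid(pid, NULL, 0);

        /* lightweight process (thread) in the same address space */
        pthread_t tid;
        pthread_create(&tid, NULL, thread_work, NULL);
        pthread_join(tid, NULL);

        printf("counter = %d\n", counter);  /* prints 1 */
        return 0;
    }

Build with "gcc demo.c -lpthread".  A parallel search wants the thread
version precisely because the search threads must share the same hash tables
and board data.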


> That has affected the
>architecture somewhat, just as NT's lightweight threads mean developers can
>throw a thread at a given problem. Perhaps that's why the subsystems of NT
>inexcusably create over 200 threads at boot-time. Anyway, that's all I meant --
>NT has a very large scheduling queue.

So does the typical unix box.  100 processes is not abnormal at all.  But, and
this is the critical point, hardly _any_ of those threads are in the <ready>
state.  All are blocked waiting on I/O, or on another process, or for a time
interval to elapse, etc.




>
>>>HT will not scale to large numbers of tasks. The IA-32 register set has 8 32-bit
>>>general registers, 8 80-bit FPU registers, 2 16-bit FPU control registers, and 8
>>>128-bit SSE registers. This means each logical CPU requires 244 bytes of
>>>application register file alone. For simplicity, I did not include the 3 groups
>>>of system registers, the MTRRs, or the MSRs. There are additional caches which
>>>would not allow HT to scale unless they were duplicated. Intel is not about to
>>>put MBs of fast-access register file on an IA-32 processor. It would make your
>>>128-cpu HT Pentium 5 cost more than a cluster of Itaniums with negligible
>>>performance gain over a dual- or quad-Xeon system.
>>
>>Want to bet?  10 years ago they could not put cache on-chip due to space
>>limitations.  That is now gone.  With up to 2mb of L2 cache on chip today,
>>they really can do whatever they want in the future.  And there are not
>>really duplicate sets of registers.  Just duplicate rename tables which are
>>much smaller since they only store small pointers to real registers, rather
>>than 32 bit values.
>
>If HT does not have 2 physical sets of registers, what do the remap tables point
>to? Intel docs actually state that the logical CPU has a duplicated set of
>registers.

No they don't.  They have one large set of rename registers, and _two_ rename
tables.  This has been discussed here in the past and can be found in the
intel docs.  If you had two separate sets of registers, how would the
execution units deal with that?  The answer: they couldn't.  But by using one
large set of registers, with a separate rename table for each logical
processor, they simply take care that two concurrent threads never rename eax
to the same rename register.  Then there is no confusion, and the execution
units don't care which thread produced the micro-ops being executed.  It makes
the design _way_ simpler and it is definitely elegant...
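
A toy illustration of that scheme (purely schematic; Intel's actual
structures are not public at this level of detail): one shared pool of
physical registers, and a separate rename table per logical processor mapping
architectural names like eax into the pool.

    #include <stdio.h>

    #define NUM_PHYS  128   /* one shared pool of rename registers */
    #define NUM_ARCH  8     /* eax..edi */
    #define NUM_LCPU  2     /* two logical processors */

    static int phys[NUM_PHYS];                 /* shared physical register file */
    static int rename_tbl[NUM_LCPU][NUM_ARCH]; /* per thread: arch reg -> phys reg */
    static int next_free = 0;                  /* trivial allocator; real hardware
                                                  uses a free list */

    /* Give a thread a fresh physical register for a write to an
       architectural register.  Two threads can both "write eax"
       because each write lands in a different physical register. */
    static int rename_dest(int lcpu, int arch) {
        int p = next_free++;
        rename_tbl[lcpu][arch] = p;
        return p;
    }

    int main(void) {
        int eax = 0;  /* architectural register number 0 */

        phys[rename_dest(0, eax)] = 111;  /* logical cpu 0 writes eax */
        phys[rename_dest(1, eax)] = 222;  /* logical cpu 1 writes eax */

        /* each thread reads back through its own table */
        printf("lcpu 0 sees eax = %d\n", phys[rename_tbl[0][eax]]);
        printf("lcpu 1 sees eax = %d\n", phys[rename_tbl[1][eax]]);
        return 0;
    }

The execution units only ever see physical register numbers, which is exactly
why they don't care which thread a micro-op came from.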




>
>Also, it is important to differentiate between registers and L2 cache which can
>take from 6-300 clocks to access. Even the 8 KB L1 cache that the P4 has takes 2
>cycles to access. If it were possible to build a chip with such vast amounts of
>fast cache memory, there would be no such thing as an on-chip cache hierarchy;
>there would be an L1 cache on-chip and -maybe- an L2 cache on the motherboard.


It isn't the "size" that counts.  It is the physical space that counts.  To
make it larger, you limit where you can put it on the chip.  The farther away
it is, the longer it takes to access the data.  Small caches can be tucked
into holes here and there.  But not a 2MB cache, which at six transistors per
SRAM cell needs roughly 100M transistors just for the bits, not counting the
addressing and so forth.





>
>The L1 caches on Athlon and P4 are a good example of how size limits speed. The
>L1 cache on Athlon requires 3-cycles to access. On P4 it takes 2-cycles. Why? P4
>has 8 KB of L1 data. Athlon has 64 KB of L1 data.
>


Again, I can build an SRAM-based memory of anywhere from 16 bytes to 16
megabytes that runs at the exact same clock frequency.  It isn't about how
much SRAM is included, it is about how big it is physically and where it has
to be located.  Otherwise we'd see the same problem with DRAM if memory byte
capacity meant anything...






>The other argument worth making is that HT will hit diminishing returns very
>quickly. It -may- not even be worth going to quad-HT. The main reason why HT
>gets any performance gains is because two threads don't fully utilize the CPU's
>execution capacity. It is convenient when a cache miss occurs because one thread
>can utilize the full capacity of the processor, but across -most- applications
>that is rare. Additionally, the processor has the ability to speculatively fetch
>data and code. Cache misses are rare.

They aren't rare.  And you aren't thinking the math thru.  Intel hopes for 95%
hits.  That leaves 5% misses, and for each miss we sit for 400 clocks or so
waiting.  IE one out of every 20 accesses misses, and the slowdown is
tremendous.  HT fills in those long idle periods pretty well, and going to 4
or 8 threads should not hit any asymptote, assuming they can feed micro-ops to
the core quickly enough...
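
The arithmetic, spelled out (assuming a 1-clock hit for simplicity; the
95%/400-clock figures are from above):

    #include <stdio.h>

    int main(void) {
        double hit_rate = 0.95, hit_cost = 1.0, miss_cost = 400.0;
        /* average memory access time: hits and misses, weighted */
        double amat = hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost;
        printf("average clocks per access: %.2f\n", amat);  /* ~20.95 */
        return 0;
    }

So even at 95% hits, the _average_ access costs about 21 clocks instead of 1,
and those 400-clock stalls are exactly the idle slots a second thread can fill.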


>
>One of my machines has a BIOS option to disable the on-chip caches. When I
>disable it, my 1.2 GHz Thunderbird runs extremely slow. Every memory access
>effectively becomes a cache miss. If you have a machine with this option, you
>can try it and see. If cache misses happened often enough to make a viable
>impact on HT, you wouldn't see a big difference.

Again, you are not thinking the math thru.  Turn L1/L2 off and _everything_
takes 400 clocks to read.  Turn it on and 95% of the stuff doesn't.  But if
you know anything about Amdahl's law, that last 5% is a killer: even if you
drop the cpu clock cycle time to zero picoseconds, the cpu could not run 20x
faster, because that last 5% is bound by memory speed, not processor clock
cycle time.
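
The same point as an Amdahl's law calculation (a sketch; "serial" here is the
5% of time pinned to memory speed, and s is how much faster everything else
gets):

    #include <stdio.h>

    /* Amdahl: speedup = 1 / (serial + (1 - serial) / s) */
    static double amdahl(double serial, double s) {
        return 1.0 / (serial + (1.0 - serial) / s);
    }

    int main(void) {
        double serial = 0.05;          /* fraction bound by memory */
        printf("10x faster clock:  %.2fx\n", amdahl(serial, 10.0));
        printf("100x faster clock: %.2fx\n", amdahl(serial, 100.0));
        printf("clock -> infinity: %.2fx\n", amdahl(serial, 1e9));
        return 0;
    }

The last line converges on 1/0.05 = 20x and never reaches it, no matter how
fast the clock gets.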





>
>>>HT is merely a way to make the existing hardware more efficient. If it were
>>>anything more, it would add -additional- hardware registers so the OS could
>>>control the scheduling algorithm and specify the location of the ready queue. It
>>>would also add instructions that would allow the processor to switch tasks.
>>
>>The processor _already_ is doing this.  But for processes that are ready to
>>run rather than for processes that are long-term-blocked for I/O, etc.
>
>Yes, but the scheduler's job is to pick who runs, when they run, and how long
>they run. HT only affects the first by allowing the scheduler to pick two tasks
>to run instead of just one. HT isn't replacing the scheduler; it only
>complicates it.

No.  It simplifies the task when I can say "here, run both of these" rather
than having to bounce back and forth as a quantum expires.  It will get better
when HT goes to 4-way and beyond...


>
>FYI, HyperThreading looks like a regular CPU to the operating system. There may
>be some means of communicating that it's an HT CPU, but Intel made HT
>backward-compatible.
>
>-Matt



I already know that.  There is another issue in dual-cpu machines like mine.
The current linux (and XP) kernels do not understand that if there are two
compute-bound processes, they should run on different physical processors,
rather than on two logical processors that share a physical cpu.  The linux
SMP group is working on this problem.  The Windows .NET system will also
understand this.

And the CPUID instruction certainly lets you know you have a logical cpu, as
there is an SMT flag that the O/S can discover...  They just aren't using it
very well _yet_...  But SMT is new, so it will take some time.
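
A sketch of the check (using GCC's <cpuid.h> helper for brevity; 2002-era
code did the same CPUID leaf-1 query with inline asm; the HTT flag is EDX bit
28, and EBX bits 16-23 give the logical processor count per package):

    #include <cpuid.h>   /* GCC/Clang convenience wrapper around CPUID */
    #include <stdio.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            printf("CPUID leaf 1 not supported\n");
            return 1;
        }

        if (edx & (1u << 28)) {   /* HTT feature flag */
            unsigned int logical = (ebx >> 16) & 0xff;
            printf("SMT capable: %u logical processors per package\n", logical);
        } else {
            printf("no hyper-threading reported\n");
        }
        return 0;
    }

This is the information a scheduler needs in order to spread two
compute-bound processes across physical packages instead of stacking them on
one.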


