Computer Chess Club Archives



Subject: Re: final note I presume

Author: Matt Taylor

Date: 02:30:32 12/17/02



On December 16, 2002 at 23:22:32, Robert Hyatt wrote:

>On December 16, 2002 at 17:47:49, Matt Taylor wrote:
>
>>On December 15, 2002 at 22:01:10, Robert Hyatt wrote:
>>
>>>On December 15, 2002 at 05:53:37, Matt Taylor wrote:
>>>
>>>>On December 14, 2002 at 14:10:14, Robert Hyatt wrote:
>>>>
>>>>>>>>HT will not scale to large numbers of tasks. The IA-32 register set has 8 32-bit
>>>>>>>>general registers, 8 80-bit FPU registers, 2 16-bit FPU control registers, and 8
>>>>>>>>128-bit SSE registers. This means each logical CPU requires 244 bytes of
>>>>>>>>application register file alone. For simplicity, I did not include the 3 groups
>>>>>>>>of system registers, the MTRRs, or the MSRs. There are additional caches which
>>>>>>>>would not allow HT to scale unless they were duplicated. Intel is not about to
>>>>>>>>put MBs of fast-access register file on an IA-32 processor. It would make your
>>>>>>>>128-cpu HT Pentium 5 cost more than a cluster of Itaniums with negligible
>>>>>>>>performance gain over a dual- or quad-Xeon system.
>>>>>>>
>>>>>>>Want to bet?  10 years ago they could not put cache on-chip due to space
>>>>>>>limitations.  That is now gone.  With up to 2mb of L2 cache on chip today,
>>>>>>>they really can do whatever they want in the future.  And there are not
>>>>>>>really duplicate sets of registers.  Just duplicate rename tables which are
>>>>>>>much smaller since they only store small pointers to real registers, rather
>>>>>>>than 32 bit values.
>>>>>>
>>>>>>If HT does not have 2 physical sets of registers, what do the remap tables point
>>>>>>to? Intel docs actually state that the logical CPU has a duplicated set of
>>>>>>registers.
>>>>>
>>>>>No they don't.  They have one large set of rename registers, and _two_
>>>>>rename tables.  This has been discussed here in the past and can be found
>>>>>in the intel docs.  If you had two separate sets of registers, how would
>>>>>the execution units deal with that?  The answer: they couldn't.  But by
>>>>>using one large set of registers, with a separate rename table for each
>>>>>processor, they simply take care to be sure that two concurrent threads do
>>>>>not rename eax to the same rename register, and then there is no confusion
>>>>>and the execution units don't care which thread produced the micro-ops
>>>>>being executed.  It makes the design _way_ simpler and it is definitely
>>>>>elegant...
>>>>
>>>>If you read what you just said, you're arguing the same thing I said. I wasn't
>>>>specific with semantics, but for 2 logical CPUs you need at least twice the
>>>>basic number of physical registers. For 8 CPUs, you will need 8 times that
>>>>number. You can't have 32,768 CPUs without 32,768 physical sets of registers. It
>>>>doesn't matter if they get lumped together. You still need space.
>>>
>>>yes...  but there is a big difference between two separate register files and
>>>one larger register file.    And if you are having problems executing
>>>instructions for all threads, you certainly aren't using all the rename
>>>registers that are available...
>>>
>>>But the interesting thing is that to go to 4-way SMT it is very likely that you
>>>don't have to go to 4x the rename register file size.  You can tune for the
>>>typical case and just allow it to run out in rare cases, and stall the decoder
>>>until some instructions retire and free up more registers.
>>
>>You would still need somewhere to store the registers. They have to be
>>persistent. Due to the architecture, the OS has to supply a task state segment.
>>It would be possible to dump the registers to the segment, but it would be even
>>more costly (incurring extra memory writes), and it would only work for
>>integer/segment registers.
>
>Not quite.  The OS doesn't get access to the rename register space, just the
>normal EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP, etc.  Because it is in the
>"rename space" automatically as O/S instructions are decoded into micro-ops..
>
>There's no reason to try to dump the rename register space as that is completely
>hidden from the O/S entirely.  The OS only needs to save/restore the normal set
>of registers above...
>
>
>
>>
>>>>>>Also, it is important to differentiate between registers and L2 cache which can
>>>>>>take from 6-300 clocks to access. Even the 8 KB L1 cache that the P4 has takes 2
>>>>>>cycles to access. If it were possible to build a chip with such vast amounts of
>>>>>>fast cache memory, there would be no such thing as an on-chip cache hierarchy;
>>>>>>there would be an L1 cache on-chip and -maybe- an L2 cache on the motherboard.
>>>>>
>>>>>
>>>>>It isn't the "size" that counts.  It is the physical space that counts.
>>>>>To make it larger, you limit where you can put it on the chip.  The
>>>>>farther away, the longer it takes to access the data.  Small caches can
>>>>>be tucked into holes here and there.  But not a 2mb cache that takes
>>>>>10-12M transistors just for the flip-flops for the bits, not counting
>>>>>the addressing and so forth.
>>>>
>>>>8 times the number of registers = 8 times the size...
>>>>When you include system registers, I would expect much closer to 1 KB or so of
>>>>register space, and perhaps even more. I forgot to also include the segment
>>>>registers, each a bit bulkier than 8 bytes. You said that size limits the
>>>>distance. If so, then how does a large amount of register space -not- limit the
>>>>distance?
>>>
>>>You don't need _that_ large an amount.  IE 16kb of register space is huge.  And
>>>it is certainly doable on-chip and close by.  Even 1K is probably more than
>>>enough to get thru at least 4-way SMT, that is 256 registers which is a bunch.
>>
>>8 GPRs * 4 bytes = 32 bytes
>>1 flags * 4 bytes = 4 bytes
>>8 FPU regs * 10 bytes = 80 bytes
>>6 segregs * 8 bytes (actually more, CPU unpacks the data) = 48 bytes
>>8 SSE regs * 16 bytes = 128 bytes
>
>renaming doesn't apply to fp/SSE so far as I know.
>
>
>
>>4 control regs * 4 bytes = 16 bytes
>>6 debug regs * 4 bytes = 24 bytes
>>1 TR * 4 bytes (probably more with cache...) = 4 bytes
>>1 GDTR * 6 bytes = 6 bytes
>>1 IDTR * 6 bytes = 6 bytes
>>1 LDTR * 6 bytes = 6 bytes
>>
>>I lost count of the MSRs around 128, and I scrolled through 4 or 5 pages more. I
>>would estimate at least 160. Not all are writable, and not all are 64-bits wide.
>>However, it would be safe to assume (I think) that you have at least 100*8 bytes
>>= 800 bytes of r/w MSRs. (Most of the space is consumed by MSRs related to
>>performance counters and machine check.) You have an additional 354 bytes of
>>register space from the above register set, most of which can't be cached in
>>main memory.
>>
>
>I don't think any of those are renamable.  Only the 8 primary registers play
>into that, plus the flags, so far as I recall...  but that is a recollection
>and I did not go back and verify it in the old Intel papers I have around.
>
>
>
>
>>I did exclude the test registers, by the way, because even though they are r/w,
>>they're used to latch commands to the TLB for testing.
>
>Again, none of those are in the rename space.
>
>
>>
>>The segregs, GPRs, and LDTR all get loaded from the TSS. They can be discarded
>>at will (for a hefty price in most cases). However, the bulk of the register set
>>can't be.
>>
>>Also consider that AMD is pushing their 64-bit extensions to IA-32, and it seems
>>probable that Intel will actually follow AMD's lead for the second time in the
>>history of x86. There are rumors (which Intel will neither confirm nor deny) that
>>Intel is working on a chip called Yamhill which is compatible with AMD x86-64.
>>This doubles the number of SSE registers, doubles the number of GPRs, and
>>doubles the width of the GPRs. That's 96+128=224 bytes more that you'll need.
>
>
>Are you sure you understand the register renaming idea?  You don't duplicate
>the entire set of registers.  You simply have a set of registers that the
>rename table can point to.  And most registers are not renamed since they
>are not often the subject of  something like:
>
>      mov eax, something
>      inc eax
>      mov something, eax
>      mov eax, something_else
>      dec eax
>      mov something_else, eax
>
>Register renaming turns that apparent set of operand conflicts (eax) into
>two completely independent sets of three instructions that use a different
>register for each set...
>
>There would be no point in doing that for MSRs, and the like...  they aren't
>used like the above.

Yes, but what I'm saying is that you need register space somewhere for these
other registers, and a logical CPU needs to be small to fit on-die. If memory
serves correctly, the Athlon has about 40 rename registers. One could expect a
similar number from the P4.

What I'm trying to say is that you won't need a huge rename space, but other
things grow fast. SSE is big, and I don't think it's renamed either. It's slow
because it requires

>>>>>>The other argument worth making is that HT will hit diminishing returns very
>>>>>>quickly. It -may- not even be worth going to quad-HT. The main reason why HT
>>>>>>gets any performance gains is because two threads don't fully utilize the CPU's
>>>>>>execution capacity. It is convenient when a cache miss occurs because one thread
>>>>>>can utilize the full capacity of the processor, but across -most- applications
>>>>>>that is rare. Additionally, the processor has the ability to speculatively fetch
>>>>>>data and code. Cache misses are rare.
>>>>>
>>>>>They aren't rare.  And you aren't thinking the math thru.  Intel hopes
>>>>>for 95% hits.  That leaves 5% misses.  And for each miss, we sit for 400
>>>>>clocks or so waiting.  IE one out of every 20 accesses misses, and the
>>>>>slowdown is tremendous.  HT fills in those long idle periods pretty well
>>>>>and going to 4 or 8 should not hit any asymptote assuming they can feed
>>>>>micro-ops to the core quickly enough...
>>>>
>>>>They can feed up to 3 u-ops/cycle. Splitting 3 u-ops/cycle among 4-8 threads...
>>>>That also neglects the fact that now you have 4-8 threads competing for main
>>>>memory, so it may take 4-8 times longer to fill a request.
>>>
>>>Of course it might.  But that is _the_ point.  For those cases, SMT won't do
>>>much.  Any more than running two chess engines on a single cpu is very
>>>efficient.  But for the right mix the gain is significant.  IE I can run one
>>>compute-bound process and one heavily I/O-bound process and if they each take
>>>one hour to run by themselves, they will not take much more than one hour to
>>>run together.  Run two I/O jobs together and they take two hours.  Ditto for
>>>two compute bound processes.  But for reasonable "mixes" the gain is
>>>significant, and for the optimal mixes, the gain is tremendous...
>>>
>>>SMT is no different.  It won't speed up all kinds of applications.  But it will
>>>speed up many.  And in particular, it is very effective for threaded apps that
>>>share everything thru the L1/L2 caches anyway...  making cache-thrashing less of
>>>a problem.
>>
>>Cache thrashing isn't a problem if your scheduler knows well enough to schedule
>>your other thread instead of John Doe's other thread on the physical CPU you're
>>sharing.
>
>_if_ it is ready to run.  If it is ready to run at the time the second logical
>processor becomes idle.  Etc...

If the thread is ready to run, it can get scheduled on the second logical CPU.
If both threads are scheduled for the same quantum, they'll execute
concurrently. They won't execute interleaved; they execute -concurrently-. John
Doe can steal away some of your execution bandwidth. Even if he's doing I/O, I/O
can cost a lot of execution cycles (doing things like thread transitions, memory
copying, etc.)

A scheduler that doesn't know about HT can impede your performance unless you're
fairly certain that you're going to occupy all logical CPUs in the system.

>>>>>>One of my machines has a BIOS option to disable the on-chip caches. When I
>>>>>>disable it, my 1.2 GHz Thunderbird runs extremely slow. Every memory access
>>>>>>effectively becomes a cache miss. If you have a machine with this option, you
>>>>>>can try it and see. If cache misses happened often enough to make a viable
>>>>>>impact on HT, you wouldn't see a big difference.
>>>>>
>>>>>Again, you are not thinking thru on the math.  Turn L1/L2 off and
>>>>>_everything_ takes 400 clocks to read.  Turn it on and 95% of the stuff
>>>>>doesn't, but if you know anything about Amdahl's law, that last 5% is a
>>>>>killer, meaning that even if you drop the cpu clock cycle time to zero
>>>>>picoseconds, the cpu could not run 20x faster due to that last 5% being
>>>>>bound by memory speed, not processor clock cycle time.
>>>>
>>>>Amdahl's Law applies to parallel computing. Memory accesses aren't really a
>>>>parallel computation problem...
>>>
>>>They are in this context, because you can do anything you want to that 95%
>>>of the code that runs in cache, but that last 5% is the killer and that is the
>>>part that makes SMT work...
>>>
>>>
>>>
>>>>
>>>>Actually, it doesn't take 400 clocks to read memory; it takes 105 clocks for my
>>>>AthlonMP 1600 to read a random segment of memory for me (multiplier * 10 bus
>>>>clocks). That assumes that you have to re-latch RAS. If my processor doesn't
>>>>need to re-latch RAS, then it takes only about 27 clocks. Most memory accesses
>>>>are stack accesses, and stack accesses are almost -always- cache hits.
>>>
>>>It only takes 10 clocks on a 100mhz machine.  But I am talking top-end which is
>>>in the 3,000mhz range.  And there it is 300+ clocks, period...  You keep saying
>>>that same thing about the stack.  And it is true for _some_ programs.  It is
>>>_not_ true for many.  Crafty is one prime example that doesn't use much stack
>>>space between procedure calls.  I have several other similar programs here
>>>such as a couple that do molecular modeling, some simulation codes, and
>>>similar things that use very large static arrays of data.
>>
>>A 3.06 GHz Pentium 4 has a multiplier of 23.0. On pc2100 DDR SDRAM, this is 230
>>clocks worst-case, 43-56 clocks when RAS latency isn't incurred.
>
>I will say this again, carefully.  I _run_ on these machines.  I haven't found
>a single machine in a year that can produce that kind of low latency.  Just do
>random accesses, spread out at _least_ as long as a cache line, and run it for
>a good while.  I get 100-130ns as the average latency.  At 3.06ghz, that turns
>into 300 clocks +...

On SMP it's worse. On RDRAM it's worse. And 230 clocks at 3.06 GHz is about 75 ns.

A random access is that high. A typical access in many compute-bound
applications is not. When you hit the same row in memory, you won't incur the
full latency. When you're always doing random probes to memory, yes, it will be
higher. A lot of compute-bound applications aren't completely random memory
probes. The central assumption that I have made is that Intel is marketing HT
to a broader compute-bound market. It obviously means nothing (well, maybe
marketing hype) to the desktop market.

>Someone in r.g.c.c. argued the same point.  But when he ran the above test,
>he came back and reported "You are correct, I get 135ns latency, not the
>60 I was expecting"

I am not that someone in r.g.c.c., and what I have said approximately agrees with
your figure. I realize now that I did not include the latency from the
northbridge to the CPU, which accounts for the larger figure. However, the
figure is still smaller than 135 ns in most cases.

Yes, a -random- probe will incur around that much latency. I have dotted my i's
and crossed my t's here and been careful to state every time that the smaller
figure is when you don't need to relatch the row. Not all applications require
random probes. A few do some random probes and a lot of serial access.

I also did some measurements. On serial probes, I incur about 52 cycles of
latency (37 ns) on my AthlonMP 1600 system. This is a multiplier of 10.5 or
effectively 5 bus cycles. Random probes are 138 ns latency for me which is
approximately equal to your figure. It would also completely agree with my
previous statements if northbridge latency were taken into account.

Notice the correlation between what I have said and my data, by the way. The RAS
latency on my system is 7 bus clocks, and the CAS latency is 2.5 clocks. The
ratio of random:serial ((RAS + CAS) / CAS) access latency should be 3.8. Empirical
data shows 3.72, but it is skewed low by a constant bias in the northbridge
latency.

>>AMD was, I believe, the first to introduce explicit prefetch capabilities to
>>their K6-2. Intel added prefetch capabilities in the P3, and every CPU since has
>>had it. It doesn't benefit all applications, but applications that can operate
>>on "very large static arrays of data" benefit quite nicely, particularly when
>>manipulations take more than a few cycles of time. I was reading a nice document
>>from Intel the other day published circa '99 about computation of prefetch
>>scheduling distance.
>
>Notice I am talking about random-access.  No prefetching in the world will
>help that.  And that is the worst-case latency example.  And it is going to
>be well over 100ns no matter what all the prefetching examples suggest...
>
>prefetching will actually hurt, in that case.

Prefetching will never hurt unless you speculatively load data that you never
use.

You are talking about random access, but I fail to see how -ALL- compute-bound
applications never hit serialized memory access. My thought is that the average
compute-bound application that Intel is targeting does far more serial memory
access than random memory access. Admittedly it's a difficult point to argue
since it's hard to get an average sample of applications and say which case is
true. Additionally, many applications access memory with non-deterministic patterns.

>>>>>>>>HT is merely a way to make the existing hardware more efficient. If it were
>>>>>>>>anything more, it would add -additional- hardware registers so the OS could
>>>>>>>>control the scheduling algorithm and specify the location of the ready queue. It
>>>>>>>>would also add instructions that would allow the processor to switch tasks.
>>>>>>>
>>>>>>>The processor _already_ is doing this.  But for processes that are ready to
>>>>>>>run rather than for processes that are long-term-blocked for I/O, etc.
>>>>>>
>>>>>>Yes, but the scheduler's job is to pick who runs, when they run, and how long
>>>>>>they run. HT only affects the first by allowing the scheduler to pick two tasks
>>>>>>to run instead of just one. HT isn't replacing the scheduler; it only
>>>>>>complicates it.
>>>>>
>>>>>No.  It simplifies the task when I can say "here run both of these".
>>>>>Rather than having to bounce back and forth as a quantum expires.  It
>>>>>will get better when HT goes to 4-way and beyond...
>>>>
>>>>Again, as far as the OS is concerned, HT = SMP.
>>>
>>>Certainly...  I hope I never said otherwise. But price-wise, HT != SMP
>>>by a huge margin.
>>
>>Depends on how you look at it. My AthlonMP box cost about $1,100 for the
>>-entire- system when it was brand new, and I opted for a nice case and 1 GB of
>>ram. Intel is charging $700 for an HT-enabled P4, more than what my board with
>>both chips cost. :-)
>>
>>Undoubtedly though it isn't costing much extra to add HT capabilities to the
>>chip.
>
>
>This is a different issue.  Intel is Intel.  AMD is AMD.  Why does a Lincoln
>sell for more than a Mercury?  Quality is not that different.  Name recognition
>is, however.

It is. It just amused me to make note that it is possible to get SMP for about
the same cost if you consider other brands. Certainly though it's not costing
Intel more than nickels and dimes, if that, to add HT to their high-end CPUs.

>>>>Are you trying to explain to me that SMP schedulers are simpler than single-CPU
>>>>schedulers? Synchronizing ready queues and picking multiple processes out is
>>>>easier than not synchronizing and picking the one off the top?
>>>>
>>>>-Matt
>>>
>>>
>>>Let's define simpler.  Do you mean "less source code"?  If so, no.  If you mean
>>>"simpler mechanism to choose between two compute-bound processes" then a
>>>resounding _yes_.  Let's offload the process scheduling, the quantum nonsense,
>>>and everything else related to that into the CPU hardware and right out of
>>>something the O/S is spending time doing.  Bouncing between two tasks at the
>>>O/S level is a _painful_ thing to do in terms of overhead cost.  Doing it at the
>>>SMT level is essentially free...
>>
>>I would still say no to the "simpler mechanism to choose between two
>>compute-bound processes." The quantum nonsense still exists, and process
>>scheduling still occurs. It occurs twice as often.
>
>With two compute bound processes it _never_ occurs.  Which is the point.
>Because there are two processors and two runnable processes.  There is no need
>to have quantum runout and run the process queue, when you _know_ there is
>nothing to run.
>
>I can't vouch for all systems, but the old xerox MP system I did O/S work on
>certainly behaved like this...

I'll check tomorrow how NT's scheduler works if I get a chance. I don't think NT
does it, anyway. I've seen the scheduler for the Linux kernel, and I don't
remember anything like that. Presumably most desktop OSes don't. There is no way
for a process to communicate to the kernel that it is compute-bound on NT or
OS/2, and I don't know of any way to do so on garden-variety System V OSes,
though those are not my specialty.

It would also be a nuisance for fault-tolerant systems. In the case of SMP, you
have no guarantee that you'll ever be interrupted by anything besides the timer.
Disabling the timer = cutting your own throat.

That actually -can- be resolved on x86 for a generic/real-time OS, I think. It
takes a bit of extra work. I did not know until earlier tonight that the Intel
timer chip had a watchdog timer built-in. It generates an NMI at intervals, and
it ticks at a maximum rate of 18.2 Hz. In a good implementation, the interrupt
handler could execute very quickly without a task switch or mode switch. From
what I understand, NT uses the watchdog timer, but it still doesn't allow a
processor to go compute-bound.

It would also be possible to do a "warm reboot" (slow because it requires going
through BIOS init of the CPU) of the particular CPU in question. It would be
very annoying to implement as you would have to implement a software watchdog
timer and delegate it to a CPU that can't go compute-bound. If the scheduler is
heavily optimized, it would be easy to excuse not implementing that feature.

>>Obviously the OS scheduler -can't- schedule with such a fine granularity because
>>of scheduling overhead. HT would simplify a scheduler that tried to, but since
>>none can, it's more of an extension to existing schedulers. It doesn't simplify
>>anything; it just makes it possible to schedule at sub-nanosecond intervals.
>>
>>-Matt
>
>It lets me pick the only two runnable processes, get them going, and then go
>away until an interrupt occurs.  That is more efficient.

Sort of. You have to keep in mind that HT looks, smells, and feels like SMP.
What actually happens is you have 2 interrupts occur, 2 instances of the
scheduler pick 2 processes to schedule, run them, and then wait for 2 interrupts
to occur. Without HT, an interrupt occurs, it picks 1 process to execute, runs
it, and waits for the interrupt to occur. No need for locking code if you don't
have SMP.

Over the duration of a single quantum, HT means you have to make 2 choices with
locking semantics, and lack of HT means you only have to make 1 choice and may
not need locks. As above, general/real-time OSes can't let a buggy program run
away with a CPU. If the scheduler disabled the timer and let the processes run
without needing to make another choice, they could potentially dominate the
entire system. Schedulers don't do that anyway, though it would be nice.

I was intending to implement that feature where the scheduler would go away
until a compute-bound process was done, but dialog in this thread has forced me
to reconsider. I make real-time guarantees.

-Matt





Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.