Computer Chess Club Archives



Subject: Re: final note I presume

Author: Robert Hyatt

Date: 07:24:55 12/17/02



On December 17, 2002 at 05:30:32, Matt Taylor wrote:

>On December 16, 2002 at 23:22:32, Robert Hyatt wrote:
>
>>On December 16, 2002 at 17:47:49, Matt Taylor wrote:
>>
>>>On December 15, 2002 at 22:01:10, Robert Hyatt wrote:
>>>
>>>>On December 15, 2002 at 05:53:37, Matt Taylor wrote:
>>>>
>>>>>On December 14, 2002 at 14:10:14, Robert Hyatt wrote:
>>>>>
>>>>>>>>>HT will not scale to large numbers of tasks. The IA-32 register set has 8 32-bit
>>>>>>>>>general registers, 8 80-bit FPU registers, 2 16-bit FPU control registers, and 8
>>>>>>>>>128-bit SSE registers. This means each logical CPU requires 244 bytes of
>>>>>>>>>application register file alone. For simplicity, I did not include the 3 groups
>>>>>>>>>of system registers, the MTRRs, or the MSRs. There are additional caches which
>>>>>>>>>would not allow HT to scale unless they were duplicated. Intel is not about to
>>>>>>>>>put MBs of fast-access register file on an IA-32 processor. It would make your
>>>>>>>>>128-cpu HT Pentium 5 cost more than a cluster of Itaniums with negligible
>>>>>>>>>performance gain over a dual- or quad-Xeon system.
>>>>>>>>
>>>>>>>>Want to bet?  10 years ago they could not put cache on-chip due to space
>>>>>>>>limitations.  That is now gone.  With up to 2mb of L2 cache on chip today,
>>>>>>>>they really can do whatever they want in the future.  And there are not
>>>>>>>>really duplicate sets of registers.  Just duplicate rename tables which are
>>>>>>>>much smaller since they only store small pointers to real registers, rather
>>>>>>>>than 32 bit values.
>>>>>>>
>>>>>>>If HT does not have 2 physical sets of registers, what do the remap tables point
>>>>>>>to? Intel docs actually state that the logical CPU has a duplicated set of
>>>>>>>registers.
>>>>>>
>>>>>>No they don't.  They have one large set of rename registers, and _two_
>>>>>>rename tables.  This has been discussed here in the past and can be found
>>>>>>in the intel docs.  If you had two separate sets of registers, how would
>>>>>>the execution units deal with that?  The answer:  they couldn't.  But by
>>>>>>using one large set of registers, with a separate rename table for each
>>>>>>processor, they simply take care to be sure that two concurrent threads
>>>>>>do not rename eax to the same rename register, and then there is no
>>>>>>confusion and the execution units don't care which thread produced the
>>>>>>micro-ops being executed.  It makes the design _way_ simpler and it is
>>>>>>definitely elegant...
>>>>>
>>>>>If you read what you just said, you're arguing the same thing I said. I wasn't
>>>>>specific with semantics, but for 2 logical CPUs you need at least twice the
>>>>>basic number of physical registers. For 8 CPUs, you will need 8 times that
>>>>>number. You can't have 32,768 CPUs without 32,768 physical sets of registers. It
>>>>>doesn't matter if they get lumped together. You still need space.
>>>>
>>>>yes...  but there is a big difference between two separate register files and
>>>>one larger register file.    And if you are having problems executing
>>>>instructions for all threads, you certainly aren't using all the rename
>>>>registers that are available...
>>>>
>>>>But the interesting thing is that to go to 4-way SMT it is very likely that you
>>>>don't have to go to 4x the rename register file size.  You can tune for the
>>>>typical case and just allow it to run out in rare cases, and stall the decoder
>>>>until some instructions retire and free up more registers.
>>>
>>>You would still need somewhere to store the registers. They have to be
>>>persistent. Due to the architecture, the OS has to supply a task state segment.
>>>It would be possible to dump the registers to the segment, but it would be even
>>>more costly (incurring extra memory writes), and it would only work for
>>>integer/segment registers.
>>
>>Not quite.  The OS doesn't get access to the rename register space, just the
>>normal EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP, etc.  Because it is in the
>>"rename space" automatically as O/S instructions are decoded into micro-ops..
>>
>>There's no reason to try to dump the rename register space as that is completely
>>hidden from the O/S entirely.  The OS only needs to save/restore the normal set
>>of registers above...
>>
>>
>>
>>>
>>>>>>>Also, it is important to differentiate between registers and L2 cache which can
>>>>>>>take from 6-300 clocks to access. Even the 8 KB L1 cache that the P4 has takes 2
>>>>>>>cycles to access. If it were possible to build a chip with such vast amounts of
>>>>>>>fast cache memory, there would be no such thing as an on-chip cache hierarchy;
>>>>>>>there would be an L1 cache on-chip and -maybe- an L2 cache on the motherboard.
>>>>>>
>>>>>>
>>>>>>It isn't the "size" that counts.  It is the physical space that counts.
>>>>>>To make it larger, you limit where you can put it on the chip.  The
>>>>>>farther away, the longer it takes to access the data.  Small caches can
>>>>>>be tucked into holes here and there.  But not a 2mb cache that takes
>>>>>>10-12M transistors just for the flip-flops for the bits, not counting
>>>>>>the addressing and so forth.
>>>>>
>>>>>8 times the number of registers = 8 times the size...
>>>>>When you include system registers, I would expect much closer to 1 KB or so of
>>>>>register space, and perhaps even more. I forgot to also include the segment
>>>>>registers, each a bit bulkier than 8 bytes. You said that size limits the
>>>>>distance. If so, then how does a large amount of register space -not- limit the
>>>>>distance?
>>>>
>>>>You don't need _that_ large an amount.  IE 16kb of register space is huge.  And
>>>>it is certainly doable on-chip and close by.  Even 1K is probably more than
>>>>enough to get thru at least 4-way SMT, that is 256 registers which is a bunch.
>>>
>>>8 GPRs * 4 bytes = 32 bytes
>>>1 flags * 4 bytes = 4 bytes
>>>8 FPU regs * 10 bytes = 80 bytes
>>>6 segregs * 8 bytes (actually more, CPU unpacks the data) = 48 bytes
>>>8 SSE regs * 16 bytes = 128 bytes
>>
>>renaming doesn't apply to fp/SSE so far as I know.
>>
>>
>>
>>>4 control regs * 4 bytes = 16 bytes
>>>6 debug regs * 4 bytes = 24 bytes
>>>1 TR * 4 bytes (probably more with cache...) = 4 bytes
>>>1 GDTR * 6 bytes = 6 bytes
>>>1 IDTR * 6 bytes = 6 bytes
>>>1 LDTR * 6 bytes = 6 bytes
>>>
>>>I lost count of the MSRs around 128, and I scrolled through 4 or 5 pages more. I
>>>would estimate at least 160. Not all are writable, and not all are 64-bits wide.
>>>However, it would be safe to assume (I think) that you have at least 100*8 bytes
>>>= 800 bytes of r/w MSRs. (Most of the space is consumed by MSRs related to
>>>performance counters and machine check.) You have an additional 354 bytes of
>>>register space from the above register set, most of which can't be cached in
>>>main memory.
>>>
>>
>>I don't think any of those are renamable.  Only the 8 primary registers play
>>into that, plus the flags, so far as I recall...  but that is a recollection
>>and I did not go back and verify it in the old Intel papers I have around.
>>
>>
>>
>>
>>>I did exclude the test registers, by the way, because even though they are r/w,
>>>they're used to latch commands to the TLB for testing.
>>
>>Again, none of those are in the rename space.
>>
>>
>>>
>>>The segregs, GPRs, and LDTR all get loaded from the TSS. They can be discarded
>>>at will (for a hefty price in most cases). However, the bulk of the register set
>>>can't be.
>>>
>>>Also consider that AMD is pushing their 64-bit extensions to IA-32, and it seems
>>>probable that Intel will actually follow AMD's lead for the second time in the
>>>history of x86. There are rumors (which Intel will neither confirm nor deny) that
>>>Intel is working on a chip called Yamhill which is compatible with AMD x86-64.
>>>This doubles the number of SSE registers, doubles the number of GPRs, and
>>>doubles the width of the GPRs. That's 96+128=224 bytes more that you'll need.
>>
>>
>>Are you sure you understand the register renaming idea?  You don't duplicate
>>the entire set of registers.  You simply have a set of registers that the
>>rename table can point to.  And most registers are not renamed since they
>>are not often the subject of  something like:
>>
>>      mov eax, something
>>      inc eax
>>      mov something, eax
>>      mov eax, something_else
>>      dec eax
>>      mov something_else, eax
>>
>>Register renaming turns that apparent set of operand conflicts (eax) into
>>two completely independent sets of three instructions that use a different
>>register for each set...
>>
>>There would be no point in doing that for MSRs, and the like...  they aren't
>>used like the above.
>
>Yes, but what I'm saying is that you need register space somewhere for these
>other registers, and a logical CPU needs to be small to fit on-die. If memory
>serves correctly, the Athlon has about 40 rename registers. One could expect a
>similar number from the P4.

OK... I misunderstood your comment in that regard.  However, there is something
you probably didn't think about:  not _all_ registers need to be replicated.

For example, some of the MSR stuff _can't_ be replicated.  How could you tell
one logical cpu to handle this block of memory in one way, and another logical
cpu to handle this block of memory in another way, when both use the same
cache, etc.?  But in the general case, you are certainly right that multiple
logical cpus require separate "states" for the most part.
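
To make that split concrete, here is a rough C sketch of my own (not Intel's
actual layout; the field sizes just follow the byte counts quoted earlier in
this thread) of what has to exist per logical cpu versus what stays shared:

    struct logical_cpu_state {             /* replicated per logical cpu      */
        unsigned int   gpr[8];             /* EAX..EDI, 32 bytes              */
        unsigned int   eflags, eip;
        unsigned char  x87[8][10];         /* 8 x 80-bit FP regs, 80 bytes    */
        unsigned char  xmm[8][16];         /* 8 x 128-bit SSE regs, 128 bytes */
        unsigned short seg[6];             /* segment selectors               */
    };

    struct physical_cpu {
        struct logical_cpu_state logical[2];
        /* one copy, shared and arbitrated between the logical cpus:
           caches, MTRRs, most MSRs, execution units, and the big pool
           of rename registers */
    };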


>
>What I'm trying to say is that you won't need a huge rename space, but other
>things grow fast. SSE is big, and I don't think it's renamed either. It's slow
>because it requires

OK... we agree there..

However, remember one point.   10 years ago architects were asking "what can
we eliminate to make this cpu fit into X transistors?"  Today they are asking
"what can we do with all these transistors?"  10 years ago the Sun SPARC didn't
have integer multiply/divide.  Today we have FPUs with sine, cosine, etc.  Next
year chips will have 100M transistors.  Taking 5M for a megabit of SRAM won't
raise any eyebrows, I'll wager.  :)


>
>>>>>>>The other argument worth making is that HT will hit diminishing returns very
>>>>>>>quickly. It -may- not even be worth going to quad-HT. The main reason why HT
>>>>>>>gets any performance gains is because two threads don't fully utilize the CPU's
>>>>>>>execution capacity. It is convenient when a cache miss occurs because one thread
>>>>>>>can utilize the full capacity of the processor, but across -most- applications
>>>>>>>that is rare. Additionally, the processor has the ability to speculatively fetch
>>>>>>>data and code. Cache misses are rare.
>>>>>>
>>>>>>They aren't rare.  And you aren't thinking the math thru.  Intel hopes
>>>>>>for 95% hits.  That leaves 5% misses.  And for each miss, we sit for 400
>>>>>>clocks or so waiting.  IE one out of every 20 accesses misses, and the
>>>>>>slowdown is tremendous.  HT fills in those long idle periods pretty well
>>>>>>and going to 4 or 8 should not hit any asymptote assuming they can feed
>>>>>>micro-ops to the core quickly enough...
>>>>>
>>>>>They can feed up to 3 u-ops/cycle. Splitting 3 u-ops/cycle among 4-8 threads...
>>>>>That also neglects the fact that now you have 4-8 threads competing for main
>>>>>memory, so it may take 4-8 times longer to fill a request.
>>>>
>>>>Of course it might.  But that is _the_ point.  For those cases, SMT won't do
>>>>much.  Any more than running two chess engines on a single cpu is very
>>>>efficient.  But for the right mix the gain is significant.  IE I can run one
>>>>compute-bound process and one heavily I/O-bound process and if they each take
>>>>one hour to run by themselves, they will not take much more than one hour to
>>>>run together.  Run two I/O jobs together and they take two hours.  Ditto for
>>>>two compute bound processes.  But for reasonable "mixes" the gain is
>>>>significant, and for the optimal mixes, the gain is tremendous...
>>>>
>>>>SMT is no different.  It won't speed up all kinds of applications.  But it will
>>>>speed up many.  And in particular, it is very effective for threaded apps that
>>>>share everything thru the L1/L2 caches anyway...  making cache-thrashing less of
>>>>a problem.
>>>
>>>Cache thrashing isn't a problem if your scheduler knows well enough to schedule
>>>your other thread instead of John Doe's other thread on the physical CPU you're
>>>sharing.
>>
>>_if_ it is ready to run.  If it is ready to run at the time the second logical
>>processor becomes idle.  Etc...
>
>If the thread is ready to run, it can get scheduled on the second logical CPU.
>If both threads are scheduled for the same quantum, they'll execute
>concurrently. They won't execute interleaved; they execute -concurrently-. John
>Doe can steal away some of your execution bandwidth. Even if he's doing I/O, I/O
>can cost a lot of execution cycles (doing things like thread transitions, memory
>copying, etc.)

How is he going to _steal_ from me when he is _not_ running?  Except that in
the case of disks, if we both go to the same disk then we get hung up behind
each other on one I/O device.  And the threads don't necessarily execute
concurrently.  The physical cpu has to make a choice when both threads are ready
to run.  Just like the process scheduler has to make a choice when two
processes are in the ready state.  In both cases, the cpu executes "slices"
of each process, interlaced.

It _must_.

>
>A scheduler that doesn't know about HT can impede your performance unless you're
>fairly certain that you're going to occupy all logical CPUs in the system.

Absolutely.  I screwed up some test results yesterday.  I ran two single-cpu
tests at the same time, which was fine on my quad xeon.  But on my SMT
machine, on one test my nps was 1/2 of what it should have been.  The damned
system scheduled two processes on one physical cpu, leaving the other idle.
The kernel guys are looking at this.  I am studying it myself and might send
them some scheduler patches if I get it to work reasonably.
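
For illustration, something like the following would pin each test process to
its own logical cpu so the two runs can't land on the same physical package.
This is only a sketch: it assumes a Linux that exposes sched_setaffinity(),
and the choice of logical cpus 0 and 2 as sitting on different physical
packages is an assumption you would have to verify against /proc/cpuinfo.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling process to one logical cpu. */
    int pin_to_cpu(int cpu)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* 0 = self */
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }

Call pin_to_cpu(0) in one test and pin_to_cpu(2) in the other before the
search starts, and the scheduler can no longer stack both on one physical cpu.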

>
>>>>>>>One of my machines has a BIOS option to disable the on-chip caches. When I
>>>>>>>disable it, my 1.2 GHz Thunderbird runs extremely slow. Every memory access
>>>>>>>effectively becomes a cache miss. If you have a machine with this option, you
>>>>>>>can try it and see. If cache misses happened often enough to make a viable
>>>>>>>impact on HT, you wouldn't see a big difference.
>>>>>>
>>>>>>Again, you are not thinking thru on the math.  Turn L1/L2 off and
>>>>>>_everything_ takes 400 clocks to read.  Turn it on and 95% of the stuff
>>>>>>doesn't, but if you know anything about Amdahl's law, that last 5% is a
>>>>>>killer, meaning that even if you drop the cpu clock cycle time to zero
>>>>>>picoseconds, the cpu could not run 20x faster due to that last 5% being
>>>>>>bound by memory speed, not processor clock cycle time.
>>>>>
>>>>>Amdahl's Law applies to parallel computing. Memory accesses aren't really a
>>>>>parallel computation problem...
>>>>
>>>>They are in this context, because you can do anything you want to that 95%
>>>>of the code that runs in cache, but that last 5% is the killer and that is the
>>>>part that makes SMT work...
>>>>
>>>>
>>>>
>>>>>
>>>>>Actually, it doesn't take 400 clocks to read memory; it takes 105 clocks for my
>>>>>AthlonMP 1600 to read a random segment of memory for me (multiplier * 10 bus
>>>>>clocks). That assumes that you have to re-latch RAS. If my processor doesn't
>>>>>need to re-latch RAS, then it takes only about 27 clocks. Most memory accesses
>>>>>are stack accesses, and stack accesses are almost -always- cache hits.
>>>>
>>>>It only takes 10 clocks on a 100 MHz machine.  But I am talking top-end which is
>>>>in the 3,000 MHz range.  And there it is 300+ clocks, period...  You keep saying
>>>>that same thing about the stack.  And it is true for _some_ programs.  It is
>>>>_not_ true for many.  Crafty is one prime example that doesn't use much stack
>>>>space between procedure calls.  I have several other similar programs here
>>>>such as a couple that do molecular modeling, some simulation codes, and
>>>>similar things that use very large static arrays of data.
>>>
>>>A 3.06 GHz Pentium 4 has a multiplier of 23.0. On pc2100 DDR SDRAM, this is 230
>>>clocks worst-case, 43-56 clocks when RAS latency isn't incurred.
>>
>>I will say this again, carefully.  I _run_ on these machines.  I haven't found
>>a single machine in a year that can produce that kind of low latency.  Just do
>>random accesses, spread out at _least_ as long as a cache line, and run it for
>>a good while.  I get 100-130 ns as the average latency.  At 3.06 GHz, that turns
>>into 300 clocks +...
>
>On SMP it's worse. On RDRAM it's worse. And 230 clocks = 97 ns.
>
>A random access is that high. A typical access in many compute-bound
>applications is not. When you hit the same row in memory, you won't incur the
>full latency. When you're always doing random probes to memory, yes, it will be
>higher. A lot of compute-bound applications aren't completely random memory
>probes. The central assumption that I have made is that Intel is marketing HT
>to a broader compute-bound market. It obviously means nothing (well, maybe
>marketing hype) to the desktop market.

Again, we don't disagree in general.  My point is that a read "over there" to
a new place is going to take > 100ns.  Another read "over there" might be
blazingly fast.  If I do it...  But when I get a "wild hair" to read something
"over there" it is going to be expensive.


>
>>Someone in r.g.c.c. argued the same point.  But when he ran the above test,
>>he came back and reported "You are correct, I get 135ns latency, not the
>>60 I was expecting"
>
>I am not someone in the r.g.c.c., and what I have said approximately agrees with
>your figure. I realize now that I did not include the latency from the
>northbridge to the CPU, which accounts for the larger figure. However, the
>figure is still smaller than 135 ns in most cases.

I think the 135 figure came from RDRAM.  non-rambus was closer to 110/120 or
so...

>
>Yes, a -random- probe will incur around that much latency. I have dotted my i's
>and crossed my t's here and been careful to state every time that the smaller
>figure is when you don't need to relatch the row. Not all applications require
>random probes. A few do some random probes and a lot of serial access.

I understand.  The point being, however, that _every_ memory read is going to
incur a huge penalty.  Whether it is a random-access 120+ns, or a block read
that lowers the average latency to 2/3 of that, doesn't really matter to SMT.
A couple of hundred clocks waiting on memory gives another thread or two or
three _plenty_ of time to smoke thru some micro-ops and benefit.
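
Back-of-the-envelope: at roughly 3 GHz, a 120 ns miss is on the order of 360
clock cycles, and a core that can retire up to 3 micro-ops per cycle could, in
principle, push through something like a thousand micro-ops from another
thread while that one load is still in flight.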


>
>I also did some measurements. On serial probes, I incur about 52 cycles of
>latency (37 ns) on my AthlonMP 1600 system. This is a multiplier of 10.5 or
>effectively 5 bus cycles. Random probes are 138 ns latency for me which is
>approximately equal to your figure. It would also completely agree with my
>previous statements if northbridge latency were taken into account.
>
>Notice the correlation between what I have said and my data, by the way. The RAS
>latency on my system is 7 bus clocks, and the CAS latency is 2.5 clocks. The
>ratio of random:serial (RAS + CAS / CAS) access latency should be 3.8. Empirical
>data shows 3.72, but it is skewed low by a constant bias in the northbridge
>latency.

OK... but even for serial memory reads, you suck in 64 bytes (or whatever AMD
uses for L2 linesize) and that still takes 60 clocks, with nothing that can be
done in the meanwhile.  That second SMT thread would have 60 clocks to
do something that _does_ fit into L1/L2 and avoid the memory bottleneck...



>
>>>AMD was, I believe, the first to introduce explicit prefetch capabilities to
>>>their K6-2. Intel added prefetch capabilities in the P3, and every CPU since has
>>>had it. It doesn't benefit all applications, but applications that can operate
>>>on "very large static arrays of data" benefit quite nicely, particularly when
>>>manipulations take more than a few cycles of time. I was reading a nice document
>>>from Intel the other day published ca. '99 about computation of prefetch
>>>scheduling distance.
>>
>>Notice I am talking about random-access.  No prefetching in the world will
>>help that.  And that is the worst-case latency example.  And it is going to
>>be well over 100ns no matter what all the prefetching examples suggest...
>>
>>prefetching will actually hurt, in that case.
>
>Prefetching will never hurt unless you speculatively load data that you never
>use.

That was the point.  And when it hurts, it _really_ hurts, as you just doubled
the latency.
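
To make the contrast concrete, here is a hedged sketch using GCC's
__builtin_prefetch (the prefetch distance of one 64-byte line ahead is just a
starting guess, not a tuned value):

    /* Serial scan: the future addresses are known, so we can ask for the
       next line early and hide most of the latency behind useful work. */
    long sum_serial(const int *a, long n)
    {
        long i, s = 0;

        for (i = 0; i < n; i++) {
            __builtin_prefetch(&a[i + 16]);   /* 16 ints = one 64-byte line */
            s += a[i];
        }
        return s;
    }

    /* Random chase: the next address is only known after the current load
       completes, so there is nothing useful to prefetch -- and prefetching
       a guess just drags in lines that never get used. */
    long chase(void **p, long n)
    {
        long i;

        for (i = 0; i < n; i++)
            p = (void **)*p;
        return (long)p;
    }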

>
>You are talking about random access, but I fail to see how -ALL- compute-bound
>applications never hit serialized memory access. My thought is that the average
>compute-bound application that Intel is targetting does far more serial memory
>access than random memory access. Admittedly it's a difficult point to argue
>since it's hard to get an average sample of applications and say which case is
>true. Additionally, many access memory with non-deterministic patterns.
>


Clearly serial memory referencing is done.  More by some apps than by others.
In the random access case, the SMT cpu has a _bunch_ of clock cycles to spend
on another thread or three.  In the serial case, it only has "half-a-bunch"
but that is _still_ a significant amount of cycles that would be lost forever
without SMT.

>>>>>>>>>HT is merely a way to make the existing hardware more efficient. If it were
>>>>>>>>>anything more, it would add -additional- hardware registers so the OS could
>>>>>>>>>control the scheduling algorithm and specify the location of the ready queue. It
>>>>>>>>>would also add instructions that would allow the processor to switch tasks.
>>>>>>>>
>>>>>>>>The processor _already_ is doing this.  But for processes that are ready to
>>>>>>>>run rather than for processes that are long-term-blocked for I/O, etc.
>>>>>>>
>>>>>>>Yes, but the scheduler's job is to pick who runs, when they run, and how long
>>>>>>>they run. HT only affects the first by allowing the scheduler to pick two tasks
>>>>>>>to run instead of just one. HT isn't replacing the scheduler; it only
>>>>>>>complicates it.
>>>>>>
>>>>>>No.  It simplifies the task when I can say "here run both of these".
>>>>>>Rather than having to bounce back and forth as a quantum expires.  It
>>>>>>will get better when HT goes to 4-way and beyond...
>>>>>
>>>>>Again, as far as the OS is concerned, HT = SMP.
>>>>
>>>>Certainly...  I hope I never said otherwise. But price-wise, HT != SMP
>>>>by a huge margin.
>>>
>>>Depends on how you look at it. My AthlonMP box cost about $1,100 for the
>>>-entire- system when it was brand new, and I opted for a nice case and 1 GB of
>>>ram. Intel is charging $700 for an HT-enabled P4, more than what my board with
>>>both chips cost. :-)
>>>
>>>Undoubtedly though it isn't costing much extra to add HT capabilities to the
>>>chip.
>>
>>
>>This is a different issue.  Intel is Intel.  AMD is AMD.  Why does a Lincoln
>>sell for more than a Mercury?  Quality is not that different.  Name recognition
>>is, however.
>
>It is. It just amused me to make note that it is possible to get SMP for about
>the same cost if you consider other brands. Certainly though it's not costing
>Intel more than nickels and dimes, if that, to add HT to their high-end CPUs.

I'm not a fan of either Intel or AMD, personally.  I'm a fan of Cray.  Always
have been.  And if we talk micros, I am a fan of the alpha, may it rest in
peace.  :)

However, the intel guys have done some good things.  As have the AMD folks.

But a couple of bad K5 experiences have left me cold with respect to AMD.





>
>>>>>Are you trying to explain to me that SMP schedulers are simpler than single-CPU
>>>>>schedulers? Synchronizing ready queues and picking multiple processes out is
>>>>>easier than not synchronizing and picking the one off the top?
>>>>>
>>>>>-Matt
>>>>
>>>>
>>>>Let's define simpler.  Do you mean "less source code"?  If so, no.  If you mean
>>>>"simpler mechanism to choose between two compute-bound processes" then a
>>>>resounding _yes_.  Let's offload the process scheduling, the quantum nonsense,
>>>>and everything else related to that into the CPU hardware and right out of
>>>>something the O/S is spending time doing.  Bouncing between two tasks at the
>>>>O/S level is a _painful_ thing to do in terms of overhead cost.  Doing it at the
>>>>SMT level is essentially free...
>>>
>>>I would still say no to the "simpler mechanism to choose between two
>>>compute-bound processes." The quantum nonsense still exists, and process
>>>scheduling still occurs. It occurs twice as often.
>>
>>With two compute bound processes it _never_ occurs.  Which is the point.
>>Because there are two processors and two runnable processes.  There is no need
>>to have quantum runout and run the process queue, when you _know_ there is
>>nothing to run.
>>
>>I can't vouch for all systems, but the old xerox MP system I did O/S work on
>>certainly behaved like this...
>
>I'll check tomorrow how NT's scheduler works if I get a chance. I don't think NT
>does it, anyway. I've seen the scheduler for the Linux kernel, and I don't
>remember anything like that. Presumably most desktop OSes don't. There is no way
>for a process to communicate to the kernel that it is compute-bound on NT or
>OS/2, and I don't know of any way to do so on garden-variety System V OSes,
>though that is not my paradigm.
>

Linux has the "hard clock" code.  Each time it is triggered, it updates the TOD
clock and then, if the current quantum has expired, asks "is there anything else
to run?"  If the answer is no, it just keeps on trucking...
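
In rough C form, the idea is something like this (my sketch, not the actual
Linux source; all of the names here are made up for illustration):

    #define DEFAULT_QUANTUM 10                  /* ticks per time slice */

    struct task { int quantum; };

    static int need_resched;                    /* checked on interrupt return */

    extern void update_time_of_day(void);       /* hypothetical helpers */
    extern int  another_task_is_ready(void);

    void timer_tick(struct task *current)
    {
        update_time_of_day();                   /* always keep the TOD clock right */

        if (--current->quantum > 0)
            return;                             /* quantum left: keep on trucking */

        if (!another_task_is_ready()) {
            current->quantum = DEFAULT_QUANTUM; /* nothing else to run: just reload */
            return;
        }

        need_resched = 1;                       /* pick a new task on the way out */
    }

That last branch is the point: if nothing else is runnable, the quantum runout
costs essentially nothing.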


>It would also be a nuisance for fault-tolerant systems. In the case of SMP, you
>have no guarantee that you'll ever be interrupted by anything besides the timer.
>Disabling the timer = cutting your own throat.

I didn't say "disable the timer."  Although it would be moot for fault-tolerant
systems, since if interrupts get jammed you are busted anyway.  I said "ignore
the timer," since if no interrupts arrive, no process can possibly change its
state from blocked to ready...


>
>That actually -can- be resolved on x86 for a generic/real-time OS, I think. It
>takes a bit of extra work. I did not know until earlier tonight that the Intel
>timer chip had a watchdog timer built-in. It generates an NMI at intervals, and
>it ticks at a maximum of 18.2 ms. In a good implementation, the interrupt
>handler could execute very quickly without a task switch or mode switch. From
>what I understand, NT uses the watchdog timer, but it still doesn't allow a
>processor to go compute-bound.
>
>It would also be possible to do a "warm reboot" (slow because it requires going
>through BIOS init of the CPU) of the particular CPU in question. It would be
>very annoying to implement as you would have to implement a software watchdog
>timer and delegate it to a CPU that can't go compute-bound. If the scheduler is
>heavily optimized, it would be easy to excuse not implementing that feature.
>
>>>Obviously the OS scheduler -can't- schedule with such a fine granularity because
>>>of scheduling overhead. HT would simplify a scheduler that tried to, but since
>>>none can, it's more of an extension to existing schedulers. It doesn't simplify
>>>anything; it just makes it possible to schedule at sub-nanosecond intervals.
>>>
>>>-Matt
>>
>>It lets me pick the only two runnable processes, get them going, and then go
>>away until an interrupt occurs.  That is more efficient.
>
>Sort of. You have to keep in mind that HT looks, smells, and feels like SMP.
>What actually happens is you have 2 interrupts occur, 2 instances of the
>scheduler pick 2 processes to schedule, run them, and then wait for 2 interrupts
>to occur. Without HT, an interrupt occurs, it picks 1 process to execute, runs
>it, and waits for the interrupt to occur. No need for locking code if you don't
>have SMP.


Actually the interrupt from the timer occurs every fixed interval of time
regardless of what is going on, since the PC is based on a real-time clock.
And you don't get multiple timer interrupts; you should see just one, which
runs thru the APIC to one processor, which updates everything and might have
to tell another "hey, check your process, it's been running a while."






>
>Over the duration of a single quantum, HT means you have to make 2 choices with
>locking semantics, and lack of HT means you only have to make 1 choice and may
>not need locks. As above, general/real-time OSes can't let a buggy program run
>away with a CPU. If the scheduler disabled the timer and let the processes run
>without needing to make another choice, they could potentially dominate the
>entire system. Schedulers don't do that anyway, though it would be nice.

>
>I was intending to implement that feature where the scheduler would go away
>until a compute-bound process was done, but dialog in this thread has forced me
>to reconsider. I make real-time guarantees.
>
>-Matt


You still can.  You don't turn the timer off.  But when you decide that
"quantum is up" and nothing has changed, you just reload, restart, and
continue as is...


