Computer Chess Club Archives



Subject: Re: Magic 200MHz

Author: Robert Hyatt

Date: 13:39:08 05/27/03


On May 27, 2003 at 13:23:24, Tom Kerrigan wrote:

>On May 27, 2003 at 11:05:27, Robert Hyatt wrote:
>
>>>So how do you explain your statement that no OSs you've "tested" issue halts? I
>>>mean, Linux issues halts. Did you not "test" Linux?
>>
>>I'll leave that for _you_ to figure out.  You can find an explanation
>>in the "scheduler idle loop" code.
>
>Suck it up, Bob, and admit you were wrong. It's painfully obvious that you're
>not contradicting me, just handwaving and backpedaling enough to give yourself a
>heart attack. "I fiddle with the source code" and "I'll leave that for you to
>find out." Yeah, right, Bob. Do you think that if you continue with this asinine
>behavior, everybody will get confused and just assume you're right?
>
>From LinuxHQ, "The Linux Information Headquarters,"
>
>"Regardless, you should be aware that even if you don't enable any power
>management on your laptop, on the x86 architecture Linux will always issue the
>"hlt" instruction to your processor whenever nothing needs to be done. This
>results in lowering the power consumption of your CPU. Note that the system
>doesn't power down when it receives the hlt instruction; it just stops executing
>instructions until there is an interrupt."
>
>http://www.linuxhq.com/ldp/howto/mini/Battery-Powered/powermgm.html
>
>I can find a dozen other pages that say Linux issues halts at the drop of a hat.
>Just say the word and I'll make this even more embarrassing for you. (Although
>that's hard to imagine, isn't it?)

Not embarrassing for me at all.  The point I was making is that the
scheduler _first_ spins for a while and only _then_ issues a halt.  How
long "a while" is has varied with kernel version, from 333 ms downward.
For the longest time you had to _configure_ the kernel to execute a halt
at all, and that was used primarily on laptops to conserve power.
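
To make the pattern concrete, here is a minimal standalone sketch of that
spin-then-halt idle loop.  The helper names mirror real kernel ones
(need_resched, cpu_relax, safe_halt), but the bodies here are stubs and the
spin budget is purely illustrative, not any particular kernel's:

    /* spin_then_halt.c -- hypothetical sketch of the "spin first,
     * halt second" idle-loop pattern.  Stubs stand in for the real
     * kernel helpers so the sketch compiles standalone. */
    #include <stdbool.h>

    #define SPIN_BUDGET 100000UL    /* illustrative; the real budget
                                       has varied by kernel version */

    static volatile bool work_pending;   /* set from an "interrupt" */

    static bool need_resched(void) { return work_pending; }
    static void cpu_relax(void)    { __asm__ volatile("pause"); }
    static void safe_halt(void)    { /* kernel would do sti; hlt */ }

    static void idle_loop(void)
    {
        for (;;) {
            unsigned long spins = SPIN_BUDGET;

            /* Phase 1: poll for runnable work without halting. */
            while (spins-- && !need_resched())
                cpu_relax();

            /* Phase 2: still idle, so halt until an interrupt. */
            if (!need_resched())
                safe_halt();
        }
    }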

On a current Linux box, the "second" cpu is _very_ busy handling timer
interrupts and anything else the APIC sends its way.  So you won't find
one cpu sitting "dead" very often in a normal scenario...  Which means the
"resource split" will always be an issue unless you break out the soldering
iron and start disconnecting IRQ lines.

Additionally, many systems run an "idle process" at a ridiculously low
priority to soak up "lost" processor cycles for accounting purposes.  Such
an "idle task" totally eliminates any "halt" on systems doing that kind of
accounting.
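
As an illustration, a hypothetical userspace approximation of such an
accounting idle task (a real system would hook this into the scheduler
itself; the nice value and counter here are just for the sketch):

    /* idle_soak.c -- toy stand-in for an accounting "idle task":  a
     * lowest-priority busy loop that soaks up otherwise-idle cycles
     * and, as a side effect, keeps the CPU from ever reaching the
     * halt in the scheduler's idle loop. */
    #include <sys/resource.h>

    int main(void)
    {
        /* Weakest conventional priority; a real accounting task
         * would sit in an even lower, scheduler-internal slot. */
        setpriority(PRIO_PROCESS, 0, 19);

        volatile unsigned long burned = 0;
        for (;;)
            burned++;   /* every cycle charged here was idle time */
    }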

Linux goes even further, with migration threads and softirq threads.  Halts
are not exactly an N-times-per-second event on Linux...

However, if you want to assume they are, we can still discuss the resource
split as there are plenty of interrupts to keep a single idle logical processor
from sitting idle very long.  That's the purpose of the APIC in the first place,
so that idle processors get interrupts rather than processors that are actively
executing userspace/kernelspace code that is useful.

I personally think this "halt" stuff is pretty moot overall.  But it would
be easy enough to remove the halt from the idle loop and see what that does
to SMT performance, as a simple test.  Several people have had to remove it
from new IBM servers because those machines have some sort of bug that
prevents some interrupts from terminating a halt, introducing big delays in
interrupt-processing latency.
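
On a reasonably recent kernel no patching is even needed:  booting with

    idle=poll

on the kernel command line replaces the halting idle loop with a pure
polling one, which would make for a clean before/after SMT comparison.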




>
>>>>First, your statement is wrong.  There have been reports that running a
>>>>single thread with SMT on runs slower than a single thread with SMT off.
>>>>
>>>>That was where this discussion started on RC5 in fact.
>>>
>>>Really... what post was that? Because I can only find posts saying that RC5
>>>slows down with two threads, not one.
>>
>>It was not clear in the post I responded to.  Which is why I specifically
>>asked "did this run one thread on a SMT-enabled cpu or did it run _two_
>>threads?"
>
>That's great, Bob. "It was not clear"?? How can you begin to justify telling me
>I'm wrong when "it was not clear" how many threads were running? (And, BTW, it
>became clear that multiple threads were running.)

It never was 'clear' to me, which is _why_ I asked the question.

However, back to your "idea" about shared write-combine buffers.  You claim
that if a single thread needs 5, then running two such threads will slow
things down.  I'm not sure I buy that.  If you run _two_ threads they need
10 between them, and they get 8 combined.  The two threads combined should
still run _faster_ than a single thread by itself.  I _still_ don't see any
particularly reasonable explanation for why two threads would run slower
_combined_ than one would run alone.  Even if one thread needs all 8 WC
buffers, running two would mean each gets 1/2 of what it needs, and combined
they should run at the _same_ speed.
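
Putting toy numbers on that argument (assuming, purely for illustration,
that a thread's throughput scales linearly with the fraction of the WC
buffers it needs that it actually gets):

    /* wc_model.c -- toy linear model of the buffer argument above;
     * the scaling assumption is illustrative, not a claim about how
     * the P4 actually behaves. */
    #include <stdio.h>

    static double speed(double have, double need)
    {
        return have >= need ? 1.0 : have / need;
    }

    int main(void)
    {
        double total = 8.0;   /* WC buffers on the chip */
        double need  = 5.0;   /* per-thread demand      */

        printf("one thread alone  : %.2f\n", speed(total, need));
        printf("two threads, total: %.2f\n", 2 * speed(total / 2, need));
        /* Prints 1.00 vs 1.60: combined throughput goes up, not
         * down.  With need = 8.0 the two cases tie at 1.00. */
        return 0;
    }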

So, again, I do _not_ follow your logic.  If you want to say "running one
thread that needs 5 will make it run slower if _another_ thread is also running
because that thread steals WC buffers that I need" then I'll buy that.  But
not if _both_ threads are running the _same_ application (and with RC5 that
is certainly the case)...

So back to the drawing board.  And your "RAM example" is flawed and doesn't
work in this context, because when you run out of RAM you go to disk.  But
when you run out of WC buffers you are simply out, and you wait.  There is
no extra slow-access tier thrown in.


>
>>>>So I don't quite see what your point is unless it is to reinforce _my_ point
>>>>about SMT not slowing things down in any way I can see...  Unless we talk about
>>>>the case of running two threads using two logical cpus being slower than running
>>>>one thread on one real cpu.  I can see where _that_ could cause speed issues in
>>>>lots of ways, particularly with a parallel search.
>>>
>>>Hmm, that's interesting, because you couldn't understand that a few days ago.
>>
>>I've understood that from the beginning.  Because I have reported that SMT
>>does _not_ run twice as fast except for a rare program here and there.
>
>Whoa, you're going to blow me down with the handwaving. Is your point that SMT
>might slow things down (towards the beginning of the quote) or that it's not
>twice as fast (end of quote).

Both.  Parallel algorithms are by definition _not_ as efficient as serial
algorithms.  They can be _close_, but never exactly as efficient, because of
locking and shared-memory overhead.  So in the best case, a parallel
algorithm on two processors at clock/2 speed will be _close_ to a serial
algorithm on one processor at full clock speed.  The closer the better, but
it is certainly limited by whatever synchronization overhead is required.
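
As a toy worked example (s here is an assumed synchronization-overhead
fraction, nothing measured):

    useful_work = 2 * (clock/2) * (1 - s) = clock * (1 - s)

so the two half-speed processors approach, but never reach, the single
full-speed processor as s goes to zero.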

SMT will never be twice as fast because there are not enough resources
within the CPU to keep both logical processors running full-bore, yet.  One
day, probably.  However, SMT should never be worse overall, because SMT
_does_ offer some low-latency overlap between threads when one is waiting on
a memory access or on some pipeline/operand conflict to be resolved.  I.e.,
running Crafty I have measured SMT improvement between 10% and 30%.  Others
have measured from zero to 100% improvement.  I have not seen any negative
numbers, but I have not yet had time to try the RC5 test...

And finally, I don't believe that running SMT-enabled with two threads
should ever be significantly worse than running just one thread.  If it is,
then there are locking issues or shared-memory issues that ought to be
addressed in the algorithm.  Of course it is possible to write a parallel
algorithm that is much worse than the original serial algorithm.  But who
cares?

Did I cover _all_ of my previous statements that time, clearly?


>
>>>"Could it be slower in some?  Of course.  But then the algorithm(s) in question
>>>need work, obviously..."
>>
>>And that is certainly true...
>
>I can't believe you're saying this. Right from the quote above:
>
>"Unless we talk about the case of running two threads using two logical cpus
>being slower than running one thread on one real cpu.  I can see where _that_
>could cause speed issues in lots of ways, particularly with a parallel search."
>
>What did you mean by "a lot of ways" if the only possibility is algorithmic
>inefficiency?

Locks.  They freeze the bus.  They are slow.  They waste processor cycles
spinning, which can interfere with the other thread on an SMT processor.  In
the case of parallel search there are additional issues, including searching
extra nodes because of violating the sequential-property assumption of
alpha/beta search, hash-table conflicts, and so forth.  Search overhead is a
known algorithmic problem.  Locks and the like _can_ be worked around in
many cases:  data can be ordered in memory better, aligned better, and
grouped better.  Etc.
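
For what it's worth, the standard mitigation for spin-waste on SMT parts is
to drop a pause into the spin loop so the spinning logical processor hands
its issue slots back to its sibling.  A minimal modern sketch (C11 atomics
plus the x86 _mm_pause intrinsic; initialize the flag with
ATOMIC_FLAG_INIT):

    /* spin.c -- test-and-set spinlock sketch with the "pause" hint
     * Intel recommends for hyperthreaded parts, so that a spinning
     * logical CPU stops starving its sibling of pipeline resources. */
    #include <stdatomic.h>
    #include <immintrin.h>              /* _mm_pause() */

    typedef struct { atomic_flag f; } spinlock;

    static void lock(spinlock *l)
    {
        while (atomic_flag_test_and_set_explicit(&l->f,
                                                 memory_order_acquire))
            _mm_pause();    /* yield pipeline resources while spinning */
    }

    static void unlock(spinlock *l)
    {
        atomic_flag_clear_explicit(&l->f, memory_order_release);
    }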


>
>>>The "pipes," as you so eloquently call them, involve all the buffers that Intel
>>>says are split, and the memory read/write buffers are also split. The only thing
>>>that may be duplicated, instead of split, is the rename register file. And
>>>really, if you're talking about how balanced the processors are, duplicated
>>>might as well be the same as split.
>>>
>>
>>The "pipes" I am talking about are the "functional units" that actually do
>>integer and floating point operations.  I don't see _any_ documentation that
>>suggests that they are evenly "split" between the logical cpus.  Just because
>>they "involve" the buffers don't mean that they are split evenly...  which
>>testing seems to verify.
>
>Ah, so after days of arguing, you finally come up with one single thing that
>might support your point.
>
>Execution units aren't considered part of the processor's resources by Intel,
>but if you want to talk about them, fine.
>


Then this becomes a semantics argument, because in the world of computing
"execution units" _are_ part of the CPU.  Crays have used that design since
the 1970s.




>"The memory instruction queue and general instruction queues send uops to the
>five scheduler queues as fast as they can, alternating between uops for the two
>logical processors every clock cycle, as needed."
>
>So the execution units are split 50-50 temporally.
>

Except that _no_ application really drives a processor at a 100% duty
cycle.  Most are way under 50%, which is the point of SMT in the first
place.  Operand conflicts, result stalls, memory accesses, and cache-line
fills all prevent the thing from issuing the maximum number of micro-ops
every cycle.  Otherwise SMT would be a failure from the get-go.
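
That is easy to check on a modern Linux box, assuming the perf tool is
installed ("your_app" is a placeholder):

    perf stat -e cycles,instructions ./your_app

If the instructions-per-cycle figure comes out well below the machine's
issue width, those empty issue slots are exactly what SMT exists to fill.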




>-Tom


