Computer Chess Club Archives

Subject: Re: But, Re: Questions re P4 3.03 with HT ??

Author: Robert Hyatt
Date: 06:57:23 12/11/02
On December 11, 2002 at 02:34:33, Matt Taylor wrote:

>On December 11, 2002 at 00:28:09, Robert Hyatt wrote:
>
>>On December 10, 2002 at 23:11:36, Matt Taylor wrote:
>>
>>>On December 10, 2002 at 22:54:45, Robert Hyatt wrote:
>>>
>>>>On December 10, 2002 at 21:19:18, Matt Taylor wrote:
>>>>
>>>>>On December 10, 2002 at 21:13:28, Robert Hyatt wrote:
>>>>>
>>>>>>On December 10, 2002 at 20:33:34, Jeremiah Penery wrote:
>>>>>>
>>>>>>>On December 10, 2002 at 20:18:16, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>On December 10, 2002 at 20:12:06, Jeremiah Penery wrote:
>>>>>>>>
>>>>>>>>>On December 10, 2002 at 20:00:11, Robert Hyatt wrote:
>>>>>>>>>
>>>>>>>>>>On December 10, 2002 at 16:43:29, Matt Taylor wrote:
>>>>>>>>>>
>>>>>>>>>>>They said that HT allows -concurrent- scheduling of threads, but the threads
>>>>>>>>>>>obviously cannot make use of the same execution resources. If this is correct,
>>>>>>>>>>>one thread would be spinning (consuming bandwidth to the L1 cache) while the
>>>>>>>>>>>other thread was doing real work.
>>>>>>>>>>
>>>>>>>>>>Again, think about what you just said, which is impossible to happen.  If one
>>>>>>>>>>thread is smoking the L1/L2 cache, then it is not waiting for _anything_ and
>>>>>>>>>>once it is scheduled it will execute until the cpu decides to flip to the other
>>>>>>>>>>thread.  Or until that thread does a pause.  Whichever comes first.
>>>>>>>>>
>>>>>>>>>The point is that the spinning thread blocks no execution units.  The processor
>>>>>>>>>can spin the idle thread all it wants, why should that stop it from scheduling
>>>>>>>>>the second thread, which _will_ use the execution units, to run at the same
>>>>>>>>>time?
>>>>>>>>
>>>>>>>>
>>>>>>>>I don't follow.  The "spinning thread" completely fills the integer pipe...
>>>>>>>
>>>>>>>Processors have more than one integer pipe, and I'm sure that a spinning thread
>>>>>>>doesn't fill more than one.  In a P4, which has dual-pumped ALUs, a spinning
>>>>>>>thread wouldn't even block a single pipe.  That is, if the scheduler were smart
>>>>>>>enough to schedule other thread(s) to fill that unit.
>>>>>>
>>>>>>Somehow we are not on the same page. A single tight compute-bound loop can
>>>>>>_completely_ fill one pipe by itself with _no_ problems.  The micro-ops
>>>>>>will simply stuff that pipe totally as every branch will be predicted
>>>>>>correctly...
>>>>>>
>>>>>>And if that thread is sucking up the cpu, the _other_ thread is going to
>>>>>>be hindered since it can probably use _everything_ in the CPU when it is
>>>>>>running...
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>The cpu doesn't execute two threads at a time, it flips and flops back and
>>>>>>>>forth between them.  The spinning thread will _never_ give up control and has
>>>>>>>>to be either preempted by the cpu, or else it has to do a pause, as explained
>>>>>>>>in the intel white-paper on the subject...
>>>>>>>>
>>>>>>>>Otherwise the pause would _not_ be needed...
>>>>>>>
>>>>>>>What's the point of hyper-threading if two threads don't run at the same time?
>>>>>>>Yeah, sure, you can execute while one thread waits on memory or something, but
>>>>>>>it's certainly not the most efficient use.  All the documentation I've seen
>>>>>>>suggests that if one thread is using, say, half the integer pipes, that another
>>>>>>>thread can be scheduled concurrently to use the other half of the pipes.
>>>>>>
>>>>>>
>>>>>>
>>>>>>What is the point in an operating system for executing two processes at the
>>>>>>same time?  Because one blocks and the other uses those unused cycles.  That
>>>>>>is the _only_ point of running more than one process at a time.  That is the
>>>>>>only point for hyper-threading also.  It has just moved a bit of the process
>>>>>>scheduling down into the CPU.  The OS feeds the CPU two candidate processes
>>>>>>to "interleave" and the CPU does that at the hardware level, more efficiently.
>>>>>>
>>>>>>As far as sharing pipes, that can happen.  But if one thread is burning one
>>>>>>pipe up doing useless work, that is lost cycles that the other thread can't
>>>>>>get to.  Which is _the_ point for the "pause" instruction...
>>>>>
>>>>>The integer pipe feeds into 5 integer execution units which can be accessed
>>>>>concurrently each cycle. However, a spin-wait loop will only be able to use 1
>>>>>unit because of register dependencies.
>>>>
>>>>
>>>>Not necessarily.  Look at "ThreadWait()" in Crafty.  It is a more complicated
>>>>"spin wait" that is testing several things in the same loop...  but
>>>>regardless of whether it is one execution unit busy or two or three, it does
>>>>_not_ matter.  That is one execution unit that the other thread can't get
>>>>to, which is the point for the "pause" instruction.
>>>>
>>>>Otherwise the "pause" is pointless.  Why do you think they implemented that?
>>>>And why do you think they wrote a 7-8 page paper describing how to do
>>>>spinlocks and spinwaits using the pause instruction?
>>>
>>>Here are the first two paragraphs on the pause instruction from the P4 manual. I
>>>did not continue past that because the manual digresses from function and talks
>>>about compatibility, exceptions, pseudo-code, etc.
>>>
>>>IA-32 Intel Architecture Software Developer's Manual Vol. 2: Instruction Set
>>>Reference
>>>Order 245471-006
>>>
>>>Page 586/966: Pause -- Spin Loop Hint
>>>
>>>Improves the performance of spin-wait loops. When executing a "spin-wait loop,"
>>>a Pentium 4 or Intel Xeon processor suffers a severe performance penalty when
>>>exiting the loop because it detects a possible memory order violation. The pause
>>>instruction provides a hint to the processor that the code sequence is a
>>>spin-wait loop. The processor uses this hint to avoid the memory order violation
>>>in most situations, which greatly improves processor performance. For this
>>>reason, it is recommended that a pause instruction be placed in all spin-wait
>>>loops.
>>>
>>>An additional function of the pause instruction is to reduce the power consumed
>>>by a Pentium 4 processor while executing a spin loop. The Pentium 4 processor
>>>can execute a spin-wait loop extremely quickly, causing the processor to consume
>>>a lot of power while it waits for the resource it is spinning on to become
>>>available. Inserting a pause instruction in a spin-wait loop greatly reduces the
>>>processor's power consumption...
>>
>>
>>That is prior to SMT.  The speculative execution can load up multiple iterations
>>of a spin-lock into the pipe and that causes problems since the CPU can do out-
>>of-order writes and that can lead to errors when fiddling with cache lines.  The
>>pause prevents more than one iteration from entering the pipe, avoiding that problem.
>
>No, it's not. Intel was talking about SMT before the Pentium 4 was even being
>sold. The pause instruction was probably implemented prior to SMT, but the
>fxsave and fxrstor opcodes (the precursors to SSE) were implemented prior to
>SSE's introduction, likely for the same reason. The fxsave and fxrstor opcodes were
>implemented so that the framework for SSE would be possible by the time Intel
>implemented SSE.

I am talking about the "explanation" they give.  It sounds like it was written
prior to SMT.  And the part about burning less power is off-the-wall IMHO: if
there is only one thread running, the pause doesn't have much of an effect on
power dissipation since that thread keeps right on running with very small
delays factored in for each pause to prevent the recursive loop problem.
>
>>But for hyper-threading, it does more.  If you go to intel and search for
>>hyper-threading you can find at least a couple of papers that discuss the
>>spin-lock with hyper-threading issue in detail.
>>
>>I don't know that the power-consumption thing is true as that is what is
>>recommended as the reason for using a "halt" to stop a thread from buzzing
>>a cpu (logical cpu) when possible, although a normal user has to resort to
>>"pause" as a second-best option.  This from the "long spin-wait and hyper-
>>threading" article on the intel developer's site.
>
>The OS normally executes the hlt instruction in the idle thread, so it wouldn't
>be terribly useful for the user.

Not anywhere that _I_ know of.  It simply context-switches to _another_ process
that is ready to run, or in most cases it runs a special "idle-loop" process
that just burns cpu cycles so that they are accounted for at the end of the day
in the accounting statistics.

The problem here, however, is that the process is not "idle", of course.  It is
in a tight while() loop waiting on a shared variable to be modified by another
thread.
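
A minimal sketch of that kind of spin-wait, assuming GCC or Clang, C11 atomics
and POSIX threads. The names cpu_relax, ready, payload and spin_wait_demo are
hypothetical and are not taken from Crafty's ThreadWait(); the "pause" hint is
only emitted on x86 builds and compiles to nothing elsewhere.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical helper: issue the x86 "pause" spin-loop hint when the
   compiler and target support it, otherwise do nothing. */
static inline void cpu_relax(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_ia32_pause();
#endif
}

static atomic_int ready = 0;  /* shared flag the waiting thread spins on */
static int payload = 0;       /* data published before the flag is set */

/* The "other thread": publish a value, then set the flag. */
static void *producer(void *arg)
{
    (void)arg;
    payload = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Spin in a tight loop until the flag changes, pausing on each failed
   check. Returns the published value, or -1 if no thread could be created. */
int spin_wait_demo(void)
{
    pthread_t tid;

    atomic_store(&ready, 0);
    payload = 0;
    if (pthread_create(&tid, NULL, producer, NULL) != 0)
        return -1;

    while (!atomic_load_explicit(&ready, memory_order_acquire))
        cpu_relax();

    int seen = payload;  /* safe to read: the acquire load saw the release store */
    pthread_join(tid, NULL);
    return seen;
}
```

Calling spin_wait_demo() from main() returns once the producer has set the
flag. The waiting thread is never "idle" from the scheduler's point of view; it
only hints to the CPU on each pass through the loop.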

> Also, hlt "blocks" until an interrupt is issued
>or the RESET line is triggered. The whole point of the pause instruction is that
>it blocks for a limited number of cycles.

I don't read it as being "limited".  It prevents recursive speculative execution
of a spinloop, to avoid the out-of-order write problem.  But it also causes that
thread to "step out of the way of the other thread, assuming it is doing
something useful."

In POSIX threads, we have a function sched_yield() that does _exactly_ the same
thing but at the O/S level rather than in the hardware for the two threads the
CPU is managing.
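
For comparison, a sketch of the same wait done with sched_yield(), assuming C11
atomics and POSIX threads. The names work_done, result and yield_wait_demo are
hypothetical. Here the waiting thread offers the processor back to the O/S
scheduler on every failed check instead of hinting to the hardware.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static atomic_int work_done = 0;  /* set by the worker when it has finished */
static int result = 0;            /* value the worker publishes */

/* The thread doing something useful. */
static void *worker(void *arg)
{
    (void)arg;
    result = 7 * 6;
    atomic_store_explicit(&work_done, 1, memory_order_release);
    return NULL;
}

/* Wait for the worker, yielding to the O/S scheduler between checks.
   Returns the published value, or -1 if no thread could be created. */
int yield_wait_demo(void)
{
    pthread_t tid;

    atomic_store(&work_done, 0);
    result = 0;
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;

    while (!atomic_load_explicit(&work_done, memory_order_acquire))
        sched_yield();  /* let another runnable thread have the processor */

    int seen = result;
    pthread_join(tid, NULL);
    return seen;
}
```

Whether the yield actually hands the processor to the worker depends on the
scheduler and on what else is runnable, so this is an illustration of the call
rather than a performance recommendation.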

>
>Eugene's explanation fits, though. I am surprised that Intel did not duplicate
>the trace cache for both logical CPUs. It's like trying to fit an even bigger
>peg into an already too small hole...
>
>-Matt
Last modified: Thu, 15 Apr 21 08:11:13 -0700