Computer Chess Club Archives



Subject: Re: Magic 200MHz

Author: Tom Kerrigan

Date: 17:04:28 05/24/03



On May 24, 2003 at 17:54:37, Robert Hyatt wrote:

>On May 24, 2003 at 16:23:26, Tom Kerrigan wrote:
>
>>On May 24, 2003 at 01:12:34, Robert Hyatt wrote:
>>>On May 23, 2003 at 23:45:09, Tom Kerrigan wrote:
>>>>First of all, okay, sure, let's say you're right and only SOME of the resources
>>>>are split. Even if only the write combine buffers are split, and you have a
>>>>program that works great with 4 buffers but starts "thrashing" with 3 buffers,
>>>>don't you see how that would cause the program to run inordinately slow with HT
>>>>on? Or if the processor can extract great parallelism from the instruction
>>>>stream with an n entry reorder window but very little parallelism with an n/2
>>>>window?
>>>
>>>Back to _real_ data.  I run crafty twice.  I get a different level of
>>>performance than if I run crafty _once_ using two threads.  Yet both have
>>>the same instruction mix.  Locks are infrequently used so that isn't the
>>>problem.  However, cache coherency _is_ an issue and is most likely at
>>>the bottom of this mess for my case.  Invalidating whole lines of cache
>>>is worse when a line is 128 bytes than when it is only 32 bytes.  Whether
>>>that is the problem or not is not yet proven, just a pretty well-thought-out
>>>"hunch".
>>
>>Try to think it out more. How could the cache prefer one thread over the other?
>>I don't see how this is possible with any reasonable design. It's easy enough to
>>test by writing a simple program, so why don't you do that? And anyway, this
>>STILL doesn't address my point, which is how HT can cause performance to
>>degrade.
>
>Why don't you "try to think it out?"
>
>"cache coherency" has _nothing_ to do with "cache favoring one thread over
>another."  It has _everything_ to do with cache lines getting invalidated
>which throws out 128 bytes on PIVs as opposed to 32 bytes on PIIIs.
>
>I don't _need_ to write a test program.  I already _have_ one that is
>causing the problem...

First of all, what in the WORLD do P3s have to do with ANYTHING?

Second, you're right, I didn't think it out. Since both logical processors use
the same caches, there IS no cache coherency problem. In fact, cache coherency
is completely unrelated to hyperthreading, isn't it? I don't even know why you
would use the two terms in the same sentence. But here you are, using "cache
coherency" as an explanation for one logical processor running a thread faster
than the other. ("Well-thought-out hunch," sure.)

>>No, there's a difference between HT being enabled and HT being active. If only
>>one thread is being run, all of the CPU's resources are "merged" back together.
>>The Intel slides indicate this and I wrote as much in reply to Eugene's post.
>
>That makes _zero_ sense.  Define "idle"?  Typically, in any O/S I know of,
>all processors are _always_ executing an instruction stream.  Even if it is
>just the process scheduler "idle loop".
>
>So that makes _zero_ sense as a suggested cause.

You're pretty smug for a guy who has absolutely NO idea what he's talking about.

http://developer.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p09_task_modes.htm

"On a processor with Hyper-Threading Technology, executing HALT transitions the
processor from MT-mode to ST0- or ST1-mode, depending on which logical processor
executed the HALT. For example, if logical processor 0 executes HALT, only
logical processor 1 would be active; the physical processor would be in ST1-mode
and partitioned resources would be recombined giving logical processor 1 full
use of all processor resources. If the remaining active logical processor also
executes HALT, the physical processor would then be able to go to a lower-power
mode."

>I have concluded that _most_ of the resources are dynamically allocated between
>the two logical processors "as needed".  That seems to fit all the discussion
>in comp.sys.* for the past three years...

Again, you have no clue.

http://developer.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p05_front_end.htm

"After uops are fetched from the trace cache or the Microcode ROM, or forwarded
from the instruction decode logic, they are placed in a "uop queue." This queue
decouples the Front End from the Out-of-order Execution Engine in the pipeline
flow. The uop queue is partitioned such that each logical processor has half the
entries."

"The out-of-order execution engine has several buffers to perform its
re-ordering, tracing, and sequencing operations. ... Some of these key buffers
are partitioned such that each logical processor can use at most half the
entries. Specifically, each logical processor can use up to a maximum of 63
re-order buffer entries, 24 load buffers, and 12 store buffer entries."

Of course, I'm sure you already knew all of this, what with having read most
everything on Intel's web site.

BTW, I assume you're doing your spin waits with the PAUSE instruction, otherwise
your HT performance will be all jacked up when using threads:

http://developer.intel.com/technology/hyperthread/intro_nexgen/sld025.htm

-Tom




Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.