Computer Chess Club Archives



Subject: Re: Magic 200MHz

Author: Robert Hyatt

Date: 10:54:44 05/26/03



On May 24, 2003 at 20:04:28, Tom Kerrigan wrote:

>On May 24, 2003 at 17:54:37, Robert Hyatt wrote:
>
>>On May 24, 2003 at 16:23:26, Tom Kerrigan wrote:
>>
>>>On May 24, 2003 at 01:12:34, Robert Hyatt wrote:
>>>>On May 23, 2003 at 23:45:09, Tom Kerrigan wrote:
>>>>>First of all, okay, sure, let's say you're right and only SOME of the resources
>>>>>are split. Even if only the write combine buffers are split, and you have a
>>>>>program that works great with 4 buffers but starts "thrashing" with 3 buffers,
>>>>>don't you see how that would cause the program to run inordinately slow with HT
>>>>>on? Or if the processor can extract great parallelism from the instruction
>>>>>stream with an n entry reorder window but very little parallelism with an n/2
>>>>>window?
>>>>
>>>>Back to _real_ data.  I run crafty twice.  I get a different level of
>>>>performance than if I run crafty _once_ using two threads.  Yet both have
>>>>the same instruction mix.  Locks are infrequently used so that isn't the
>>>>problem.  However, cache coherency _is_ an issue and is most likely at
>>>>the bottom of this mess for my case.  Invalidating whole lines of cache
>>>>is worse when a line is 128 bytes than when it is only 32 bytes.  Whether
>>>>that is the problem or not is not yet proven, just a pretty well-thought-out
>>>>"hunch".
>>>
>>>Try to think it out more. How could the cache prefer one thread over the other?
>>>I don't see how this is possible with any reasonable design. It's easy enough to
>>>test by writing a simple program, so why don't you do that? And anyway, this
>>>STILL doesn't address my point, which is how HT can cause performance to
>>>degrade.
>>
>>Why don't you "try to think it out?"
>>
>>"cache coherency" has _nothing_ to do with "cache favoring one thread over
>>another."  It has _everything_ to do with cache lines getting invalidated
>>which throws out 128 bytes on PIVs as opposed to 32 bytes on PIIIs.
>>
>>I don't _need_ to write a test program.  I already _have_ one that is
>>causing the problem...
>
>First of all, what in the WORLD do P3s have to do with ANYTHING?

Cache line size, as I explained.  PIII = 32 bytes, PIV = 128 bytes.  That is the
only significant difference between the two that I can find, yet the PIV shows
much worse two-cpu performance with Crafty than the PIII does.  Both behave
similarly if I run two separate instances of Crafty, as I said.


>
>Second, you're right, I didn't think it out. Since both logical processors use
>the same caches, there IS no cache coherency problem. In fact, cache coherency
>is completely unrelated to hyperthreading, isn't it? I don't even know why you
>would use the two terms in the same sentence. But here you are, using "cache
>coherency" as an explanation for one logical processor running a thread faster
>than the other. ("Well-thought-out hunch," sure.)


I have talked about two problems.  One is the "unbalanced" hyperthreading
behavior when running Crafty.  As I carefully explained, I can run two threads
with SMT off, or four with SMT on.  I can't run a one-thread versus two-thread
test, because my machine has two physical processors and one can't be removed
without a terminator that I don't have.

Therefore, the cache coherency issue seems to be important: it is hurting
performance with two physical processors, and at the moment it is also the
only viable explanation I have for the unbalanced SMT performance.
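
If anyone wants to see the effect for themselves, here is a rough,
from-scratch sketch of the kind of false sharing I am describing.  It is
_not_ Crafty's data layout and the names are made up; it just puts two
counters on one cache line, then pads them onto separate lines.  On a
PIV's 128-byte lines the unpadded pair ping-pongs between the two
physical processors on every store; the padded pair does not.

  /* false-sharing sketch -- illustrative only, not Crafty code */
  #include <pthread.h>
  #include <stdio.h>
  #include <time.h>

  #define ITERATIONS 100000000L
  #define LINE_SIZE  128                     /* PIV; a PIII line is 32 */

  static volatile long plain[2];             /* adjacent: almost certainly
                                                share one 128-byte line   */
  struct padded {
    volatile long count;
    char pad[LINE_SIZE - sizeof(long)];      /* push next counter onto its
                                                own line                   */
  };
  static struct padded counters[2];

  static void *bump_plain(void *arg) {
    long i, id = (long) arg;
    for (i = 0; i < ITERATIONS; i++)
      plain[id]++;                           /* every store invalidates the
                                                line in the other CPU's cache */
    return 0;
  }

  static void *bump_padded(void *arg) {
    long i, id = (long) arg;
    for (i = 0; i < ITERATIONS; i++)
      counters[id].count++;                  /* each thread keeps its own line */
    return 0;
  }

  int main(void) {
    pthread_t t[2];
    long id;
    time_t start;

    start = time(0);
    for (id = 0; id < 2; id++)
      pthread_create(&t[id], 0, bump_plain, (void *) id);
    for (id = 0; id < 2; id++)
      pthread_join(t[id], 0);
    printf("shared line: %ld seconds\n", (long) (time(0) - start));

    start = time(0);
    for (id = 0; id < 2; id++)
      pthread_create(&t[id], 0, bump_padded, (void *) id);
    for (id = 0; id < 2; id++)
      pthread_join(t[id], 0);
    printf("padded:      %ld seconds\n", (long) (time(0) - start));
    return 0;
  }

Compile with -lpthread and compare the two times on a PIII box and a PIV
box; the wider the line, the worse the first loop should look relative to
the second.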




>
>>>No, there's a difference between HT being enabled and HT being active. If only
>>>one thread is being run, all of the CPU's resources are "merged" back together.
>>>The Intel slides indicate this and I wrote as much in reply to Eugene's post.
>>
>>That makes _zero_ sense.  Define "idle"?  Typically, in any O/S I know of,
>>all processors are _always_ executing an instruction stream.  Even if it is
>>just the process scheduler "idle loop".
>>
>>So that makes _zero_ sense as a suggested cause.
>
>You're pretty smug for a guy who has absolutely NO idea what he's talking about.

I know a _lot_ about what I am talking about.  Do you know what a "processor"
does when there is no process ready to schedule?

Didn't think so.


>
>http://developer.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p09_task_modes.htm
>
>"On a processor with Hyper-Threading Technology, executing HALT transitions the
>processor from MT-mode to ST0- or ST1-mode, depending on which logical processor
>executed the HALT. For example, if logical processor 0 executes HALT, only
>logical processor 1 would be active; the physical processor would be in ST1-mode
>and partitioned resources would be recombined giving logical processor 1 full
>use of all processor resources. If the remaining active logical processor also
>executes HALT, the physical processor would then be able to go to a lower-power
>mode."


And again, so what?  Who executes a "halt"?  Windows .NET Server _might_, but
no other O/S I have tested does...
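
To make the point explicit: what matters is whether the O/S idle loop
actually executes HLT or just spins looking for work.  Roughly (a made-up
sketch with stub functions, not code from any real kernel):

  /* illustrative contrast only -- not actual kernel source */
  static int  runqueue_nonempty(void) { return 0; }   /* stub */
  static void schedule(void)          { }             /* stub */

  static void idle_spin(void) {        /* what the O/Ses I have tested do  */
    for (;;)
      if (runqueue_nonempty())
        schedule();                    /* never HALTs, so the physical CPU
                                          stays in MT-mode with its buffers
                                          split between the two logicals   */
  }

  static void idle_halt(void) {        /* what the Intel paper assumes     */
    for (;;) {
      if (runqueue_nonempty())
        schedule();
      __asm__ __volatile__("sti; hlt"); /* ring 0 only: sleep until the next
                                           interrupt, letting the chip drop
                                           to ST-mode and recombine the
                                           partitioned resources            */
    }
  }

Unless the idle loop looks like the second version, "HT enabled but only
one thread running" does _not_ give that thread the whole processor.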



>
>>I have concluded that _most_ of the resources are dynamically allocated between
>>the two logical processors "as needed".  That seems to fit all the discussion
>>in comp.sys.* for the past three years...
>
>Again, you have no clue.
>
>http://developer.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p05_front_end.htm
>
>"After uops are fetched from the trace cache or the Microcode ROM, or forwarded
>from the instruction decode logic, they are placed in a "uop queue." This queue
>decouples the Front End from the Out-of-order Execution Engine in the pipeline
>flow. The uop queue is partitioned such that each logical processor has half the
>entries."
>
>"The out-of-order execution engine has several buffers to perform its
>re-ordering, tracing, and sequencing operations. ... Some of these key buffers
>are partitioned such that each logical processor can use at most half the
>entries. Specifically, each logical processor can use up to a maximum of 63
>re-order buffer entries, 24 load buffers, and 12 store buffer entries."
>
>Of course, I'm sure you already knew all of this, what with having read most
>everything on Intel's web site.

"some of these key buffers are..."

Very precise, as I mentioned.


>
>BTW, I assume you're doing your spin waits with the PAUSE instruction, otherwise
>your HT performance will be all jacked up when using threads:

I've been doing them that way since I got the dual Xeon, and I reported on
the PAUSE problem here early on.

So "yes" is the answer, as you could find in the lock.h assembly code in
Crafty...  Others have copied it...
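
For anyone who doesn't want to dig through lock.h, the idea looks roughly
like this.  This is a from-scratch sketch using GCC inline assembly on
x86, in the spirit of Crafty's lock but not copied from it:

  /* spin lock with PAUSE -- illustrative sketch, not Crafty's lock.h */
  typedef volatile int lock_t;

  static void acquire(lock_t *lock) {
    int old;
    do {
      while (*lock)                            /* spin read-only while held */
        __asm__ __volatile__("pause");         /* tell a hyperthreaded PIV
                                                  this is a spin wait, so it
                                                  yields shared resources to
                                                  the sibling logical CPU   */
      old = 1;
      __asm__ __volatile__("xchgl %0, %1"      /* atomic exchange (implicitly
                                                  locked on x86)            */
                           : "+r" (old), "+m" (*lock)
                           :
                           : "memory");
    } while (old);                             /* old == 1: someone beat us */
  }

  static void release(lock_t *lock) {
    __asm__ __volatile__("" ::: "memory");     /* compiler barrier */
    *lock = 0;                                 /* plain store suffices on x86 */
  }

  int main(void) {                             /* trivial usage example */
    static lock_t l = 0;
    acquire(&l);
    /* ... critical section ... */
    release(&l);
    return 0;
  }

Without the PAUSE, the spinning logical processor hammers the shared
execution resources, which is exactly the problem I reported here when I
first set the machine up.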


>
>http://developer.intel.com/technology/hyperthread/intro_nexgen/sld025.htm
>
>-Tom


