Author: Robert Hyatt
Date: 14:54:37 05/24/03
On May 24, 2003 at 16:23:26, Tom Kerrigan wrote:
>On May 24, 2003 at 01:12:34, Robert Hyatt wrote:
>>On May 23, 2003 at 23:45:09, Tom Kerrigan wrote:
>>>First of all, okay, sure, let's say you're right and only SOME of the resources
>>>are split. Even if only the write combine buffers are split, and you have a
>>>program that works great with 4 buffers but starts "thrashing" with 3 buffers,
>>>don't you see how that would cause the program to run inordinately slow with HT
>>>on? Or if the processor can extract great parallelism from the instruction
>>>stream with an n entry reorder window but very little parallelism with an n/2
>>>window?
>>
>>Back to _real_ data. I run crafty twice. I get a different level of
>>performance than if I run crafty _once_ using two threads. Yet both have
>>the same instruction mix. Locks are infrequently used so that isn't the
>>problem. However, cache coherency _is_ an issue and is most likely at
>>the bottom of this mess for my case. Invalidating whole lines of cache
>>is worse when a line is 128 bytes than when it is only 32 bytes. Whether
>>that is the problem or not is not yet proven, just a pretty well-thought-out
>>"hunch".
>
>Try to think it out more. How could the cache prefer one thread over the other?
>I don't see how this is possible with any reasonable design. It's easy enough to
>test by writing a simple program, so why don't you do that? And anyway, this
>STILL doesn't address my point, which is how HT can cause performance to
>degrade.
Why don't you "try to think it out"?
"cache coherency" has _nothing_ to do with "cache favoring one thread over
another." It has _everything_ to do with cache lines getting invalidated
which throws out 128 bytes on PIVs as opposed to 32 bytes on PIIIs.
I don't _need_ to write a test program. I already _have_ one that is
causing the problem...
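To make the line-size point concrete, here is a small sketch that models only the address arithmetic. It checks whether two per-thread variables placed 64 bytes apart fall in the same cache line. The offsets and helper names are hypothetical; the 32- and 128-byte line sizes are the PIII and PIV figures given above. With 32-byte lines the two variables sit in separate lines. With 128-byte lines they share one, so a write by either thread invalidates the whole line in any other cache holding it, even though neither thread touches the other's data.

```python
# Hypothetical layout: thread A updates a variable at byte offset 0 and
# thread B updates a different variable at byte offset 64 of the same
# structure, which is assumed to start on a cache-line boundary.
OFFSET_A = 0
OFFSET_B = 64


def line_index(offset: int, line_size: int) -> int:
    """Index of the cache line that holds byte `offset`."""
    return offset // line_size


def shares_line(a: int, b: int, line_size: int) -> bool:
    """True if two distinct variables land in the same cache line, so a
    write to either one invalidates the whole line in any other cache."""
    return line_index(a, line_size) == line_index(b, line_size)


def padded_size(nbytes: int, line_size: int) -> int:
    """Round a per-thread block up to a whole number of lines so that
    consecutive blocks never share a line."""
    return -(-nbytes // line_size) * line_size


for line_size in (32, 128):  # PIII vs PIV line sizes as quoted in the post
    verdict = "yes" if shares_line(OFFSET_A, OFFSET_B, line_size) else "no"
    print(f"{line_size:3d}-byte lines: variables share a line? {verdict}")
```

Padding each thread's data out to a full line (here `padded_size(8, 128)` gives 128) is the usual way to keep per-thread data from sharing a line, at the cost of memory; the larger the line, the more padding is needed.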
>
>>Now if you think that Intel really will take 1/2 of the physical CPU resources
>>and leave them idle when only one logical processor is working, then I suppose
>>your explanation might be valid. However, that would make it a bad design.
>
>No, there's a difference between HT being enabled and HT being active. If only
>one thread is being run, all of the CPU's resources are "merged" back together.
>The Intel slides indicate this and I wrote as much in reply to Eugene's post.
That makes _zero_ sense. Define "idle"? Typically, in any O/S I know of,
all processors are _always_ executing an instruction stream. Even if it is
just the process scheduler "idle loop".
So that makes _zero_ sense as a suggested cause.
>
>>>Put in terms you might be able to understand, take a system with 512MB RAM. Run
>>>Crafty on it and set the hash table to 256MB. Runs great, right? Now run another
>>>copy with a 256MB hash table. Hmm, doesn't run so great, does it?
>>
>>What does this have to do with the question??? It actually might not run that
>>badly, btw...
>
>You have certain resources. A program uses > 50% of those resources. When given
>only 50% of the resources, it runs like crap.
>
>Sort of like how a program can use > 50% of a CPU's write combine buffers, and
>runs like crap when it's limited to 50%.
>
>Get it?
Only if that is the way things are done, which doesn't seem to be the
case...
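The disagreement here is whether a resource such as the write-combine buffers is statically split in half while both logical processors are active, or handed out as needed. The toy model below assumes a pool of 4 entries and a program that needs 3 of them (more than 50%, as in the example above). The pool size, the function names, and the simplification that the other thread's usage is fixed are all illustrative assumptions, not Intel's actual allocation logic. An `other_demand` of 0 stands for the other logical processor being idle, which is exactly the case disputed above.

```python
TOTAL = 4  # hypothetical pool size, e.g. write-combine buffers


def granted(policy: str, demand: int, other_demand: int, total: int = TOTAL) -> int:
    """Entries one logical processor actually gets.

    static:  while the other logical processor is active, each side is
             capped at half the pool; if it is idle, the full pool is used.
    dynamic: whatever the other side is not currently holding is available
             (simplification: the other side's usage is taken as fixed).
    """
    if policy == "static":
        cap = total // 2 if other_demand > 0 else total
    elif policy == "dynamic":
        cap = max(total - other_demand, 0)
    else:
        raise ValueError(f"unknown policy: {policy}")
    return min(demand, cap)


def thrashes(policy: str, demand: int, other_demand: int) -> bool:
    """In this model a program 'thrashes' when it gets fewer entries than it needs."""
    return granted(policy, demand, other_demand) < demand


# A program needing 3 of 4 entries, next to an idle or a light (1-entry) thread.
for policy in ("static", "dynamic"):
    for other in (0, 1):
        outcome = "thrashes" if thrashes(policy, 3, other) else "fits"
        print(f"{policy:7s} other thread needs {other} -> {outcome}")
```

Under static partitioning the program only suffers when the other logical processor is active; under dynamic sharing it still fits alongside a light thread. Which of these matches the real hardware is the open question in this exchange.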
>
>>>http://www.extremetech.com/print_article/0,3998,a=16756,00.asp
>>>
>>>The slide in the middle ("Thread-Selection Points") clearly shows what's split in
>>>half: queue, rename, decode, and retire. The schedule, reg read, execute, and
>>>reg write steps use a toggle that will switch between threads each clock tick if
>>>data from two threads is ready. Caches are not split; the reason should be
>>>obvious.
>>>
>>>-Tom
>>
>>As far as the above, I haven't seen Intel say that the "rename registers" are
>>split right down the middle. The first explanation I saw was quite the
>>opposite in fact.
>
>I guess it's kind of ambiguous from the slide. Really, I don't care. You seem to
>have conceded that most of the resources are split, which is fine with me.
I have concluded that _most_ of the resources are dynamically allocated between
the two logical processors "as needed". That seems to fit all the discussion
in comp.sys.* for the past three years...
>
>-Tom
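The slide's description of the schedule, register-read, execute and register-write stages (a toggle that switches between threads each clock when both have ready work) can be sketched as a simple two-way round-robin arbiter. This is a behavioral model under stated assumptions, not Intel's circuit: the function names, the trace format, and the choice to let a lone ready thread issue every cycle are illustrative. This per-cycle selection is a separate question from whether the queues feeding those stages are split or dynamically shared.

```python
from typing import List, Optional, Tuple


def select_thread(ready0: bool, ready1: bool, last: int) -> Optional[int]:
    """Pick which logical processor (0 or 1) uses the stage this cycle.

    Alternate when both are ready; otherwise take whichever one is ready.
    Returns None when neither thread has ready work.
    """
    if ready0 and ready1:
        return 1 - last  # toggle away from the thread that went last
    if ready0:
        return 0
    if ready1:
        return 1
    return None


def run(trace: List[Tuple[bool, bool]]) -> List[Optional[int]]:
    """Apply the arbiter to a per-cycle trace of (ready0, ready1) flags."""
    last = 1  # assumption: thread 0 wins the first tie
    picks: List[Optional[int]] = []
    for ready0, ready1 in trace:
        choice = select_thread(ready0, ready1, last)
        picks.append(choice)
        if choice is not None:
            last = choice
    return picks


print(run([(True, True), (True, True), (True, False), (True, True), (False, False)]))
```

In this model a thread that is the only one with ready work gets the stage every cycle, so alternation only costs throughput when both threads are competing.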