Author: Robert Hyatt
Date: 22:12:34 05/23/03
On May 23, 2003 at 23:45:09, Tom Kerrigan wrote:
>On May 23, 2003 at 22:56:43, Robert Hyatt wrote:
>
>>On May 23, 2003 at 02:50:41, Tom Kerrigan wrote:
>>
>>>On May 22, 2003 at 22:24:29, Robert Hyatt wrote:
>>>
>>>>On May 22, 2003 at 13:43:55, Tom Kerrigan wrote:
>>>>
>>>>>On May 21, 2003 at 22:20:57, Robert Hyatt wrote:
>>>>>
>>>>>>On May 21, 2003 at 15:48:46, Tom Kerrigan wrote:
>>>>>>
>>>>>>>On May 21, 2003 at 13:46:26, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>On May 20, 2003 at 13:52:01, Tom Kerrigan wrote:
>>>>>>>>
>>>>>>>>>On May 20, 2003 at 00:26:49, Robert Hyatt wrote:
>>>>>>>>>
>>>>>>>>>>Actually it _does_ surprise me. The basic idea is that HT provides improved
>>>>>>>>>>resource utilization within the CPU. IE would you prefer to have a dual 600mhz
>>>>>>>>>>or a single 1000mhz machine? I'd generally prefer the dual 600, although for
>>>>>>>>>
>>>>>>>>>You're oversimplifying HT. When HT is running two threads, each thread only gets
>>>>>>>>>half of the core's resources. So instead of your 1GHz vs. dual 600MHz situation,
>>>>>>>>>what you have is more like a 1GHz Pentium 4 vs. a dual 1GHz Pentium. The dual
>>>>>>>>>will usually be faster, but in many cases it will be slower, sometimes by a wide
>>>>>>>>>margin.
>>>>>>>>
>>>>>>>>Not quite. Otherwise how do you explain my NPS _increase_ when using a second
>>>>>>>>thread on a single physical cpu?
>>>>>>>>
>>>>>>>>The issue is that now things can be overlapped and more of the CPU core
>>>>>>>>gets utilized for a greater percent of the total run-time...
>>>>>>>>
>>>>>>>>If it were just 50-50 then there would be _zero_ improvement for perfect
>>>>>>>>algorithms, and a negative improvement for any algorithm with any overhead
>>>>>>>>whatsoever...
>>>>>>>>
>>>>>>>>And the 50-50 doesn't even hold true for all cases, as my test results have
>>>>>>>>shown, even though I have yet to find any reason for what is going on...
>>>>>>>
>>>>>>>Think a little bit before posting, Bob. I said that the chip's execution
>>>>>>>resources were evenly split, I didn't say that the chip's performance is evenly
>>>>>>>split. That's just stupid. You have to figure in how those execution resources
>>>>>>>are utilized and understand that adding more of these resources gives you
>>>>>>>diminishing returns.
>>>>>>>
>>>>>>>-Tom
>>>>>>
>>>>>>
>>>>>>You should follow your own advice. If resources are split "50-50" then how
>>>>>>can _my_ program produce a 70-30 split on occasion?
>>>>>>
>>>>>>It simply is _not_ possible.
>>>>>>
>>>>>>There is more to this than a simple explanation offers...
>>>>>
>>>>>Now you're getting off onto another topic here.
>>>>>
>>>>
>>>>Read backward. _I_ did not "change the topic".
>>>>
>>>>I said that I don't see how it is possible for HT to slow a program down.
>>>>
>>>>You said "50-50" resource allocation might be an explanation.
>>>>
>>>>I said "that doesn't seem plausible because I have at least one example of
>>>>two compute-bound threads that don't show a 50-50 balance on SMT."
>>>
>>>I said it before and I'll say it again, a 50-50 _core_ resource split does not
>>>mean a 50-50 performance split. Again, you have to account for how those
>>>resources are utilized. Anybody who's passed the first semester of comp arch
>>>should be able to grasp this immediately.
>>
>>You should be able to grasp this: I am running _exactly_ the same program
>>on _both_ processors. And when I say "exactly" the same I mean _exactly the
>>same_. In fact, I am using the _same_ virtual address space on _both_ logical
>>processors.
>>
>>So your reasoning simply doesn't fly in this case. If the resource units are
>>split and are both running the _same_ identical instruction stream, the
>>performance should be exactly split as well. But in my case, it isn't.
>>
>>There is another explanation... Somewhere...
>
>Again, it seems like you're back to your stupid 70-30 problem.
>
>We can deal with this in a sec, let's get back to the actual point, which is
>programs slowing down, or not slowing down, with HT turned on.
>
>First of all, okay, sure, let's say you're right and only SOME of the resources
>are split. Even if only the write combine buffers are split, and you have a
>program that works great with 4 buffers but starts "thrashing" with 3 buffers,
>don't you see how that would cause the program to run inordinately slow with HT
>on? Or if the processor can extract great parallelism from the instruction
>stream with an n entry reorder window but very little parallelism with an n/2
>window?
Back to _real_ data. I run Crafty twice and get a different level of
performance than if I run Crafty _once_ using two threads, yet both have
the same instruction mix. Locks are used infrequently, so they aren't the
problem. However, cache coherency _is_ an issue and is most likely at the
bottom of this mess in my case. Invalidating a whole cache line is worse
when the line is 128 bytes than when it is only 32 bytes. Whether that is
the actual problem is not yet proven; it's just a pretty well-thought-out
"hunch".
Now if you think that Intel really will take half of the physical CPU resources
and leave them idle when only one logical processor is working, then I suppose
your explanation might be valid. However, that would make it a bad design. And
since I have yet to see this happen on any of the SMT boxes we have, I have not
yet "bought" the idea that turning SMT on is bad for some programs, since I
can't reproduce it in any shape or form, on a Windows box or on a Linux box.
That doesn't say it _can't_ be reproduced, only that the applications I have
tried will not produce it. That's all I can claim, but I can claim it with
100% reliability and provability.
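For what it's worth, here is roughly the kind of test one could run on a
Linux box to separate the "two logical processors on one physical CPU" case
from the "two physical CPUs" case. It assumes logical CPUs 0 and 1 are HT
siblings (that mapping varies by machine) and a glibc that exposes
sched_setaffinity(); treat it as a sketch, not a benchmark.

    /* gcc -O2 -pthread pin.c */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static void *work(void *arg) {
        int cpu = *(int *)arg;
        long i;
        volatile double x = 1.0;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);  /* bind this thread */
        for (i = 0; i < 500000000L; i++)          /* dummy compute load */
            x = x * 1.000001 + 0.1;
        return NULL;
    }

    int main(void) {
        int cpus[2] = {0, 1};  /* change to {0, 2} for two physical CPUs */
        pthread_t t[2];
        int i;
        for (i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, work, &cpus[i]);
        for (i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        return 0;
    }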
>
>Put in terms you might be able to understand, take a system with 512MB RAM. Run
>Crafty on it and set the hash table to 256MB. Runs great, right? Now run another
>copy with a 256MB hash table. Hmm, doesn't run so great, does it?
What does this have to do with the question??? It actually might not run that
badly, btw...
>
>As for your 70-30 problem, you are not running _exactly_ the same program on
>both logical processors. Remember, you did that and the performance was split
>exactly 50-50. Your problem is when you start doing threads. That is NOT running
>_exactly_ the same program. E.g., if one thread is spinning, waiting for a lock,
>how is that doing exactly the same thing as the other thread?
First, spins are less than 0.1% of the total execution time, so _that_ will
not account for this variability. Second, by that definition _no_ example
can be given of two logical processors running _exactly_ the same thing,
since they will hardly _ever_ be at the same point in the instruction
stream, which makes this entire point moot.
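For reference, the kind of spin loop in question looks roughly like this.
It's a sketch using GCC atomic builtins as a stand-in for the hand-coded
asm a real program would use; the "rep; nop" is the PAUSE instruction Intel
recommends so that a spinning logical processor hands its execution
resources to its sibling instead of burning them.

    #include <stdio.h>

    static volatile int lock_var = 0;

    static void acquire(volatile int *lock) {
        while (__sync_lock_test_and_set(lock, 1)) {
            /* spin read-only until the lock looks free; PAUSE keeps
               the spin from starving the other logical processor */
            while (*lock)
                __asm__ __volatile__("rep; nop");
        }
    }

    static void release(volatile int *lock) {
        __sync_lock_release(lock);
    }

    int main(void) {
        acquire(&lock_var);
        puts("in critical section");
        release(&lock_var);
        return 0;
    }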
I believe that the explanation is simpler, and has to do with cache
coherency.
>
>>>Complete bull. This design is no secret--Intel wants everybody to know exactly
>>>how HT works so they can optimize their software for it. This information is all
>>>over Intel's web pages and developer documentation. Links to said pages have
>>>been posted to this message board. It will only take YOU some time to figure out
>>>because your head seems to be stuck in the sand.
>>>
>>>-Tom
>>
>>Give me a link. I have read almost _everything_ on Intel's web site. And I
>>don't find key core descriptions of what is done _internally_...
>
>I don't feel like doing extra work for you, so I just did a 2 second Google
>search ("xeon hyperthreading split reorder") and found this page from Intel
>presentations:
>
>http://www.extremetech.com/print_article/0,3998,a=16756,00.asp
>
>The slide in the middle ("Thread-Selection Points") clearly shows what's split in
>half: queue, rename, decode, and retire. The schedule, reg read, execute, and
>reg write steps use a toggle that will switch between threads each clock tick if
>data from two threads is ready. Caches are not split; the reason should be
>obvious.
>
>-Tom
As for the above, I haven't seen Intel say that the "rename registers" are
split right down the middle. The first explanation I saw said quite the
opposite, in fact.