Computer Chess Club Archives


Subject: Re: Magic 200MHz

Author: Robert Hyatt

Date: 22:12:34 05/23/03


On May 23, 2003 at 23:45:09, Tom Kerrigan wrote:

>On May 23, 2003 at 22:56:43, Robert Hyatt wrote:
>
>>On May 23, 2003 at 02:50:41, Tom Kerrigan wrote:
>>
>>>On May 22, 2003 at 22:24:29, Robert Hyatt wrote:
>>>
>>>>On May 22, 2003 at 13:43:55, Tom Kerrigan wrote:
>>>>
>>>>>On May 21, 2003 at 22:20:57, Robert Hyatt wrote:
>>>>>
>>>>>>On May 21, 2003 at 15:48:46, Tom Kerrigan wrote:
>>>>>>
>>>>>>>On May 21, 2003 at 13:46:26, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>On May 20, 2003 at 13:52:01, Tom Kerrigan wrote:
>>>>>>>>
>>>>>>>>>On May 20, 2003 at 00:26:49, Robert Hyatt wrote:
>>>>>>>>>
>>>>>>>>>>Actually it _does_ surprise me.  The basic idea is that HT provides improved
>>>>>>>>>>resource utilization within the CPU.  I.e., would you prefer to have a dual
>>>>>>>>>>600MHz or a single 1000MHz machine?  I'd generally prefer the dual 600, although for
>>>>>>>>>
>>>>>>>>>You're oversimplifying HT. When HT is running two threads, each thread only gets
>>>>>>>>>half of the core's resources. So instead of your 1GHz vs. dual 600MHz situation,
>>>>>>>>>what you have is more like a 1GHz Pentium 4 vs. a dual 1GHz Pentium. The dual
>>>>>>>>>will usually be faster, but in many cases it will be slower, sometimes by a wide
>>>>>>>>>margin.
>>>>>>>>
>>>>>>>>Not quite.  Otherwise how do you explain my NPS _increase_ when using a second
>>>>>>>>thread on a single physical cpu?
>>>>>>>>
>>>>>>>>The issue is that now things can be overlapped and more of the CPU core
>>>>>>>>gets utilized for a greater percent of the total run-time...
>>>>>>>>
>>>>>>>>If it were just 50-50 then there would be _zero_ improvement for perfect
>>>>>>>>algorithms, and a negative improvement for any algorithm with any overhead
>>>>>>>>whatsoever...
>>>>>>>>
>>>>>>>>And the 50-50 doesn't even hold true for all cases, as my test results have
>>>>>>>>shown, even though I have yet to find any reason for what is going on...
>>>>>>>
>>>>>>>Think a little bit before posting, Bob. I said that the chip's execution
>>>>>>>resources were evenly split; I didn't say that the chip's performance is evenly
>>>>>>>split. That's just stupid. You have to figure in how those execution resources
>>>>>>>are utilized and understand that adding more of these resources gives you
>>>>>>>diminishing returns.
>>>>>>>
>>>>>>>-Tom
>>>>>>
>>>>>>
>>>>>>You should follow your own advice.  If resources are split "50-50" then how
>>>>>>can _my_ program produce a 70-30 split on occasion?
>>>>>>
>>>>>>It simply is _not_ possible.
>>>>>>
>>>>>>There is more to this than a simple explanation offers...
>>>>>
>>>>>Now you're getting off onto another topic here.
>>>>>
>>>>
>>>>Read backward.  _I_ did not "change the topic".
>>>>
>>>>I said that I don't see how it is possible for HT to slow a program down.
>>>>
>>>>You said "50-50" resource allocation might be an explanation.
>>>>
>>>>I said "that doesn't seem plausible because I have at least one example of
>>>>two compute-bound threads that don't show a 50-50 balance on SMT."
>>>
>>>I said it before and I'll say it again, a 50-50 _core_ resource split does not
>>>mean a 50-50 performance split. Again, you have to account for how those
>>>resources are utilized. Anybody who's passed the first semester of comp arch
>>>should be able to grasp this immediately.
>>
>>You should be able to grasp this:  I am running _exactly_ the same program
>>on _both_ processors.  And when I say "exactly" the same I mean _exactly the
>>same_.  In fact, I am using the _same_ virtual address space on _both_ logical
>>processors.
>>
>>So your reasoning simply doesn't fly in this case.  If the resource units are
>>split and are both running the _same_ identical instruction stream, the
>>performance should be exactly split as well.  But in my case, it isn't.
>>
>>There is another explanation...  Somewhere...
>
>Again, it seems like you're back to your stupid 70-30 problem.
>
>We can deal with this in a sec; let's get back to the actual point, which is
>programs slowing down, or not slowing down, with HT turned on.
>
>First of all, okay, sure, let's say you're right and only SOME of the resources
>are split. Even if only the write combine buffers are split, and you have a
>program that works great with 4 buffers but starts "thrashing" with 3 buffers,
>don't you see how that would cause the program to run inordinately slow with HT
>on? Or if the processor can extract great parallelism from the instruction
>stream with an n-entry reorder window but very little parallelism with an
>n/2-entry window?

Back to _real_ data.  I run crafty twice.  I get a different level of
performance than if I run crafty _once_ using two threads.  Yet both have
the same instruction mix.  Locks are used infrequently, so that isn't the
problem.  However, cache coherency _is_ an issue and is most likely at
the bottom of this mess in my case.  Invalidating whole lines of cache
is worse when a line is 128 bytes than when it is only 32 bytes.  Whether
that is the actual problem is not yet proven; it is just a pretty
well-thought-out "hunch".

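To make the cache-line point concrete, here is a minimal false-sharing
sketch (my illustration, not Crafty's actual code): two threads update
counters that happen to share one line, so every write invalidates the
line in the other processor's cache.  On a 128-byte line far more
unrelated data collides this way than on a 32-byte line.

#include <pthread.h>
#include <stdio.h>

#define LINE  128           /* assumed P4 cache-line/sector size */
#define ITERS 100000000UL

static struct {
    volatile unsigned long a;   /* main thread writes this       */
    volatile unsigned long b;   /* same line as a: false sharing */
    char pad[LINE];             /* push c onto its own line      */
    volatile unsigned long c;   /* no line shared with a         */
} __attribute__((aligned(LINE))) counters;

static void *bump_b(void *arg) {
    (void)arg;                  /* unused */
    for (unsigned long i = 0; i < ITERS; i++)
        counters.b++;           /* ping-pongs the shared line */
    return NULL;
}

static void *bump_c(void *arg) {
    (void)arg;                  /* unused */
    for (unsigned long i = 0; i < ITERS; i++)
        counters.c++;           /* stays exclusive in local cache */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, bump_b, NULL);  /* swap in bump_c to compare */
    for (unsigned long i = 0; i < ITERS; i++)
        counters.a++;
    pthread_join(t, NULL);
    printf("a=%lu b=%lu c=%lu\n", counters.a, counters.b, counters.c);
    return 0;
}

Timing the bump_b run against the bump_c run on a dual box shows the
gap; whether this is what is actually happening inside crafty is, as
said above, still just the hunch.
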
Now if you think that Intel really will take 1/2 of the physical CPU resources
and leave them idle when only one logical processor is working, then I suppose
your explanation might be valid.  However, that would make it a bad design.  And
since I have yet to see this happen on any of the SMT boxes we have, I have not
"bought" the idea yet that turning SMT on is bad for some programs, because I
can't reproduce it in any shape or form, on a Windows box or on a Linux box.
That doesn't say it _can't_ be reproduced, only that the applications I have
tried will not produce it.  That's all I can claim.  But I can claim that with 100%
reliability and provability.


>
>Put in terms you might be able to understand, take a system with 512MB RAM. Run
>Crafty on it and set the hash table to 256MB. Runs great, right? Now run another
>copy with a 256MB hash table. Hmm, doesn't run so great, does it?

What does this have to do with the question???  It actually might not run that
badly, btw...

>
>As for your 70-30 problem, you are not running _exactly_ the same program on
>both logical processors. Remember, you did that and the performance was split
>exactly 50-50. Your problem is when you start doing threads. That is NOT running
>_exactly_ the same program. E.g., if one thread is spinning, waiting for a lock,
>how is that doing exactly the same thing as the other thread?


First, spins are less than 0.1% of the total execution time, so _that_ will
not account for this variability.  Second, by that definition _no_ example
can be given of two logical processors running _exactly_ the same thing,
as they will hardly _ever_ be at the same point in the instruction stream,
which makes this entire point moot.
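
For what it's worth, the kind of spin in question, with the PAUSE hint
Intel's optimization manual recommends for spin-wait loops on HT parts,
looks roughly like this (a sketch, not Crafty's actual lock code):

#include <stdatomic.h>

typedef atomic_int spinlock_t;

static void spin_lock(spinlock_t *l) {
    while (atomic_exchange_explicit(l, 1, memory_order_acquire)) {
        /* Spin on plain reads so the cache line stays shared, and
           yield pipeline resources to the sibling logical processor. */
        while (atomic_load_explicit(l, memory_order_relaxed))
            __builtin_ia32_pause();   /* PAUSE, encoded as "rep; nop" */
    }
}

static void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(l, 0, memory_order_release);
}

Without the PAUSE, a spinning logical processor floods the shared
execution resources with speculative compare-and-branch traffic and
slows the sibling for as long as the spin lasts.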

I believe that the explanation is simpler, and has to do with cache
coherency.



>
>>>Complete bull. This design is no secret--Intel wants everybody to know exactly
>>>how HT works so they can optimize their software for it. This information is all
>>>over Intel's web pages and developer documentation. Links to said pages have
>>>been posted to this message board. It will only take YOU some time to figure out
>>>because your head seems to be stuck in the sand.
>>>
>>>-Tom
>>
>>Give me a link.  I have read almost _everything_ on Intel's web site.  And I
>>don't find detailed descriptions of what is done _internally_ in the core...
>
>I don't feel like doing extra work for you, so I just did a 2-second Google
>search ("xeon hyperthreading split reorder") and found this page from Intel
>presentations:
>
>http://www.extremetech.com/print_article/0,3998,a=16756,00.asp
>
>The slide in the middle ("Thread-Selection Points") clearly shows what's split in
>half: queue, rename, decode, and retire. The schedule, reg read, execute, and
>reg write steps use a toggle that will switch between threads each clock tick if
>data from two threads is ready. Caches are not split; the reason should be
>obvious.
>
>-Tom

As for the above, I haven't seen Intel say that the "rename registers" are
split right down the middle.  The first explanation I saw was quite the
opposite, in fact.
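
Note, though, that the partitioning described in that slide is entirely
consistent with an uneven split.  If the execute stage only toggles
between threads that have work ready, a thread that stalls on cache
misses forfeits its slots to the other one.  A toy model (my own
construction, not from Intel's slides) shows how a 50-50 resource split
can still produce roughly the 70-30 behavior discussed above:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    long slots0 = 0, slots1 = 0;
    int stall1 = 0;                  /* cycles until thread 1 is ready */
    srand(1);
    for (long cycle = 0; cycle < 1000000; cycle++) {
        if (stall1 == 0 && rand() % 100 < 6)
            stall1 = 10;             /* thread 1 "misses": 10-cycle stall */
        int pref = (int)(cycle & 1); /* alternate preference each cycle */
        if (pref == 0 || stall1)
            slots0++;                /* thread 0 is always ready */
        else
            slots1++;
        if (stall1)
            stall1--;
    }
    long total = slots0 + slots1;
    printf("thread 0: %.1f%%  thread 1: %.1f%%\n",
           100.0 * slots0 / total, 100.0 * slots1 / total);
    return 0;
}

In this model the measured split tracks each thread's readiness, not the
static partitioning, so an even resource split need not mean an even
throughput split.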


