Computer Chess Club Archives


Subject: Re: Magic 200MHz

Author: Robert Hyatt

Date: 21:19:49 05/26/03



On May 26, 2003 at 15:24:02, Tom Kerrigan wrote:

>On May 26, 2003 at 13:54:44, Robert Hyatt wrote:
>>On May 24, 2003 at 20:04:28, Tom Kerrigan wrote:
>>>Second, you're right, I didn't think it out. Since both logical processors use
>>>the same caches, there IS no cache coherency problem. In fact, cache coherency
>>>is completely unrelated to hyperthreading, isn't it? I don't even know why you
>>>would use the two terms in the same sentence. But here you are, using "cache
>>>coherency" as an explanation for one logical processor running a thread faster
>>>than the other. ("Well-thought-out hunch," sure.)
>>
>>I have talked about two problems.  One is the "unbalanced" hyperthreading,
>>when running Crafty.  And I carefully explained that I can run two threads
>>with SMT off, or four with SMT on.  I can't run a one and two thread test as
>>my machine has two processors and one can't be removed without a terminator
>>that I don't have.
>>
>>Therefore, the cache coherency issue seems to be important in that it is hurting
>>performance with two physical processors.  It seems to be the only viable (at
>>the moment) explanation for the unbalanced SMT performance as well.
>
>Fine. What was your point again? That HT's design favors one logical processor
>over the other? Because the cache coherency system _might_? Okay, I give up. One
>_small_ aspect of HT _may_ favor a specific logical processor, based on one
>experiment with one program that hasn't been independently reproduced. Man, you
>sure won that argument. (I especially like all the handwaving about P3 cache
>line sizes even though the discussion was about HT.)

There is no "hand waving".

I mentioned cache as a possible explanation of why my PIII systems run
threaded Crafty much more efficiently than PIV systems run it.  I have looked
at the problem under a microscope for a couple of months, playing with alignment
issues and other ideas.  The problem remains.  And the only visible difference
between the PIII and PIV is cache line size.  On a PIII, when one CPU writes to
a memory location it invalidates the corresponding cache line on all other
CPUs, which is a total of 32 bytes.  On the PIV it is 128 bytes.  Whether that
is the full story of my problem or not is not yet known.

But since it _is_ a potential problem, and since my program is also seeing a
very odd SMT balance between logical cpus, it is a potential issue there as
well.

Nothing more, nothing less...
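
To make the cache line point concrete, here is a rough sketch of the arithmetic, in Python only because it is easy to run.  Everything in it is an assumption for illustration: the base address, the 40-byte per-thread block, and the function names are made up and have nothing to do with Crafty's actual data layout.  It only shows that two blocks which sit in separate lines at 32 bytes per line can end up in the same line at 128 bytes per line, so a write by one CPU would invalidate the other CPU's copy.

```python
def cache_line_index(address: int, line_size: int) -> int:
    """Return the index of the cache line that a byte address falls into."""
    return address // line_size


def share_line(addr_a: int, addr_b: int, line_size: int) -> bool:
    """True if both addresses live in the same cache line, so a write to
    one would invalidate the line holding the other on another CPU."""
    return cache_line_index(addr_a, line_size) == cache_line_index(addr_b, line_size)


def pad_to_line(block_size: int, line_size: int) -> int:
    """Round a per-thread block size up to a whole number of cache lines."""
    return -(-block_size // line_size) * line_size


# Hypothetical layout: two 40-byte per-thread blocks stored back to back,
# starting at a 128-byte-aligned base address.
BASE = 0x1000
BLOCK = 40
thread0 = BASE
thread1 = BASE + BLOCK

for line_size in (32, 128):
    shared = share_line(thread0, thread1, line_size)
    padded = pad_to_line(BLOCK, line_size)
    print(f"line size {line_size:3d}: blocks share a line = {shared}, "
          f"padded block size = {padded} bytes")
```

Padding each per-thread block to a full line, as pad_to_line() does, is the usual way to keep the blocks apart.  Whether that kind of sharing is what is actually hurting here is exactly the part that is not yet known.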


>
>>I know a _lot_ about what I am talking about.  Do you know what a "processor"
>>does when there is no process ready to schedule?
>>
>>Didn't think so.
>
>After the operating system issues a HALT, any reasonably current processor goes
>into a low-power mode and doesn't execute any instructions until it receives an
>interrupt. So, yes, I do know.
>
>>And again, so what?  Who does a "halt"?  Windows .net server _might_.  But
>>no others I have tested...
>
>Any version of Windows NT and Linux. Really, Bob, just search for "halt
>instruction windows" in Google.

I don't need to.  I have this really nasty habit of fiddling with the linux
kernel source frequently.  I _know_ what it does.


>
>Besides, how do you explain HT processors (with HT enabled) running single
>threaded programs at full speed, as they do in all online hardware reviews? If
>the operating systems aren't issuing HALT instructions (as you contend), that
>single thread is only getting half the chip's resources. Doesn't seem likely
>that it would run at full speed with half the resources, does it?

First, your statement is wrong.  There have been reports that running a
single thread with SMT on runs slower than a single thread with SMT off.

That was where this discussion started on RC5 in fact.

As _I_ have repeatedly said, I have _never_ seen a case where a single
thread runs slower with SMT on.  That was my original claim.  I have yet
to see anything _different_ anywhere.

So I don't quite see what your point is unless it is to reinforce _my_ point
about SMT not slowing things down in any way I can see...  Unless we talk about
the case of running two threads using two logical cpus being slower than running
one thread on one real cpu.  I can see where _that_ could cause speed issues in
lots of ways, particularly with a parallel search.


>
>Wait a minute, don't you _teach_ this stuff?
>
>>>"The out-of-order execution engine has several buffers to perform its
>>>re-ordering, tracing, and sequencing operations. ... Some of these key buffers
>>>are partitioned such that each logical processor can use at most half the
>>>entries. Specifically, each logical processor can use up to a maximum of 63
>>>re-order buffer entries, 24 load buffers, and 12 store buffer entries."
>>>
>>>Of course, I'm sure you already knew all of this, what with having read most
>>>everything on Intel's web site.
>>
>>"some of these key buffers are..."
>>
>>Very precise, as I mentioned.
>
>"I have concluded that _most_ of the resources are dynamically allocated between
>the two logical processors 'as needed'."
>
>So if you don't think that those buffers constitute "most" of the OOOE
>resources, then how about you name which huge, important buffers are dynamically
>allocated? Please, Bob, grace us with your infinite hyperthreaded wisdom.


The _critical_ resources are the "pipes" that execute micro-ops, and the
register rename pool (not the rename tables) that holds enough data to keep
things busy.  Followed by memory read/write buffers...




>
>-Tom




Last modified: Thu, 15 Apr 21 08:11:13 -0700
