Computer Chess Club Archives



Subject: Re: Latency versus Information Bandwidth: Questions

Author: Matt Taylor

Date: 14:15:43 12/07/02


On December 06, 2002 at 23:48:52, Robert Hyatt wrote:

>On December 06, 2002 at 16:44:43, Matt Taylor wrote:
>
>>On December 06, 2002 at 07:32:57, Vincent Diepeveen wrote:
>>
>>>On December 05, 2002 at 01:14:18, Jeremiah Penery wrote:
>>>
>>>>On December 04, 2002 at 23:23:32, Robert Hyatt wrote:
>>>>
>>>>>>Current AthlonMP chipsets also have a separate bus per CPU.  They use the same
>>>>>>EV6 bus as Alpha processors did (or still do?).  The memory modules are shared,
>>>>>>whereas Hammer will have separate memory modules for each processor.
>>>>>
>>>>>
>>>>>The problem with that is it turns into a NUMA architecture which has its _own_
>>>>>set of problems.  One cpu connected to one memory module means that the other
>>>>>CPU can't get to it as efficiently...
>>>>>
>>>>>I.e., this doesn't offer one tiny bit of improvement over an SMP-type machine
>>>>>with shared memory...  Unless the algorithm is specifically designed to
>>>>>localize memory references and to duplicate data that is needed often by both
>>>>>threads...
>>>>>
>>>>>This might be an improvement for running two programs at once.  For one
>>>>>program using two processors, NUMA offers additional challenges for the
>>>>>parallel programmer...
>>>>
>>>>According to all documentation, which I have no reason to doubt, a non-local
>>>>memory access in a Hammer system is just as fast as a memory access in a
>>>>processor/chipset combination where the memory controller resides in the
>>>>northbridge (i.e. all other x86 configurations).  Local memory accesses are
>>>>quite a lot faster.  Therefore, the average memory latency, even in 8-way
>>>>machines where an access can take up to 3 hops, is still below that of any
>>>>x86 machine of today.
>>>
>>>If you read the documentation at face value, you get confronted with
>>>theoretical data that doesn't take into account the worst-case parts of
>>>the configuration.
>>>
>>>Bob is nearer the truth here than you might want to guess: as soon as you
>>>actually run on those supercomputers with a certain theoretical peak
>>>performance and test for yourself, the practical peak can be up to 50 times
>>>slower than the theoretical data suggests.
>>>
>>>So on paper this is way faster and even works up to 8 CPUs (which it is
>>>unlikely we will ever see working); like good propagandists, those papers
>>>are not going to tell you about the weak spots in the design that prevent
>>>that *theoretical* performance from happening in reality.
>>>
>>>If they get this dual-CPU machine to work, we will see what its speed is.
>>>
>>>For now I assume, like Bob, that it's a cluster.
>>>
>>>Note that it's nearly impossible to get an 8-CPU machine with that
>>>architecture to work.  Imagine how complex its design will be.
>>>
>>>Which OS will work on that?
>>>
>>>Best regards,
>>>Vincent
>>
>>First of all, this is a crossbar. Other crossbar systems have scaled up to 64
>>nodes, or so I've heard. Crossbar performance is -much- better than that of
>>your typical NUMA system. Economical crossbar systems take the same approach
>>AMD is taking: each node adds a crossbar to the system. I don't think that's a
>>coincidence.
>
>Yes, but check the math.  Cray scaled their T90 up to 32 processors.  Over 1/2
>the _total_ cost of the machine was in the memory interconnections (crossbar).
>
>But someone is wrong somewhere, and I don't claim to be an AMD expert.  However
>if a processor has a dedicated path to memory there is no crossbar.  If it has
>a crossbar path to memory there is no dedicated path.  How would you do _both_
>without it smelling like NUMA?
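
Regarding the quoted point about localizing memory references: that is exactly
what NUMA programming comes down to, and it is less exotic than it sounds. Pin
each thread to one CPU and let a first-touch page placement policy (the usual
default on NUMA-aware OSes) put each thread's working set in that CPU's local
memory. A minimal sketch in C, assuming a Linux-style sched_setaffinity() call
and first-touch placement; the thread count and buffer size are made up:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdlib.h>
    #include <string.h>

    #define NTHREADS 2                    /* one thread per CPU/node */
    #define BUFSIZE  (64u * 1024 * 1024)  /* 64 MB working set each  */

    struct worker { int cpu; char *buf; };

    static void *run(void *arg)
    {
        struct worker *w = arg;
        cpu_set_t mask;

        /* Pin this thread to one CPU, and hence to one memory node. */
        CPU_ZERO(&mask);
        CPU_SET(w->cpu, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);

        /* First touch: this thread faults the pages in, so the OS
           allocates them from the node the thread is pinned to. */
        w->buf = malloc(BUFSIZE);
        memset(w->buf, 0, BUFSIZE);

        /* ...all later accesses to w->buf are node-local... */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct worker w[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++) {
            w[i].cpu = i;
            pthread_create(&tid[i], NULL, run, &w[i]);
        }
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

Duplicating read-mostly data per node, as Bob suggests, follows the same
pattern: allocate and initialize a copy from within each pinned thread.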

Opteron is using HyperTransport for an interconnect. It's being spearheaded by
at least AMD and nVidia (as well as others I have forgotten). In fact, nVidia
used it in the northbridge/southbridge they built for Microsoft for the Xbox.
Not only does it look good on paper; it works in practice.

http://www.amd.com/us-en/assets/content_type/DownloadableAssets/HyperTransport_-_Chris_Neuts.pdf

I can't find any official figures, but HyperTransport latency is supposed to be
low. AMD claims that, even without NUMA optimization, an 8-way Opteron system
can access remote memory faster than current shared-bus architectures access
memory at all. A dual-Opteron is faster in practice than a dual-Xeon or
dual-Athlon. Remember, they save latency (~20%) by incorporating the memory
controller on-die. Even though Opteron is NUMA, the latency of accessing
another CPU's memory is quite possibly lower than that of current shared-bus
systems, especially in 2-way configurations.
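
I have not measured an Opteron myself, but latency claims like these are easy
to check with a dependent pointer chase (the same idea as lmbench's
lat_mem_rd): each load's address comes out of the previous load, so the loads
cannot be overlapped and the time per iteration approximates memory latency. A
rough sketch; the buffer size and iteration count are arbitrary:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     (16u * 1024 * 1024 / sizeof(void *)) /* 16 MB > cache */
    #define ITERS (10u * 1024 * 1024)

    int main(void)
    {
        void **buf   = malloc(N * sizeof(void *));
        size_t *perm = malloc(N * sizeof(size_t));
        size_t i, j, t;
        void **p;
        clock_t t0, t1;

        /* Shuffle 0..N-1 (Fisher-Yates), then link the slots into one
           random cycle so every load depends on the previous one. */
        for (i = 0; i < N; i++)
            perm[i] = i;
        for (i = N - 1; i > 0; i--) {
            j = rand() % (i + 1);
            t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (i = 0; i < N; i++)
            buf[perm[i]] = &buf[perm[(i + 1) % N]];

        /* Chase the pointers; elapsed time / ITERS ~ load latency.
           Printing p keeps the compiler from deleting the loop. */
        t0 = clock();
        p = &buf[perm[0]];
        for (i = 0; i < ITERS; i++)
            p = *p;
        t1 = clock();

        printf("%p  ~%.1f ns/load\n", (void *)p,
               (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / ITERS);
        return 0;
    }

On a NUMA box you would pin the thread to one CPU and place the buffer on the
local or the remote node (first-touch again) to get the two numbers separately.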

The point about the dedicated bandwidth was that each chip comes with 2.7
GB/sec of built-in memory bandwidth, and as you add chips you get more. If
every CPU is hammering the same (small) segment of memory, the caches will
absorb those transactions (via cache-to-cache transfers) rather than main
memory. In fact, it should be quite difficult to design an algorithm that an
Opteron system *can't* handle as efficiently as an equivalent Athlon/P4 system.
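
To put rough numbers on the bandwidth point: with perfect locality, aggregate
memory bandwidth scales with the number of chips,

    1-way:  1 x 2.7 GB/s =  2.7 GB/s
    2-way:  2 x 2.7 GB/s =  5.4 GB/s
    4-way:  4 x 2.7 GB/s = 10.8 GB/s
    8-way:  8 x 2.7 GB/s = 21.6 GB/s

Real programs won't achieve perfect locality, but a shared bus divides one
fixed pool of bandwidth among all N CPUs, so the gap only widens as N grows.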

-Matt


