Computer Chess Club Archives

Subject: Re: Latency versus Information Bandwidth: Questions

Author: Robert Hyatt
Date: 20:48:52 12/06/02
On December 06, 2002 at 16:44:43, Matt Taylor wrote:

>On December 06, 2002 at 07:32:57, Vincent Diepeveen wrote:
>
>>On December 05, 2002 at 01:14:18, Jeremiah Penery wrote:
>>
>>>On December 04, 2002 at 23:23:32, Robert Hyatt wrote:
>>>
>>>>>Current AthlonMP chipsets also have a separate bus per CPU.  They use the same
>>>>>EV6 bus as Alpha processors did (or still do?).  The memory modules are shared,
>>>>>whereas Hammer will have separate memory modules for each processor.
>>>>
>>>>
>>>>The problem with that is it turns into a NUMA architecture which has its _own_
>>>>set of problems.  One cpu connected to one memory module means that the other
>>>>CPU can't get to it as efficiently...
>>>>
>>>>IE this doesn't offer one tiny bit of improvement over a SMP-type machine with
>>>>shared memory...  Unless the algorithm is specifically designed to attempt to
>>>>localize memory references and duplicate data that is needed by both threads
>>>>often...
>>>>
>>>>This might be an improvement for running two programs at once.  For one
>>>>program using two processors, NUMA offers additional challenges for the
>>>>parallel programmer...
>>>
>>>According to all documentation, which I have no reason to doubt, a non-local
>>>memory access in a Hammer system is just as fast as a memory access in a
>>>processor/chipset combination where the memory controller resides in the
>>>northbridge (i.e. all other x86 configurations).  Local memory accesses are
>>>quite a lot faster.  Therefore, the average case, even in 8-way machines that
>>>take up to 3 hops for a memory access, is still below that of any x86 machine of
>>>today.
>>
>>If you read the documentation as written, you are confronted with
>>theoretical figures that ignore every worst-case aspect of
>>the configuration.
>>
>>Bob is nearer the truth here than you might want to guess, because
>>as soon as you run on those supercomputers with a certain theoretical
>>peak performance and test it yourself, the practical peak
>>is up to 50 times slower than the theoretical figures suggest.
>>
>>So on paper this is way faster and even works up to 8 CPUs (which
>>we are unlikely ever to see working). Like good propagandists, those
>>papers are not going to tell you the weak spots in the design which
>>prevent that *theoretical* performance from happening in reality.
>>
>>In case they get this dual CPU to work we will see what its speed is.
>>
>>For now I assume it's a cluster, like Bob does.
>>
>>Note that it's nearly impossible to get an 8 CPU machine to work with
>>that architecture. Imagine how complex its design will be.
>>
>>Which OS will work on that?
>>
>>Best regards,
>>Vincent
>
>First of all, this is a crossbar. Other crossbar systems have scaled up to 64
>nodes, or so I've heard. Crossbar performance is -much- better than your typical
>NUMA system. Economic crossbar systems take the same approach AMD is taking:
>each node adds a crossbar to the system. I don't think that's coincidence.

Yes, but check the math.  Cray scaled their T90 up to 32 processors.  Over 1/2
the _total_ cost of the machine was in the memory interconnections (crossbar).
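
To put rough numbers on that cost argument, here is a minimal Python sketch.
It assumes a full crossbar needs one crosspoint switch per (processor, memory
bank) pair and, purely for illustration, as many banks as processors.  The
function names are made up for this example, and none of this models the
T90's actual memory system.

```python
def crossbar_crosspoints(n_cpus, n_banks):
    """Crosspoint switches in a full crossbar: one per (CPU, bank) pair."""
    return n_cpus * n_banks


def shared_bus_taps(n_cpus, n_banks):
    """Connections on a single shared bus: one tap per attached device."""
    return n_cpus + n_banks


if __name__ == "__main__":
    # Assumption for illustration only: number of banks == number of CPUs.
    for n in (2, 4, 8, 16, 32):
        print(n, crossbar_crosspoints(n, n), shared_bus_taps(n, n))
```

Under these assumptions the crosspoint count grows with the square of the
processor count while the shared-bus count grows linearly, which is one way
to see why the interconnect can come to dominate the cost of a large machine.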

But someone is wrong somewhere, and I don't claim to be an AMD expert.  However,
if a processor has a dedicated path to memory there is no crossbar.  If it has
a crossbar path to memory there is no dedicated path.  How would you do _both_
without it smelling like NUMA?
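
The latency claim quoted further up (local accesses fast, non-local ones
slower) comes down to a weighted average.  A small sketch, with made-up
latencies: LOCAL_NS and REMOTE_NS are placeholders, not measured Hammer or
Athlon figures, and average_latency_ns is a hypothetical helper.

```python
def average_latency_ns(local_fraction, local_ns, remote_ns):
    """Mean access latency when local_fraction of accesses hit local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Placeholder values, assumed for illustration only (not vendor data).
LOCAL_NS = 80.0
REMOTE_NS = 140.0

if __name__ == "__main__":
    # 1.0: everything local; 0.5: two nodes, data spread evenly;
    # 0.25: four nodes, data spread evenly.
    for fraction in (1.0, 0.5, 0.25):
        print(fraction, average_latency_ns(fraction, LOCAL_NS, REMOTE_NS))
```

With data spread evenly and no attempt to localize references, the local
fraction falls as nodes are added, so the average drifts toward the remote
figure.  Whether that average still beats a shared-memory SMP box depends
entirely on the real numbers, which this sketch does not supply.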


>
>Second of all, any OS that supports SMP on shared-bus will support Opteron. All
>of the cache coherency and switching is done in hardware. Optimization can be
>made by recognition that this is a NUMA architecture. However, I think the MP
>1.4 spec which has been available for a couple years allows the specification of
>NUMA configurations. Linux64 and Windows XP 64-bit are old, old announcements.
>
>Third of all, AMD demonstrated a quad-Opteron system at Computex Taipei 2002 (in
>addition to a fair number of other shows). The biggest hurdle in 8-way Opterons
>is finding PCB real-estate. I don't think AMD would be promising such systems if
>they were uncertain whether or not they could deliver, even if other companies
>(Rambus) do.
>
>Right now performance of Opteron systems is admittedly bad, but the performance
>of most prototypes is. As for the chips themselves, an 800 MHz Clawhammer
>prototype was reportedly faster than the 1.6 GHz Willamette in 32-bit code.
>
>-Matt
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.