Computer Chess Club Archives



Subject: Re: Who can update about new 64 bits chip?

Author: Robert Hyatt

Date: 08:07:25 08/26/02



On August 26, 2002 at 05:13:35, Vincent Lejeune wrote:

>On August 25, 2002 at 23:35:57, Robert Hyatt wrote:
>
>>On August 25, 2002 at 22:08:26, Jeremiah Penery wrote:
>>
>>>On August 25, 2002 at 21:50:48, Robert Hyatt wrote:
>>>
>>>>On August 25, 2002 at 11:21:31, Dan Andersson wrote:
>>>>
>>>>>>
>>>>>>If you look at my response to Vincent, there are a few issues that need
>>>>>>addressing because NUMA introduces a much higher memory latency for
>>>>>>processors that are farther away from the actual memory word needed.  This
>>>>>>has to be addressed to produce reasonable speedups.
>>>>>>
>>>>>You say much higher. The latency for one CPU is 80 ns. For MP one hop is 115 ns,
>>>>
>>>>
>>>>First, I don't believe _any_ of those numbers from vendors.  If they were
>>>>true, the vendors would use that _same_ low-latency memory on the uniprocessor
>>>>boxes.  But they don't.  Which says a lot about the "realness" of the numbers.
>>>
>>>Perhaps you have not read much about the Hammer architecture.  The thing that so
>>>greatly reduces latency is that it has a memory controller on the processor die,
>>>which scales linearly in speed with the processor.  The memory itself is the
>>>same as on any other box.
>>>
>>
>>
>>That was my point.  The latency is not a controller issue.  It is an issue
>>of how quickly you can dump a set of capacitors, measure the voltage, and
>>put it back.  I doubt Hammer has solved that problem, because Cray certainly
>>did not with their machines...
>>
>>
>>>In all current processor/chipset configurations, the CPU has to send a memory
>>>request to the Northbridge of the motherboard, which runs at a low clock speed.
>>>The Northbridge has to send the request on to the main memory, which sends the
>>>data back through the same channel.  Hammer eliminates the Northbridge setup
>>>completely - memory requests go directly from the processor to the memory banks,
>>>via a high-speed HyperTransport tunnel.
>>
>>
>>That's ok... but it doesn't solve the 100ns delay to dump capacitors...
>>
>>
>>
>>>
>>>With multiple CPUs, an access goes through HyperTransport to whatever CPU is
>>>directly connected to the needed memory first, then proceeds the same way.  Even
>>>with this extra step, it is AT LEAST as fast as current CPU-Northbridge-Memory
>>>setups (it is the same number of steps as that configuration then), because
>>>HyperTransport in general has lower latency than most (all?) current
>>>communication protocols.
>>
>>
>>Now you get to _the_ issue.  For streaming memory requests, the above sounds
>>good.  But for random reads/writes, the latency is not in the controller or
>>bus, it is in the memory chips themselves...
>>
>>For chess I don't care about streaming memory references.  That is something
>>I would be interested in for a large vector-type application, and that is what
>>Cray has been doing so well for years.  But a 60 million dollar Cray _still_
>>can't overcome that random access latency.  Neither will Hammer...
>>
>>>
>>><Large snip>
>>>
>>>>Hopefully we will see some real numbers soon...  But a memory controller on
>>>>chip speaks to problems with more than one chip...
>>>
>>>I eagerly await real numbers also.  It's possible that the quoted numbers are
>>>lower than any real-world figure we may see, but I suspect that memory latency
>>>for Hammer systems will be considerably lower than any current setup, at the
>>>very least.
>>
>>I suspect they will be _identical_ for random accesses.  Which is the kind
>>of accesses we mainly do in CC.
>>
>>
>>>
>>>As for problems with more than one chip, it doesn't look to cause any kind of
>>>problems due to the way it's being handled with multiple HyperTransport tunnels.
>>> However, like anything else, we can only wait and see what real figures look
>>>like.
>>
>>Two controllers == conflicts for the bus.  More than two controllers == more
>>conflicts...   That has always been a problem.  One potential solution is
>>multi-ported memory.  That has its own problems, however, as now you move
>>the conflicts into the memory bank itself...
>
>
>As I saw in the paper: each memory controller accesses its "local" memory; when
>the requested data is not present in its memory, it sends a request to the other
>controllers through the HyperTransport bus.  There's no conflict, and everything
>is designed to avoid bus overhead, I think...
>
>Waiting for the real numbers ...


Read that again, carefully: "local memory".  This is NUMA.  The penalty for
accessing memory that is _not_ local is significant.  The penalty for accessing
local memory is still 100 ns or so, because nobody knows how to reduce
resistance, capacitance and inductance together.
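
To put a number on that, here is a minimal pointer-chasing sketch in C (purely
illustrative; the sizes and timing method are arbitrary choices, not anything
from a real engine or a vendor benchmark).  Every load depends on the previous
one, so a faster controller or a wider bus cannot hide the DRAM access time,
and the time per step approximates the raw random-access latency being
discussed here.

/* latency.c -- rough pointer-chasing latency probe (illustrative only).
   Build with something like: cc -O2 latency.c -o latency               */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      (16u * 1024 * 1024)    /* 16M entries, far larger than cache */
#define STEPS  (50 * 1000 * 1000L)    /* dependent loads to time            */

static unsigned long long rng = 88172645463325252ULL;
static unsigned long long xorshift64(void) {     /* small PRNG, avoids rand() */
  rng ^= rng << 13;  rng ^= rng >> 7;  rng ^= rng << 17;
  return rng;
}

int main(void) {
  size_t *chain = malloc(N * sizeof(size_t));
  if (!chain) return 1;

  /* Sattolo's algorithm: build a single-cycle random permutation so the
     chase never falls into a short (cacheable) loop.                    */
  for (size_t i = 0; i < N; i++) chain[i] = i;
  for (size_t i = N - 1; i > 0; i--) {
    size_t j = (size_t)(xorshift64() % i);
    size_t t = chain[i];  chain[i] = chain[j];  chain[j] = t;
  }

  /* Each load address depends on the previous load's value, so the loop
     pays essentially full memory latency on every iteration.            */
  clock_t start = clock();
  size_t p = 0;
  for (long s = 0; s < STEPS; s++) p = chain[p];
  double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

  printf("%.1f ns per dependent load (sink=%zu)\n", secs * 1e9 / STEPS, p);
  free(chain);
  return 0;
}

If the chain happens to live in another node's memory on a NUMA box, the
per-load figure goes up by the remote-access penalty, not down.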

When you have multiple processors there will be significant conflicts.  I don't
know whether that "HyperTransport bus" is full-duplex or not.  If it is, it
might work OK for two processors, but beyond two there is no easy way to manage
the contention.  I assume it is a "normal" bus, which means that if the two
processors want to access each other's local memory, one is definitely going to
wait.  And that also means there is some sort of bus negotiation protocol,
which extends latency as well...
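
And to show why chess programs are stuck with random accesses rather than
streaming ones, here is a sketch of the kind of transposition-table probe that
generates much of an engine's memory traffic.  The names, sizes and entry
layout are made up for illustration (this is not Crafty's code): the point is
simply that the Zobrist hash picks an effectively random slot in a table far
larger than cache, so nearly every probe pays the full memory latency.

/* tt.c -- illustrative transposition-table probe; names and layout are
   invented for the example, not taken from any real engine.            */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
  uint64_t key;      /* full Zobrist hash, for verification          */
  int16_t  score;
  uint8_t  depth;
  uint8_t  flags;    /* exact score / lower bound / upper bound      */
} TTEntry;

#define TT_ENTRIES (1u << 22)       /* 4M entries, ~64 MB: well past cache */
static TTEntry *tt;

/* One dependent load into an unpredictable cache line.  The low bits of
   the hash pick the slot, so successive probes scatter across the whole
   table -- streaming bandwidth does not help; only latency matters.     */
static TTEntry *tt_probe(uint64_t zobrist_key) {
  TTEntry *e = &tt[zobrist_key & (TT_ENTRIES - 1)];
  return (e->key == zobrist_key) ? e : NULL;
}

int main(void) {
  tt = calloc(TT_ENTRIES, sizeof(TTEntry));
  if (!tt) return 1;
  /* An arbitrary 64-bit key, just to exercise the probe path. */
  printf("probe %s\n", tt_probe(0x123456789abcdef0ULL) ? "hit" : "miss");
  free(tt);
  return 0;
}

On a NUMA machine that table ends up spread across the nodes, so some fraction
of the probes also pay the remote-hop penalty, which is exactly the placement
problem mentioned at the top of this thread.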


