Computer Chess Club Archives


Subject: Re: Who can update about new 64 bits chip?

Author: Robert Hyatt

Date: 12:06:05 08/26/02



On August 26, 2002 at 13:12:42, Jeremiah Penery wrote:

>On August 26, 2002 at 11:16:47, Robert Hyatt wrote:
>
>>On August 26, 2002 at 00:54:33, Jeremiah Penery wrote:
>>
>>>On August 25, 2002 at 23:35:57, Robert Hyatt wrote:
>>>
>>>>On August 25, 2002 at 22:08:26, Jeremiah Penery wrote:
>>>>
>>>>>On August 25, 2002 at 21:50:48, Robert Hyatt wrote:
>>>>>
>>>>>>On August 25, 2002 at 11:21:31, Dan Andersson wrote:
>>>>>>
>>>>>>>>
>>>>>>>>If you look at my response to Vincent, there are a few issues that need
>>>>>>>>addressing because NUMA introduces a much higher memory latency for
>>>>>>>>processors that are farther away from the actual memory word needed.  This
>>>>>>>>has to be addressed to produce reasonable speedups.
>>>>>>>>
>>>>>>>You say much higher. The latency for one CPU is 80 ns. For MP one hop is 115 ns,
>>>>>>
>>>>>>
>>>>>>First, I don't believe _any_ of those numbers from vendors.  If they were
>>>>>>true, the vendors would use that _same_ low-latency memory on the uniprocessor
>>>>>>boxes.  But they don't.  Which says a lot about the "realness" of the numbers.
>>>>>
>>>>>Perhaps you have not read much about the Hammer architecture.  The thing that so
>>>>>greatly reduces latency is that it has a memory controller on the processor die,
>>>>>which scales linearly in speed with the processor.  The memory itself is the
>>>>>same as on any other box.
>>>>>
>>>>
>>>>
>>>>That was my point. The latency is not a controller issue.  It is an issue
>>>>about how quickly you can dump a set of capacitors, measure the voltage, and
>>>>put it back.  I doubt hammer has solved that problem.  Because Cray certainly
>>>>did not with their machines...
>>>
>>>Do you deny that sending it to a slow controller (a controller that runs at FSB
>>>speed, which is mostly 133MHz nowadays), which sends it back to memory, where it
>>>gets processed, sent back through slow controller, then back to the CPU adds any
>>>latency to the process?  Hammer eliminates all that.  And, its memory controller
>>>runs at CPU speed to boot.  Of course you'll still have the normal latencies
>>>expected because DRAM is just slow, but latency for Hammer systems will be a lot
>>>better than current setups.
>>
>>I don't deny that at all.  It probably accounts for a couple of nanoseconds
>>max, as there are practically no gates in existence today that are not switching
>>at rates measured in picoseconds, not nanoseconds.
>>
>>There is _still_ the issue of latency _in_ the memory chips.  That has _not_
>>reduced in 20+ years.  Nor will it until DRAM is replaced by something else
>>that doesn't have to shunt a charge somewhere...
>
>That's what I was saying directly below.
>
>>>Modern memory latencies are somewhere near 70ns for a random read, and ~40ns or
>>>so for subsequent sequential reads.  All the rest of the latency is added by the
>>>slow setup of CPU--memory controller--memory--memory controller--CPU.
>>
>>That is 70ns averaged over 8 bytes.  Which is not what I want to measure.  I
>>want to know "how long does it take from the time I issue a 'load' instruction
>>for some random word of memory until that word is present in the register and
>>I can use it?"  That isn't going to be 70ns + controller time.  All the recent
>>memory tricks, starting with FPM, then EDO, then SDRAM, and on to rambus and
>>whatever, are simply tricks for quickly accessing _additional_ bytes beyond
>>the first 8.  But I want that _first_ one.
>
>That's the number I'm talking about.  The RAM chips themselves take only 70ns or
>so.  ALL the rest of that time comes from sending the data (across rather long
>pathway) through a separate, slow, memory controller, which it has to go through
>twice (and of course along the long pathway).
>
>>I.e., use a new Pentium CPU and the MSR stuff to see how many clocks you
>>have to wait on random memory reads.  Then multiply that by the clock cycle
>>time.  _That_ is the number I am interested in.  Not the magical nonsense
>>quoted by manufacturers.  Hint:  RDRAM.  It's a dog for random access.
>
>Well, the new P4s have a clock cycle time of some 0.4ns, so of course you have to wait
>more clocks.  And I know RDRAM sucks for latency, because the travel path runs
>along the length of ALL the RIMMs (and back), in conjunction with the long path
>already to/from the CPU/memory controller.
>
>>>>>In all current processor/chipset configurations, the CPU has to send a memory
>>>>>request to the Northbridge of the motherboard, which runs at a low clock speed.
>>>>>The northbridge has to send the request on to the main memory, which sends it
>>>>>back through the same channel.  Hammer eliminates the northbridge setup
>>>>>completely - memory requests go directly from the processor to the memory banks,
>>>>>via a high-speed HyperTransport tunnel.
>>>>
>>>>
>>>>That's ok... but it doesn't solve the 100ns delay to dump capacitors...
>>>
>>>Nah, doesn't take quite that long. :)
>>
>>It is _very_ close...  On most machines today it is actually longer.
>>
>>100ns is 100 clocks at 1ghz.  I _wish_ I could do random access reads in
>>100 clocks...
>
>As I said, much of that time must come from the long travel path, slow memory
>controller, etc.

We just disagree on where the delay is.  I believe that at _least_ 80% of
the latency is in the chip.  Not in the controller/bus...


>
>>>Even if it did, Hammer would have way lower latency than current setups, which
>>>are on the order of 200-300ns minimum.
>>
>>I don't see memory that slow today.  I am seeing numbers in the 120-160ns
>>range for random latency.  Hammer isn't going to beat that by much if any.
>>But it is easier to wait and test, since it is impossible to depend on the
>>marketing hype...
>
>If you see 140ns today (average), you don't believe that almost half of that
>latency is caused by the travel path from CPU/controller/memory and back?  If
>the memory controller runs at bus speed (133MHz), it has 7.5ns/clock cycle.
>That alone is significant latency added to the process.

I don't believe it, no.  I believe that most of the latency is in the DRAM
itself, not in the controller.  The controller has no "capacitors" to deal
with, it is made up of SRAM buffers and some form of hardware logic (such
as TTL) which means switching times are at the picosecond level.  It takes
a _bunch_ of picoseconds to add up to a nanosecond...




>
>>>>>With multiple CPUs, an access goes through HyperTransport to whatever CPU is
>>>>>directly connected to the needed memory first, then proceeds the same way.  Even
>>>>>with this extra step, it is AT LEAST as fast as current CPU-Northbridge-Memory
>>>>>setups (it is the same number of steps as that configuration then), because
>>>>>HyperTransport in general has lower latency than most (all?) current
>>>>>communication protocols.
>>>>
>>>>
>>>>Now you get to _the_ issue.  For streaming memory requests, the above sounds
>>>>good.  But for random reads/writes, the latency is not in the controller or
>>>>bus, it is in the memory chips themselves...
>>>>
>>>>For chess I don't care about streaming memory references.  That is something
>>>>I would be interested in for a large vector-type application, and that is what
>>>>Cray has been doing so well for years.  But a 60 million dollar Cray _still_
>>>>can't overcome that random access latency.  Neither will Hammer...
>>>>
>>>>>
>>>>><Large snip>
>>>>>
>>>>>>Hopefully we will see some real numbers soon...  But a memory controller on
>>>>>>chip speaks to problems with more than one chip...
>>>>>
>>>>>I eagerly await real numbers also.  It's possible that the quoted numbers are
>>>>>lower than any real-world figure we may see, but I suspect that memory latency
>>>>>for Hammer systems will be considerably lower than any current setup, at the
>>>>>very least.
>>>>
>>>>I suspect they will be _identical_ for random accesses.  Which is the kind
>>>>of accesses we mainly do in CC.
>>>
>>>That would be absurd, considering the amount of overhead removed from the
>>>'normal' process by putting memory controller on the processor die.
>>
>>
>>As I said, a memory controller is not a huge animal.  We are talking nanoseconds
>>at most here.  Not 100's of nanoseconds, or even many 10's...
>
>Even a few 10s reduces your number of 120ns to the claimed 80ns of Hammer. :)

I'll believe 80 when I actually get my hands on it.  :)  Because that would
be faster than any Cray ever made that used DRAM (older Crays used bipolar
memory, but it was not nearly as dense).


>
>>>>>As for problems with more than one chip, it doesn't look to cause any kind of
>>>>>problems due to the way it's being handled with multiple HyperTransport tunnels.
>>>>> However, like anything else, we can only wait and see what real figures look
>>>>>like.
>>>>
>>>>Two controllers == conflicts for the bus.  More than two controllers == more
>>>>conflicts...   That has always been a problem.  One potential solution is
>>>>multi-ported memory.  That has its own problems, however, as now you move
>>>>the conflicts into the memory bank itself...
>>>
>>>The bus used in Hammer systems is pretty much nothing like anything used in
>>>current systems.  Bandwidth scales linearly with number of processors.
>>
>>
>>Bandwidth is not interesting in the context of chess.  That is why RDRAM
>>is a dog for chess engines.  High Bandwidth.  High Latency.
>
>Yep, RDRAM sucks for that.
>
>>The two terms (bandwidth and latency) can't be interchanged.  Either (or both)
>>may be more important depending on the application.  For chess, it is latency
>>over bandwidth.
>
>But I don't see why you can't have good latency and high bandwidth.

You can.  But having one doesn't guarantee the other, which was my point.


>
>>>Processors are connected to each other (and to memory) by high bandwidth,
>>>bi-directional, low-latency HyperTransport tunnels.  There should be practically
>>>no scaling or conflict issues, judging from the data I've seen.
>>
>>
>>"practically no scaling issues...".  :)  They are better than Cray?
>>
>>:)
>
>Not sure how you derive that from my comments.  Of course they will have some
>issues, eventually, but I'm talking about desktop/small server systems.  Cray is
>in a completely different class altogether.  You always complain about 2-CPU
>systems having a lack of bandwidth, as well as 8-CPU ones - Hammer systems
>should not have those problems.  I have no clue how the things would do at like
>64+ processors, because generally that's not a 'normal' system anymore, when you
>have to link multiple nodes with crossbar switches or whatever.

I was talking about Cray from the perspective that they have never had an 80ns
memory access time.  It has _always_ been over 100ns since they moved away from
bipolar memory to DRAM for density.  And their controllers have _never_ "sucked".

:)


>
>>>Of course, like everything else, I could be wrong.  Again, the only thing we can
>>>really do is wait and see. :)
>>
>>
>>There we agree...




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.