Computer Chess Club Archives


Subject: Re: Precharging at DDR ram

Author: Robert Hyatt

Date: 14:56:00 07/17/03


On July 17, 2003 at 00:26:21, Keith Evans wrote:

>On July 16, 2003 at 22:40:10, Vincent Diepeveen wrote:
>
>>On July 16, 2003 at 13:04:40, Keith Evans wrote:
>>
>>>On July 16, 2003 at 07:20:50, Vincent Diepeveen wrote:
>>>
>>>>On July 16, 2003 at 00:44:34, Keith Evans wrote:
>>>>
>>>>>On July 16, 2003 at 00:29:43, Robert Hyatt wrote:
>>>>>
>>>>>>On July 16, 2003 at 00:05:29, Keith Evans wrote:
>>>>>>
>>>>>>>On July 15, 2003 at 23:35:30, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>On July 15, 2003 at 23:05:37, Vincent Diepeveen wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>Now I can again disprove the 130ns figure that Bob keeps giving here for dual
>>>>>>>>>machines, and the even faster figure for a single cpu (up to 60ns or
>>>>>>>>>something). Then I'm sure he'll soon modify his statement to something like
>>>>>>>>>"the time of a hashtable lookup is not interesting to know; instead the
>>>>>>>>>only scientifically interesting thing to know is how much bandwidth a
>>>>>>>>>machine can actually achieve".
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>What is _interesting_ is the fact that you are incapable of even recalling
>>>>>>>>the numbers I posted.
>>>>>>>>
>>>>>>>>to wit:
>>>>>>>>
>>>>>>>>dual xeon 2.8ghz, 400mhz FSB.  149ns latency
>>>>>>>>
>>>>>>>>PIII/750 laptop, SDRAM.  125ns.
>>>>>>>>
>>>>>>>>Aaron posted the 60+ ns numbers for his overclocked athlon.  I assume his
>>>>>>>>numbers are as accurate as mine since he _did_ run lm_bench, rather than
>>>>>>>>something with potential bugs.
>>>>>>>>
>>>>>>>>I can post bandwidth numbers if you want, but that has nothing to do with
>>>>>>>>latency, as those of us understanding architecture already know.
>>>>>>>>
>>>>>>>
>>>>>>>Can you run lmbench and give the latency numbers for different stride sizes?
>>>>>>>Then you could quote numbers from cache,...
>>>>>>>
>>>>>>
>>>>>>Here's my laptop data.  L1 seems to be 4 clocks, L2 9 clocks, and memory
>>>>>>is at 130ns.  This is a PIII/750MHz machine with SDRAM.  I just ran it again
>>>>>>to produce these numbers.
>>>>>>
>>>>>>
>>>>>>
>>>>>>Host                 OS   Mhz   L1 $   L2 $    Main mem    Guesses
>>>>>>--------- -------------   ---   ----   ----    --------    -------
>>>>>>scrappy    Linux 2.4.20   744 4.0370 9.4300       130.2
>>>>>>
>>>>>>>In the lmbench paper they have a nice graph like this.
>>>>>>
>>>>>>
>>>>>>Is the above what you want?
>>>>>
>>>>>I think that it's as close as you're going to get. The most important thing is
>>>>>that 130 [ns] is the largest number. And wouldn't that be a little bit
>>>>>pessimistic even for chess hash tables?
>>>>
>>>>This is optimistic, because those latency numbers are sequential latency
>>>>numbers. You can read faster from gates that are already open at the RAM than
>>>>if you must open a new one at a random spot.
>>>>
>>>>Trivially, with hashtables you have not opened it at that random spot yet.
>>>>
>>>>That is extra latency that adds to this 130. Most likely the total will come
>>>>to somewhere above 280 ns, up to 400 ns, for dual Xeons with DDR ram at 133MHz.
>>>>
>>>>Best regards,
>>>>Vincent
>>>
>>>Let's take a simple example for starters:
>>>
>>>Say that you read from memory location 0x00000000, then 0x01000000, then
>>>0x02000000.
>>>
>>>Do you define this as sequential? What hardware mechanism makes the accesses at
>>>0x01000000 and 0x02000000 occur faster than the first access to location
>>>0x00000000?
>>
>>http://www.vml.co.uk/Data/ddr_256mbit.pdf
>>
>>It describes it a bit. In this case for DDR ram.
>>
>>See for example page 8 the one last line.
>>
>>"200 clock cycles are required between the DLL reset and any read command"
>>
>>
>>then in page 17 the explanation:
>>  "the read command is used to initiate a burst read access to an active row.
>>   ... if auto precharge is selected, the row being accessed will be precharged
>>at the end of the read burst; if auto precharge is not selected  then the row
>>will remain opened for subsequent accesses"
>>
>>
>>and don't forget to checkout page 21.
>>
>>and so on. there is enough data there.
>
>Do you know what a DLL is? It's a delay locked loop - something similar to but
>simpler than a PLL (phase locked loop.) These are often used in digital circuits
>for things like doubling a clock frequency, getting delays which are a fraction
>of a clock long,... (Xilinx has some good material on this which you can check
>out.)
>
>Now the quote that you gave from page 8 is from the section "Initialization -
>DDR SDRAMs must be powered up and initialized in a predefined manner". I don't
>know why you think that this has anything to do with normal reads or writes. The
>200 clock cycles that you refer to are typically a one time operation.
>
>I already know about the second item that you quoted. Notice that my addresses
>were not in the same row. So this does not apply.
>
>You might look at the part that says:
>"3. BA0-BA1 provide bank address and A0-A12 provide row address.
> 4. BA0-BA1 provide bank address; A0-Ai provide column address (where i=8 for
>x16, 9 for x8 and 11 for x4 except A10); A10 HIGH enables the auto precharge
>feature (nonpersistent), A10 LOW disables the auto precharge feature"
>
>Just looking at that do you think that all of the addresses that I gave are in
>the same row?
>
>If not, then doesn't that imply that the row will have to be opened for each
>successive access?
>
>I did some DRAM controller design about 10 years ago, and the internals haven't
>really changed that much. I've never done any DDR design but from a quick look
>here's my SWAG at it:
>
>Let's assume that we need to do an ACTIVE then READ then PRECHARGE with CL=2 DDR
>RAM operating with a clock frequency of 133 MHz. I believe that this adds up to
>about 9 clocks which would be almost 70 ns. See tRCD (18 ns) + tRP (18 ns) plus
>the CL=2 read access. Then you have to add in the additional delays inside of
>the chipset and the processor.
>
>Please point out the missing ns in the above.
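
The estimate quoted above can be checked with a few lines of arithmetic. This
is only a sketch under stated assumptions: tRCD, tRP, CL and the 133 MHz clock
are the figures from the quoted post, each timing is rounded up to whole
clocks, and EXTRA_CLOCKS is a guess (one clock for the data burst) made here
to reach the "about 9 clocks" in the post. The function names are
illustrative and do not come from any datasheet.

```python
import math

CLOCK_MHZ = 133.0        # DDR command clock, from the quoted post
T_RCD_NS = 18.0          # ACTIVE-to-READ delay, from the quoted post
T_RP_NS = 18.0           # PRECHARGE period, from the quoted post
CAS_LATENCY_CLOCKS = 2   # CL=2, from the quoted post
EXTRA_CLOCKS = 1         # ASSUMPTION: one clock for the data burst itself


def clocks_for(duration_ns, period_ns):
    """Round a timing parameter up to a whole number of clocks."""
    return math.ceil(duration_ns / period_ns)


def row_miss_estimate(clock_mhz=CLOCK_MHZ, t_rcd_ns=T_RCD_NS, t_rp_ns=T_RP_NS,
                      cl=CAS_LATENCY_CLOCKS, extra=EXTRA_CLOCKS):
    """Return (clocks, nanoseconds) for ACTIVE + READ + PRECHARGE at the DRAM."""
    period_ns = 1000.0 / clock_mhz
    clocks = (clocks_for(t_rcd_ns, period_ns) + cl
              + clocks_for(t_rp_ns, period_ns) + extra)
    return clocks, clocks * period_ns


if __name__ == "__main__":
    clocks, ns = row_miss_estimate()
    print(f"{clocks} clocks, about {ns:.1f} ns at the DRAM pins")
```

With these inputs the sum is 9 clocks, a little under 68 ns, before any
chipset or processor delay is added.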


I think, after reading his wild rambling, that he is simply misusing the
term "latency".  If you probe a random address in a page that has not been
referenced recently (its translation is not in the TLB), then the hardware has
to read one or two page-table entries from memory to compute the physical
address for that specific random virtual address.  That adds 1-2 _more_ memory
accesses to the probe, making the apparent latency 2-3x higher.  However,
anyone who uses virtual memory knows that this happens, and that is why we
have the TLB in the first place.
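
As a rough illustration of that 2-3x figure, here is a deliberately simple
model. It assumes every page-table read costs one full memory latency (no
page-table entries sitting in cache) and takes the TLB hit rate as a free
parameter. The function name and its parameters are made up for this sketch.

```python
def effective_latency_ns(mem_latency_ns, walk_accesses, tlb_hit_rate):
    """Average latency of one random probe under a simple TLB model.

    mem_latency_ns: raw latency of one memory access (e.g. 130 ns)
    walk_accesses:  extra memory reads needed on a TLB miss (1 or 2 here)
    tlb_hit_rate:   fraction of probes whose translation is in the TLB

    ASSUMPTION: each page-table read costs a full memory latency.
    """
    if not 0.0 <= tlb_hit_rate <= 1.0:
        raise ValueError("tlb_hit_rate must be between 0 and 1")
    hit_cost = mem_latency_ns
    miss_cost = mem_latency_ns * (1 + walk_accesses)
    return tlb_hit_rate * hit_cost + (1.0 - tlb_hit_rate) * miss_cost


if __name__ == "__main__":
    for walk in (1, 2):
        worst = effective_latency_ns(130.0, walk, tlb_hit_rate=0.0)
        print(f"{walk} extra access(es), every probe misses: {worst:.0f} ns")
```

With a 130ns memory access and a TLB miss on every probe, this gives 260ns
for one extra access and 390ns for two.  Any TLB hits pull the average back
toward 130ns.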

And it certainly overlooks the point that not all machines do virtual
memory.  Cray decided early on not to do this in his machines, at Univac,
at CDC, and finally at his own company, Cray Research.  Yes, the
alpha-based machines do virtual memory, but the Cray supercomputers like
the cray-1/x/y/C90/T90 machines do not, because he didn't want _variable_
memory access times depending on whether the address had been accessed
recently or not.  And he didn't have them.  The Cray's latency was static,
period.  It was 120ns on every Cray I ever used, through the T90.  I have not
tried an X1 yet, so I won't claim to know the latency there.  But I'd bet on
120 +/- 30ns.




Last modified: Thu, 07 Jul 11 08:48:38 -0700
