Computer Chess Club Archives


Subject: Re: Since the CPU is what really count for Chess !

Author: Robert Hyatt

Date: 14:29:58 03/19/03


On March 19, 2003 at 15:58:37, Matt Taylor wrote:

>On March 19, 2003 at 13:52:31, Robert Hyatt wrote:
>
>>On March 19, 2003 at 12:53:42, Matt Taylor wrote:
>>
>>>On March 18, 2003 at 23:09:08, Robert Hyatt wrote:
>>>
>>>>On March 18, 2003 at 19:45:43, Tom Kerrigan wrote:
>>>>
>>>>>On March 18, 2003 at 18:20:14, Robert Hyatt wrote:
>>>>>
>>>>>>On March 18, 2003 at 17:46:10, Tom Kerrigan wrote:
>>>>>>
>>>>>>>On March 18, 2003 at 16:37:35, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>>>1.  no interleaving, which means that the raw memory latency is stuck at
>>>>>>>>>>120+ns and stays there.  Faster bus means nothing without interleaving,
>>>>>>>>>>if latency is the problem.
>>>>>>>>>
>>>>>>>>>Uh, wait a minute, didn't you just write a condescending post to me about how
>>>>>>>>>increasing bandwidth improves latency? (Which I disagree with...) You can't have
>>>>>>>>>it both ways.
>>>>>>>>>
>>>>>>>>>Faster bus speed improves both latency and bandwidth. How can it not?
>>>>>>>>
>>>>>>>>It doesn't affect random latency whatsoever.  It does affect the time taken to
>>>>>>>>load a cache line.  Which does affect latency in a different way.  However,
>>>>>>>>interleaving does even better as even though it doesn't change latency either,
>>>>>>>>it will load a cache line even faster.
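
A rough sketch of the distinction being drawn here, assuming (as this side of the argument does) that the time to the first word is fixed by the memory itself and that only the remaining beats of the burst scale with bus speed. The function name, the 120 ns figure, the 4-beat burst and the two bus speeds are illustrative assumptions, not measurements.

```python
def line_fill_ns(first_word_ns, bus_mhz, beats=4):
    """Time to fetch a whole cache line under a simple model:
    one full memory access for the first word, then one bus cycle
    for each remaining beat of the burst."""
    bus_cycle_ns = 1000.0 / bus_mhz
    return first_word_ns + (beats - 1) * bus_cycle_ns


# Assumed figures: 120 ns to the first word, 4-beat burst.
slow_bus = line_fill_ns(120.0, 133.0)   # 133 MHz bus
fast_bus = line_fill_ns(120.0, 266.0)   # bus twice as fast

print(f"133 MHz: {slow_bus:.1f} ns, 266 MHz: {fast_bus:.1f} ns")
```

In this model doubling the bus speed shortens the line fill by only a few nanoseconds, because the fixed first-word term dominates. Whether that term really is independent of the bus is exactly what the thread is arguing about.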
>>>>>>>
>>>>>>>Are you kidding me? How can FSB speed _not_ affect latency?
>>>>>>
>>>>>>Very simple.  Latency is caused _in_ the memory system, only a tiny part of
>>>>>>latency is caused by the delay of shipping the data over the bus.  If you ran
>>>>>>the bus
>>>>>...
>>>>>>Run the test.  This discussion was held on r.g.c.p a while back.  And the _same_
>>>>>>results were found.  Memory has 120ns latency no matter _what_ memory you
>>>>>>use.  RDRAM is even slower in terms of latency.  If you can get your memory to
>>>>>>sub-100ns latency, you've done a miracle in modern electronics.
>>>>>
>>>>>I guess I'm sitting in front of one miraculous computer, then, because it can
>>>>>randomly access a word in 75ns. Just ran the test. (RDRAM, BTW.)
>>>>
>>>>Yes you are.  You have the fastest single CPU on the planet.  Notice that to
>>>>do this test, you have to access a byte, skip down 128 bytes and access another
>>>>and repeat this for a _long_ set of addresses.  If you _still_ get 75ns
>>>>you _do_ have the fastest PC latency ever reported by any serious tester.
>>>
>>>AMD thinks so too. The most accurate figure I've found is about 70 ns for the
>>>on-die memory controller that Clawhammer has. (I saw some claims of sub-40 ns,
>>>but I find that hard to believe.)
>>>
>>>>>If you have a 133MHz DIMM that's rated at 2-1-1-1, it can obviously access a
>>>>>word in 15ns.
>>>>
>>>>I don't believe 15ns for a second.  Just look at current specs for DRAM and
>>>>tell me how that is going to happen?  Again, look at any memory benchmarking
>>>>done on the internet by folks that do this for a living.  _nobody_ has reported
>>>>sub 100ns latency for any test I have seen, when talking about the PC.  Or
>>>>when talking about a sixty million dollar Cray.
>>>
>>>15 ns is believable. You must remember that ram is configured as rows and
>>>columns. The full 100-120 ns is the latency of opening a new row and reading.
>>>You and Tom seem to be talking about different things here. A completely random
>>>access is going to hit RAS and stall the full 100-120 ns. Reloading the column
>>>will only hit CAS and stall for 15 ns.
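
The row/column point can be put into a small model, sketched here under the assumption that every access is either a row hit (CAS only) or a row miss (RAS plus CAS). The 15 ns and 120 ns defaults are the figures quoted in this thread; the function name and the hit rates are made up for illustration.

```python
def average_latency_ns(row_hit_rate, row_hit_ns=15.0, row_miss_ns=120.0):
    """Expected access time when a fraction of accesses land in an
    already-open DRAM row and the rest must open a new row first."""
    if not 0.0 <= row_hit_rate <= 1.0:
        raise ValueError("row_hit_rate must be between 0 and 1")
    return row_hit_rate * row_hit_ns + (1.0 - row_hit_rate) * row_miss_ns


# Fully random access (no row hits) versus purely sequential access.
print(average_latency_ns(0.0))   # every access pays the row-open cost
print(average_latency_ns(1.0))   # every access is a column reload
print(average_latency_ns(0.5))   # an even mix of the two
```

Which of these numbers counts as "the" latency depends on the access pattern being measured, which seems to be the source of the disagreement.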
>>
>>The only memory latency that is interesting is "random access latency".
>>Anything else plays right into things like RDRAM and makes it look great, when
>>its random access latency is bad.  The DDR memory in my dual xeon is 150ns which
>>looks poor compared to the 130ns SDRAM in my much cheaper PIII laptop.
>
>Your "cheaper PIII laptop" doesn't need all the extra baggage that comes with
>dual processors.

Yes, but my quad 700 is _also_ at 130ns latency, and it also uses SDRAM.


>
>>>>> If the system gets that word in 75ns (ignoring RDRAM vs. DIMM
>>>>>latency for now) that means 20% of the latency is from the memory and 80% (not
>>>>>"a tiny part") is from "shipping the data over the bus" (and through the
>>>>>northbridge). Conventional wisdom says there's a 10ns wire/pin delay for a
>>>>>signal going into or out of a chip, so into northbridge + out of northbridge +
>>>>>into processor = 30ns. That means 30ns of processing is done on the northbridge
>>>>>and processor. That's why everybody is so worked up about Hammer's on-die memory
>>>>>controller--it reduces memory latency by, well, somewhere between 20 and 50ns,
>>>>>or roughly 50%.
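
The budget argued above can be written out as arithmetic. This is only a sketch of that reasoning using the figures in the quoted post (75 ns measured, a 133 MHz DIMM delivering its first word in 2 cycles, 10 ns per chip crossing, three crossings); the function and key names are made up, and the post itself sets aside the RDRAM versus DIMM difference.

```python
def latency_budget(total_ns=75.0, dimm_mhz=133.0, first_word_cycles=2,
                   pin_delay_ns=10.0, chip_crossings=3):
    """Split a measured access time into DRAM time, pin/wire delay,
    and whatever is left over for northbridge and CPU logic."""
    dram_ns = first_word_cycles * 1000.0 / dimm_mhz
    wire_ns = pin_delay_ns * chip_crossings
    logic_ns = total_ns - dram_ns - wire_ns
    return {
        "dram_ns": dram_ns,
        "wire_ns": wire_ns,
        "logic_ns": logic_ns,
        "dram_share": dram_ns / total_ns,
    }


budget = latency_budget()
for name, value in budget.items():
    print(f"{name}: {value:.2f}")
```

With these inputs the DRAM term comes out at about 15 ns, roughly a fifth of the total, and the remainder splits about evenly between pin delay and logic. The result is only as good as the 15 ns first-word assumption, which is disputed elsewhere in the thread.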
>>>>>
>>>>>End of today's lecture...
>>>>
>>>>Now to get some _real_ data before giving the _next_ lecture.  As I said,
>>>>access 1M bytes, with a 128 byte stride so cache-line pre-fetching won't
>>>>artificially bias the result downward.
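
The access pattern described above (1M bytes, 128-byte stride) can be sketched as follows. This only illustrates the pattern: in Python the per-access time is dominated by interpreter overhead, so the number it reports is not a hardware latency, and a real measurement would be done in C or with a tool such as lmbench. The function names are made up, and as noted later in the thread the buffer has to be much larger than the cache for the result to mean anything.

```python
import time


def stride_offsets(buffer_bytes=1 << 20, stride=128):
    """Byte offsets touched by the test: one access every `stride` bytes,
    so that no two accesses share a prefetched cache line."""
    return range(0, buffer_bytes, stride)


def touch(buffer, offsets):
    """Read one byte at each offset and return (checksum, ns per access).
    The timing includes Python loop overhead, so treat it as an upper
    bound on software cost, not as memory latency."""
    start = time.perf_counter_ns()
    total = 0
    for off in offsets:
        total += buffer[off]
    elapsed = time.perf_counter_ns() - start
    return total, elapsed / len(offsets)


offsets = stride_offsets()
checksum, ns_per_access = touch(bytearray(1 << 20), offsets)
print(len(offsets), "accesses,", f"{ns_per_access:.1f} ns each (interpreter included)")
```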
>>>>
>>>>I'll try to run this on a group of dual xeons here tomorrow, starting with my
>>>>2.8's and also trying the 3.06's.
>>>>
>>>>Several of us did this on R.G.C.P a few months back however, and 120+ ns
>>>>was the _best_ time reported when the test was run correctly.
>>>
>>>I got 133 ns as well. Aaron was running tests like crazy this morning on his
>>>nForce 2, and he reported times as low as 70 ns. I find that -very- impressive.
>>>Of course, that was with massive memory overclocking.
>>>
>>>-Matt
>>
>>Still it is the fastest time I have _ever_ seen for memory latency.  The X86
>>pipeline to memory is _so_ long.  Going thru the mmu, to L1, to L2 to the bus,
>>to the memory controller, to the chip, pulling that off in 70ns reliably is
>>_remarkable_.  Particularly when you consider that the latency for 60 million
>>dollar computers like the Cray T90 is over 100ns.  The infamous Cray-2, the
>>first machine with 32 gigabytes of RAM had a similar latency.  2.1ns clock, 50
>>clock cycle latency.
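
The clock arithmetic behind that last figure, as a small sketch. The 2.1 ns clock and 50-cycle latency are the numbers given in the post; the helper names are made up, and the 0.7 GHz value in the second example assumes that "quad 700" means 700 MHz processors.

```python
def cycles_to_ns(cycles, clock_period_ns):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles * clock_period_ns


def ns_to_cycles(latency_ns, clock_ghz):
    """Convert a latency in nanoseconds to CPU cycles at a given clock."""
    return latency_ns * clock_ghz


# Figures from the post: 50 cycles at a 2.1 ns clock.
cray_latency_ns = cycles_to_ns(50, 2.1)

# Assumption: a 130 ns access seen from a 0.7 GHz processor.
stall_cycles = ns_to_cycles(130, 0.7)

print(f"{cray_latency_ns:.0f} ns, {stall_cycles:.0f} cycles")
```

The first result lands just over 100 ns, in line with the "over 100ns" claim for the big machines.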
>
>I believe the MMU comes after L1 & L2 to reduce latency of cache accesses. I
>thought it was awkward at first (reloading cr3 therefore invalidates the entire
>cache), but it makes sense.

I was under the impression that the PC cached real addresses, rather than
virtual addresses, so that it is not necessary to dump the entire contents of
L1/L2 any time a context switch is done.  I'll go back and look at the hardware
refs again tonight to be sure.

Putting the MMU behind the cache is better for performance, when you are talking
about number-crunching.  Putting it in front of cache is better for a machine
that will be doing a lot of context-switching, such as a web server.


>
>>So 70 seems not only good, but _incredibly_ good.  lmbench is pretty good at
>>measuring this accurately, so long as you are sure you tell it to use way more
>>RAM than will fit into cache.
>>
>>I broke it on my quad 700 due to the 1MB L2, but when I had it use 128MB for the
>>memory tests, things dropped back to 130ns for that box as well, but my quad 700
>>used SDRAM which seems to be the best there is right now.
>
>Agreed, 70 ns is superb. AMD talks about Clawhammer's 80 ns latency being a big
>deal. I can't get a solid figure (only guesses), but the 70 ns seems accurate.
>When Opteron comes out, I plan to get one, and I'll see about testing memory
>latency at that point.
>
>-Matt


