Computer Chess Club Archives


Subject: Re: Another memory latency test

Author: Robert Hyatt

Date: 17:46:10 07/17/03

On July 17, 2003 at 18:53:55, Vincent Diepeveen wrote:

>On July 17, 2003 at 17:08:41, Robert Hyatt wrote:
>
>>On July 17, 2003 at 16:11:33, Gerd Isenberg wrote:
>>
>>>
>>>>Vincent: 256 ns
>>>>Note: random access is faster than before.
>>>>I get similar numbers for smaller sizes that are still bigger than the cache.
>>>>
>>>>
>>>>Summary: There is more than just an lmbench number. Actually, the
>>>>comment in the lmbench source suggests that they wanted to measure
>>>>the random access times.
>>>>
>>>>I don't want to argue about the definition of the "real" memory
>>>>latency. But for chess programs/hash, the Vincent-type number is
>>>>the most interesting.
>>>>
>>>>Regards,
>>>>Dieter
>>>
>>>very interesting Dieter,
>>>
>>>your algorithm roughly confirms Vincent's results!
>>>I will try it later at home and learn from your source code ;-)
>>>But there is still the question whether these measured times are memory
>>>latency by definition - I guess not.
>>
>>The classic definition of memory latency is the amount of time needed to
>>do a random read to any specific memory address.  If you blow the TLB, then
>
>Now that amazes me. All the memory latencies that manufacturers claim are
>for sequential lookups. In fact, the R14000 is clearly optimized for
>Fortran loops.

That is nuts.  Fortran loops are no different than C loops, or loops in any
other programming language from Ada to PL/1.


>
>>you just added another one or two memory accesses which means you are now
>>doing two or three accesses rather than one.
>>
>>Calling this "memory latency" is wrong.
>
>What are you murmuring this time?
>
>The proof Dieter clearly shows is that the further away the memory accesses are
>from each other, EVEN WHEN SEQUENTIAL, the more the times add up.
>
>If you first access address 0 and then address 200, that's way faster than
>accessing address 0 and then 2000, which again is way faster than address 0
>and then 20000.

This depends.  It depends on multiple things:

1.  Once you step 4096 bytes away from the previous address, you are now
probably accessing a _new_ virtual page, whose translation most likely won't
be in the TLB, although one or both levels of the page translation might
still be in cache.  So you may incur anywhere from zero to four extra memory
accesses depending on how lucky (or unlucky) you are.  Your data may be in
cache, or not.  The virtual-to-real translation may be in the TLB, or not.
Page table entries may be in cache, or not.  And if the answer to all of
those is "not", then you pay a big penalty for doing two or three memory
reads to get that data.

2.  When you read a value from DRAM, a following read that falls in the same
row (DRAM page) is marginally faster.  Not _hugely_ faster, however.
Clearly, once you step outside that row you lose this speed advantage, but
that happens in one "step"; it doesn't gradually get worse as you stride
farther.  (The small test program below illustrates both effects.)
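
To make both effects concrete, here is the kind of test program I mean.  It
is only a sketch of my own, not Dieter's code: the buffer size, strides and
iteration count are arbitrary, and note that at very large strides the set
of distinct addresses shrinks until it fits back into cache, so read those
numbers with care.

  /* stride.c - rough sketch of a strided-access timing loop */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>

  #define BUFSIZE (64 * 1024 * 1024)   /* 64 MB, well past any cache */
  #define ITERS   (10 * 1000 * 1000)

  static double usec(void) {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1e6 + tv.tv_usec;
  }

  int main(void) {
    volatile char *buf = malloc(BUFSIZE);
    long stride, i, p, sum = 0;
    double t0;
    for (i = 0; i < BUFSIZE; i++) buf[i] = (char) i;  /* touch every page */
    for (stride = 64; stride <= 8192; stride *= 2) {
      t0 = usec();
      for (i = 0, p = 0; i < ITERS; i++) {
        sum += buf[p];                 /* one read per iteration */
        p += stride;
        if (p >= BUFSIZE) p -= BUFSIZE;
      }
      printf("stride %5ld: %.1f ns/access\n",
             stride, (usec() - t0) * 1000.0 / ITERS);
    }
    return (int) sum;                  /* keep the reads live */
  }

Once the stride reaches the 4096-byte page size, every access lands on a new
page, and the TLB misses show up as one visible jump, exactly the "step"
described above.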



>
>That proof has been delivered beyond any doubt.
>
>>I've said this to him before.  Again, the terminology is pretty specific
>>everywhere I see it.  If you want to define latency to include the MMU
>>overhead, then the latency becomes variable, depending on the page table,
>>segment table and TLB accesses.
>
>So basically you are not interested in how fast you can get a random cache
>line from memory; you are more interested in what the RAM can do under
>laboratory conditions, not mounted in a computer but in a special RAM
>machine that measures latency 1 mm away from the RAM, with some kind of
>connection as thick as a leg to a single DIMM.

I have no idea what you are talking about.  I am interested in answering the
question "for the typical memory reference as done in my program, how long does
it take to get the data into a register?"  That has little to do with your
worst-case blowing the TLB out.  That is a _known_ problem with any hardware
that does virtual addressing and uses memory-resident page/segment tables to
accomplish it.  There were machines in the past that did this with no penalty,
but they put _all_ of the page translations into registers.
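
For reference, the usual way to measure that number is a dependent pointer
chase: every load supplies the address of the next one, so nothing can be
prefetched or overlapped.  This is a sketch of my own, not lmbench's code;
over a buffer this far past the cache and the TLB it produces the
Vincent-type number, and with large pages it would produce mine.

  /* chase.c - rough sketch of a dependent-load latency measurement */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>

  #define N     (8 * 1024 * 1024)    /* 8M pointers; shrink to stay in TLB */
  #define ITERS (20 * 1000 * 1000)

  int main(void) {
    void **cell = malloc(N * sizeof *cell);
    void **p, *tmp;
    long i, j;
    struct timeval t0, t1;
    /* build one random cycle through all N cells (Sattolo's shuffle),
       so the chain visits every cell before repeating */
    for (i = 0; i < N; i++) cell[i] = &cell[i];
    srand(12345);
    for (i = N - 1; i > 0; i--) {
      j = rand() % i;                /* assumes a large RAND_MAX (glibc) */
      tmp = cell[i]; cell[i] = cell[j]; cell[j] = tmp;
    }
    p = (void **) cell[0];
    gettimeofday(&t0, 0);
    for (i = 0; i < ITERS; i++)
      p = (void **) *p;              /* each load waits for the previous */
    gettimeofday(&t1, 0);
    printf("%.1f ns per dependent load (%p)\n",
           ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_usec - t0.tv_usec) * 1e3) / ITERS, (void *) p);
    return 0;
  }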

You are beating on the worst-case scenario for hashing, which is really less
than .1% of a chess engine's total CPU time.  Whether the TLB works for hashing
or not is really not critical, because hashing is a tiny part of the total
memory bandwidth.


>
>>>
>>>In the worst cases there is more than memory latency (additional TLB latency
>>>and some RAM hardware interface latencies) to get data into a register -
>>>maybe a question of definition.
>>
>>The TLB problem can easily exceed the cost of reading the data.  In fact,
>>the PC has a three-level memory map that most systems don't use.  But to do
>>so adds yet another 120-150ns to latency because you are fetching yet another
>>memory word.
>
>the only thing that matters, Bob, is how quickly we can get a hash table
>entry from a random place.
>
>And that's around 400ns on a dual.

Assuming (1) TLB overwrites;  (2) a small page size rather than the 4M pages
that the X86 can use;  (3) hashing is _all_ you are doing, or it is the
dominant part of what you are doing.

I don't typically run with hash sizes in the 200+ MB range.  I don't need them.
With 4M page sizes, my time to get a hash table entry is _still_ about 150ns on
my machine, as I won't blow the TLB.
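
The arithmetic behind that is simple, as this back-of-envelope sketch shows.
The 64-entry data TLB here is an assumed typical x86 figure, not a measured
one:

  /* tlbcover.c - back-of-envelope TLB coverage at 4KB vs 4MB pages */
  #include <stdio.h>

  int main(void) {
    long tlb = 64;                          /* assumed data-TLB entries */
    long mb[] = { 32, 64, 128, 256, 512 };  /* hash sizes in MB */
    int i;
    for (i = 0; i < 5; i++) {
      long p4k = mb[i] * 1024L * 1024L / 4096;
      long p4m = (mb[i] + 3) / 4;           /* 4 MB pages, rounded up */
      printf("%4ld MB hash: %6ld 4KB pages (%s), %4ld 4MB pages (%s)\n",
             mb[i], p4k, p4k <= tlb ? "fits TLB" : "blows TLB",
             p4m, p4m <= tlb ? "fits TLB" : "blows TLB");
    }
    return 0;
  }

With 4KB pages even a modest hash blows a 64-entry TLB (64 MB is already
16,384 pages); with 4MB pages the same table is a few dozen entries at most.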



>
>>>
>>>Another interesting point is to measure not only the average but also the
>>>maximum and minimum access times (processor performance counters?). Are the
>>>accesses about equal, or are there heavy spikes due to some chaotic TLB
>>>behaviour?
>>>
>>>Cheers,
>>>Gerd


