Computer Chess Club Archives



Subject: Re: random access latency opteron versus k7

Author: Robert Hyatt

Date: 19:06:53 05/30/04



On May 30, 2004 at 17:37:41, Dieter Buerssner wrote:

>On May 30, 2004 at 16:15:54, Robert Hyatt wrote:
>
>>I have no idea what your program above does [, and really don't care.]
>
>You could have an idea. I showed the source here, and we discussed it.
>http://chessprogramming.org/cccsearch/ccc.php?find_thread=306858
>
>It tries to answer the question: How long do I need, on average, to access a
>random word in memory - from the programmer's point of view. I have a large
>array of words, and need to know the value at one random index (a situation very
>comparable to hashing in chess). The program does not care about how many TLB
>reads are needed - just the time until it will have the value (say in a
>register). It is the time of one move instruction:
>
>movl (%eax), %eax
>
>or in Intel syntax
>
>mov eax, DWORD PTR [eax]
>
>where, before the instruction, eax points to some random (valid) word,
>correctly aligned for a pointer.
>
>Regards,
>Dieter


The test is not very good for > 32 bit addressing.  I.e., the Opteron has a 48
bit virtual address space: 12 bits for the page offset and 36 bits for the
virtual page number, broken into four 9 bit indices, one per level of page
table.  If the test touches only a small part of that address space, you get by
with one or more of the map tables stuck in L2 cache, which cuts the effective
access time by one, two or three memory latencies.  I.e., if you address 2^21
bytes or less, you only need to access memory once: the map tables (or the 64
bytes from each of the first three that are useful) will end up in L1/L2 cache.
The fourth table only has 2^9 entries of 8 bytes each, or 2^12 bytes, which
will end up in cache as well.  But go beyond 2M bytes and performance starts to
decrease, as there will be multiple 4th-level page tables and they probably
won't all sit in cache.  That adds one memory latency.  Go beyond 2^30 bytes
(1 gig) and now the bottom two levels of tables will be going to memory all the
time, although the upper two will still be in cache, adding another latency
(two now, plus the latency to actually read memory).  I don't know if his test
went beyond 1 gig, but the numbers suggest not.  That means that even if it
blows out the 512 data TLB entries (there are 512 instruction TLB entries as
well), most of the TLB misses will be handled by page table lookups that hit in
L2, because the program is doing nothing but looping over memory and not
bringing in other stuff to overwrite L2 cache entries.
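
To make the address arithmetic above concrete, here is a minimal sketch in C,
assuming 4 KB pages and the four 9 bit indices described above.  The names
table_index and table_coverage are made up for this illustration and do not
come from any real kernel or library.  Level 0 is the top table, level 3 is
the "4th-level" table that maps individual pages.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 12  /* 4 KB pages (assumption) */
#define INDEX_BITS        9  /* 512 entries per table */
#define LEVELS            4  /* four levels of page table */

/* Index used in the table at `level` (0 = top table, 3 = last-level table). */
static unsigned table_index(uint64_t vaddr, int level)
{
    assert(level >= 0 && level < LEVELS);
    int shift = PAGE_OFFSET_BITS + INDEX_BITS * (LEVELS - 1 - level);
    return (unsigned)((vaddr >> shift) & ((1u << INDEX_BITS) - 1));
}

/* Bytes of virtual address space mapped by ONE table at `level`. */
static uint64_t table_coverage(int level)
{
    assert(level >= 0 && level < LEVELS);
    return (uint64_t)1 << (PAGE_OFFSET_BITS + INDEX_BITS * (LEVELS - level));
}
```

Under these assumptions one last-level table covers 2^21 bytes and one
third-level table covers 2^30 bytes, which is where the 2M and 1 gig
thresholds above come from.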

In short, it is not very effective as a test unless it beats on 2 gigs or more
of RAM and does something besides just loop over random addresses doing reads,
as that loop lets the L2 cache effectively stand in for the TLB...
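
For reference, the kind of random-read loop being discussed can be sketched
roughly as below.  This is not Dieter's program, only an illustration: the
names xorshift64, build_cycle, chase and ns_per_access are invented here, the
table is linked into one random cycle with Sattolo's algorithm, and the timing
uses clock(), which is coarse.  It does exactly the "just loop over random
addresses doing reads" that the objection above is about, so it shows the
technique, not a fix for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Small xorshift64 generator; rand() may have too short a range for big tables. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Fill next[0..n-1] with a random permutation that is one single cycle
   (Sattolo's algorithm), so a chase visits every slot before repeating.
   The modulo introduces a slight bias, which is ignored in this sketch. */
static void build_cycle(size_t *next, size_t n, uint64_t seed)
{
    assert(n >= 2 && seed != 0);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64(&seed) % i);   /* 0 <= j < i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain: each load's address depends on the previous load. */
static size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    while (steps--)
        p = next[p];
    return p;
}

/* Average nanoseconds of CPU time per dependent load, starting at slot 0. */
static double ns_per_access(const size_t *next, size_t steps)
{
    assert(steps > 0);
    clock_t t0 = clock();
    volatile size_t sink = chase(next, 0, steps);  /* keep the loop alive */
    clock_t t1 = clock();
    (void)sink;
    return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)steps;
}
```

To get anywhere near the regime described above, the table would have to be
allocated well beyond 1 gig (n * sizeof(size_t) bytes) and the loop would have
to be mixed with other work that competes for L2.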





Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.