Author: Robert Hyatt
Date: 08:54:00 05/31/04
On May 30, 2004 at 22:06:53, Robert Hyatt wrote:

>On May 30, 2004 at 17:37:41, Dieter Buerssner wrote:
>
>>On May 30, 2004 at 16:15:54, Robert Hyatt wrote:
>>
>>>I have no idea what your program above does [, and really don't care.]
>>
>>You could have an idea. I showed the source here, and we discussed it.
>>http://chessprogramming.org/cccsearch/ccc.php?find_thread=306858
>>
>>It tries to answer the question: how long do I need, on average, to access a
>>random word in memory - from the programmer's point of view. I have a large
>>array of words, and need to know the value at one random index (a situation
>>very comparable to hashing in chess). The program does not care about how
>>many TLB reads are needed - just the time until it has the value (say in a
>>register). It is the time of one move instruction
>>
>>movl (%eax), %eax
>>
>>or in Intel syntax
>>
>>mov eax, DWORD PTR [eax]
>>
>>where, before the instruction, eax points to some (valid) word (correctly
>>aligned for a pointer), randomly.
>>
>>Regards,
>>Dieter
>
>The test is not very good for > 32 bit addressing. IE the opteron has a 48 bit
>address space: 12 bits for page offset, 36 bits for virtual page number,
>broken into four 9 bit indices. If you address less than the full 48-bit
>space, you get by with one or more of the map tables stuck in L2 cache,
>cutting the effective access time by one, two or three latency cycles. IE if
>you address 2^21 bytes or less, you only need to access memory once; the map
>tables (or the 64 bytes from the first three that are useful) will end up in
>L1/L2 cache. The fourth table only has 2^9 entries, or 2^12 bytes, which will
>end up in cache as well. But go beyond 2M bytes and performance starts to
>decrease, as there will be multiple 4th-level page tables and they probably
>won't all sit in cache. That adds 1 memory latency cycle.
>Go beyond 2^30 bytes (1 gig) and now the bottom two tables will be hit on all
>the time, although the upper 2 will still be in cache, adding another latency
>delay (two now, plus the latency to actually read memory). I don't know if
>his test went beyond 1 gig, but the numbers suggest not. Which means that
>even if it blows out the 512 TLB data entries (there are 512 instruction TLB
>entries as well), most of the missing TLB data will be handled by page table
>lookups that are in L2, because the program is doing nothing but looping over
>memory and not bringing in other stuff to overwrite L2 cache entries.
>
>In short, it is not very effective as a test unless it beats on 2 gigs or
>more of RAM, and it does something besides just loop over random addresses
>doing reads, as that lets the L2 cache replace the TLB effectively...

I found a couple of runs I had made on the opteron myself. I tested on two
machines: one with 8 gigs of RAM (the 4 x 2.2ghz box I used in CCT6) and one
with 64 gigs of RAM (4 x 1.8ghz that I did some testing on prior to CCT6).

For random probes to a table of N bytes, where N was small, the access time
was right on AMD's quoted 65ns, assuming the table was local to the CPU doing
the accessing. If not, it could double (two cpus are 1 hop away) or triple
(one cpu is two hops away).

If you only access 2^9 pages, you have one page table that is full; the other
three levels have only one pointer each, and they will stick in cache. But
then so will the 4th page table, as it contains 2^9 entries of 8 bytes each,
or 2^12 bytes - it fits into L1/L2 easily. Of course there is no way to thrash
the 512 + 512 TLB entries, so this is moot. Up to 2^18 pages requires the full
usage of the last two levels of page tables, 2^18 entries or 2^21 bytes.
That blows both the TLB and L2, and what will likely happen is that the last
page table will thrash in cache (the TLB is hopeless already) and every access
is going to take 2 memory cycles: the first two page tables have one pointer
each and stick in cache; the third page table is used on every access, so it
will likely stick in L2 (it is 4kb, remember); only the 4th-level page tables
thrash around (there are 2^9 of them). This should produce an access time of
around 130ns... It did for me.

My testing went on to a bigger table. The next stop after the above (the above
represents a 1 gb table) was 64 gigs on the bigger box I tested. This now uses
64 entries in the second page table (512 bytes, or 8 cache lines), and this
begins to thrash the third-level table and even a bit of the second-level
table. The access times I saw here had some variability, from 195ns to 230ns.
I'm pretty sure that with more memory it would settle in closer to 260, since
only the first page table would have one entry; the rest would thrash, and
even the first would thrash a bit.

To reach max access time would require some ugly paging at the moment, as the
boxes I used had a processor limit of 2^48 for virtual addresses, 2^40 for
physical RAM. To really thrash the page tables in cache would require
addressing 2^48 bytes, but that is beyond physical ram and would toss the disk
drives into the equation.

In any case, for larger-memory opterons, access time is _not_ 130ns. For small
memory, yes. (The 8-way box I tested had 32 gigs, as a reference.) Of course
someone could test this on a 2M L2 xeon and get better access times as well,
as that would hold big chunks of the page tables when the TLB gets thrashed.
But to say that the opteron is 2.7x faster than a PC is just wrong, unless you
want to say that for a small-memory opteron (say 1 GB) it is 2.7x faster.
Wander beyond that 1gb limit, though, and 130 changes quickly to 195 and
beyond...
Again, all this is from actual testing on the three machines I mentioned. All had the same 40/48 bit physical/virtual address limit...