Author: Robert Hyatt
Date: 19:06:53 05/30/04
On May 30, 2004 at 17:37:41, Dieter Buerssner wrote:

>On May 30, 2004 at 16:15:54, Robert Hyatt wrote:
>
>>I have no idea what your program above does [, and really don't care.]
>
>You could have an idea. I showed the source here, and we discussed it.
>http://chessprogramming.org/cccsearch/ccc.php?find_thread=306858
>
>It tries to answer the question: How long do I need on average, to access a
>random word in memory - from a programmer's point of view. I have a large array of
>words, and need to know the value at one random index (a situation very
>comparable to hashing in chess). The program does not care about how many TLB
>reads are needed - just the time until it will have the value (say in a
>register). It is the time of one move instruction
>
>movl (%eax), %eax
>
>or in Intel syntax
>
>mov eax, DWORD PTR [eax]
>
>where, before the instruction eax points to some (valid) word (correctly aligned
>for a pointer), randomly.
>
>Regards,
>Dieter

The test is not very good for > 32-bit addressing. I.e., the Opteron has a 48-bit virtual address space: 12 bits of page offset and 36 bits of virtual page number, broken into four 9-bit indices, one per level of map table. If you address less than the full 48 bits, you get by with having one or more of the map tables stuck in L2 cache, which cuts the effective access time by one, two or three memory latencies. I.e., if you address 2^21 bytes or less, you only need to access memory once: the map tables (or the 64 bytes from each of the first three that are useful) will end up in L1/L2 cache. The fourth table only has 2^9 entries of 8 bytes each, or 2^12 bytes, which will end up in cache as well. But go beyond 2M bytes and performance starts to drop, as there will be multiple fourth-level page tables and they probably won't all sit in cache. That adds one memory latency. Go beyond 2^30 bytes (1 gig) and now the bottom two levels of tables will have to be fetched from memory all the time, although the upper two will still be in cache, adding another latency (two now, plus the latency to actually read memory).
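To make the 2^21 and 2^30 thresholds concrete, here is a minimal sketch in C of how a 48-bit virtual address splits into the four 9-bit indices and the 12-bit offset. The names `va_parts`, `split_va` and `shared_levels` are made up for illustration, and the sketch assumes 4 KB pages and ignores the sign-extended upper half of the address space.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper types: a 48-bit virtual address seen as
   four 9-bit table indices plus a 12-bit page offset (4 KB pages assumed). */
struct va_parts {
    unsigned idx[4];   /* idx[0] = top-level table ... idx[3] = fourth (last) table */
    unsigned offset;   /* byte offset inside the 4 KB page */
};

static struct va_parts split_va(uint64_t va)
{
    struct va_parts p;
    assert(va < (UINT64_C(1) << 48));          /* only the low 48 bits are modelled */
    p.offset = (unsigned)(va & 0xFFF);         /* low 12 bits */
    for (int level = 0; level < 4; level++) {
        int shift = 12 + 9 * (3 - level);      /* 39, 30, 21, 12 */
        p.idx[level] = (unsigned)((va >> shift) & 0x1FF);
    }
    return p;
}

/* Number of leading walk levels (0..4) at which two addresses
   select the same table entry, counted from the top-level table down. */
static int shared_levels(uint64_t a, uint64_t b)
{
    struct va_parts pa = split_va(a), pb = split_va(b);
    int n = 0;
    while (n < 4 && pa.idx[n] == pb.idx[n])
        n++;
    return n;
}
```

With this, `shared_levels(0, (1 << 21) - 1)` is 3: everything inside one aligned 2 MB region uses the same entry in the first three tables, so only the fourth-level index varies. Crossing 2^21 drops that to 2, and crossing 2^30 drops it to 1.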
I don't know if his test went beyond 1 gig, but the numbers suggest not. That means that even if it blows out the 512 data TLB entries (there are 512 instruction TLB entries as well), most of the TLB misses will be handled by page table lookups that hit in L2, because the program is doing nothing but looping over memory and not bringing in other stuff to overwrite L2 cache entries. In short, it is not very effective as a test unless it beats on 2 gigs or more of RAM and does something besides just loop over random addresses doing reads, as that lets the L2 cache effectively stand in for the TLB...
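As a rough way to see when the map tables stop fitting in L2, the sketch below estimates how many bytes of page table each level needs for a given working set. It is only an estimate under stated assumptions: 4 KB pages, 512 entries per table, 8-byte entries (as in x86-64 long mode), and a contiguous, table-aligned mapping. The function name `table_bytes` and the level numbering are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES   UINT64_C(4096)  /* assumed 4 KB pages */
#define ENTRIES      UINT64_C(512)   /* 2^9 entries per table */
#define ENTRY_BYTES  UINT64_C(8)     /* assumed 8-byte table entries */

/* Bytes of page-table data needed at one level to map `bytes` of
   contiguous memory. level 3 = fourth (last) table, level 0 = top table. */
static uint64_t table_bytes(uint64_t bytes, int level)
{
    assert(level >= 0 && level <= 3);
    assert(bytes <= (UINT64_C(1) << 48));

    /* Memory covered by a single table at this level:
       2 MB at level 3, 1 GB at level 2, and so on. */
    uint64_t span = PAGE_BYTES * ENTRIES;
    for (int l = 3; l > level; l--)
        span *= ENTRIES;

    uint64_t tables = (bytes + span - 1) / span;  /* round up */
    if (tables == 0)
        tables = 1;                               /* at least one table per level */
    return tables * ENTRIES * ENTRY_BYTES;
}
```

Under these assumptions a 2 MB working set needs a single 4 KB fourth-level table, a 1 gig working set needs 2 MB of fourth-level tables, and a 2 gig working set needs 4 MB of them plus 8 KB at the level above. Comparing those figures with the L2 size of the machine under test gives a feel for how much of the table walk can be served from cache.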
Last modified: Thu, 15 Apr 21 08:11:13 -0700