Computer Chess Club Archives




Subject: Re: Another memory latency test

Author: Robert Hyatt

Date: 22:18:44 07/18/03


On July 18, 2003 at 06:14:01, Gerd Isenberg wrote:

>On July 17, 2003 at 16:51:28, Dieter Buerssner wrote:
>>On July 17, 2003 at 16:11:33, Gerd Isenberg wrote:
>>>your algorithm confirms roughly Vincent's results!
>>Yes, I agree. Actually my code was inspired by the issues you raised about
>>Vincent's test (and the strange things we discussed around omnid_abs with random
>>loop testing).
>Yes, Vincent's loop body is probably too huge due to the mod 64 and the lot of
>(Athlon) vector-path instructions. Even if the measuring loop without the access
>has some pipeline "unrolling", the access loop may have some latency hiding.
>>>But there is still the question whether these measured times are memory
>>>latency by definition - I guess not.
>>Perhaps. I could not quickly find a good definition in the lmbench source, nor
>>in the papers about it (but I only read over them very fast).
>>>In worst cases there is more than memory latency (additional TLB-latency and
>>>some RAM hardware interface latencies) to get data into a register - maybe a
>>>question of definition.
>>Even when this is the case, those "real" latencies sound pretty uninteresting
>>from a programmer's (and this may imply a user's) point of view. They may give
>>numbers that are not particularly useful for "our" environment. After all, we
>>are not a group of hardware designers or OS writers.
>If I understand right, there are only 62 entries in the TLB that map virtual
>addresses to physical page addresses. The page size is only 4 KByte in most OSes
>instead of the possible 4 MB. Does a page size a factor of 1000 greater have
>other penalties, or why isn't 4 MB the default page size?

Take that 62 with a grain of salt.  The way you derive this number is
obvious.  You touch a page, then, using a stride of the page length, you
step through N pages, then back up and repeat.  Compute the effective
access time.  If you start off at (say) 8 pages, clearly the virtual-to-real
translations end up in the TLB and you will see the normal memory access
time, equal to the raw memory latency of say 125-150ns.  Keep increasing
the range of pages you address.  Eventually the access time will take a
huge jump when you run out of TLB entries.  I.e. so long as you access 62
pages on my xeon, you will stick in the TLB and see fast memory speeds.
When you cycle over 63 or more pages, by the time you cycle around to the
first page its virtual-to-real translation is gone from the TLB and every
page reference takes the hit.

I don't believe 62 entries myself.  That's a strange number.  I'd suspect
64 or 128 at least, with 256 more likely.  But lm-bench estimated 62 although
I didn't study this much.  I've not looked to see what Intel claims.

>>>Another interesting point is to measure not only the average but the maximum and
>>>minimum access times (processors performance counter?). Are the accesses  about
>>>equal, or are there heavy spikes due to some chaotic TLB behaviour?
>>Yes. Perhaps I try it (or even you might want to try it:-).
>Maybe I will play around with performance counters some time, to get some
>>Performance counters
>>on x86 are not without pitfalls. Read for example
>Thanks for the link.
>>I would guess that the access times are, in the big majority, about equal. Of
>>course the random access should yield spikes (towards zero - so actually
>>"anti-spikes") now and then, because now and then a small offset to the next
>>memory access will (and should) happen, and the data can be fetched out of
>>the cache.
>>Many years ago, I knew go32 (one DOS extender, available in source code) quite
>>well. I fear my memory has faded away. But one might have been able to set it
>>up with no virtual memory, page tables, etc. Perhaps that was the case for
>>another alternative to go32 (pmode?). Perhaps I am totally wrong ...
>What I read about Athlon/Opteron may also be interesting in this latency context:
>AMD Athlon™ Processor x86 Code Optimization Guide
>Page 80
>Hardware Prefetch
>Some AMD Athlon processors implement a hardware prefetch
>mechanism. This feature was implemented beginning with
>AMD Athlon processor Model 6. The data is loaded into the L2
>cache. The hardware prefetcher works most efficiently when
>data is accessed on a cache-line-by-cache-line basis (that is,
>without skipping cache lines). Cache lines on current AMD
>Athlon processors are 64 bytes, but cache line size is
>implementation dependent.
>In some cases, using the PREFETCH or PREFETCHW
>instruction on processors with hardware prefetch may incur a
>reduction in performance. In these cases, the PREFETCH
>instruction may need to be removed. The engineer needs to
>weigh the measured gains obtained on non-hardware prefetch
>enabled processors by using the PREFETCH instruction, versus
>any loss in performance on processors with the hardware prefetcher.
>Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors
>Page 128
>Hardware Prefetching
>The AMD Athlon 64 and AMD Opteron processors implement a hardware prefetching
>mechanism. The prefetched data is loaded into the L2 cache. The hardware
>prefetcher works most efficiently when data is accessed on a
>cache-line-by-cache-line basis (that is, without skipping cache lines). Cache
>lines on current AMD Athlon 64 and AMD Opteron processors are 64 bytes, but
>cache-line size is implementation dependent.
>In some cases, using prefetch instructions on processors with hardware
>prefetching may slightly reduce performance. In these cases, the prefetch
>instructions may need to be removed. Weigh the measured gains obtained on
>non-hardware-prefetch-enabled processors using the software prefetch instruction
>against any loss in performance on processors with the hardware prefetcher. All
>of the current AMD Athlon 64 and AMD Opteron processors have hardware
>prefetching mechanisms.
>The hardware prefetcher prefetches data that is accessed in an ascending order
>on a cache-line-by-cache-line basis. When the hardware prefetcher detects an
>access to cache line l, and then an access to cache line l + 1, it initiates a
>prefetch of cache line l + 3. Accessing data in increments larger than
>64 bytes may fail to trigger the hardware prefetcher because cache lines are
>skipped. In these cases, software-prefetch instructions should be employed. The
>hardware prefetcher also is not triggered when code accesses memory in a
>descending order.


Last modified: Thu, 07 Jul 11 08:48:38 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.