Computer Chess Club Archives



Subject: Re: Another memory latency test

Author: Gerd Isenberg

Date: 03:14:01 07/18/03


On July 17, 2003 at 16:51:28, Dieter Buerssner wrote:

>On July 17, 2003 at 16:11:33, Gerd Isenberg wrote:
>
>>your algorithm confirms roughly Vincent's results!
>
>Yes, I agree. Actually my code was inspired by the issues you raised about
>Vincent's test (and the strange things we discussed around omnid_abs with random
>loop testing).


Yes, Vincent's loop body is probably too huge, due to the mod 64 and the many
(Athlon) vector-path instructions. Even if the measurement loop without the
access has some pipeline "unrolling", the access loop may have some latency
hiding.


>
>>But there is still the question whether this measured time is memory latency
>>by definition - I guess not.
>
>Perhaps. I could not quickly find a good definition in the lmbench source, nor
>in the papers about it (I only skimmed over them).
>
>>In worst cases there is more than memory latency (additional TLB-latency and
>>some RAM hardware interface latencies) to get data into a register - maybe a
>>question of definition.
>
>Even when this is the case, those "real" latencies sound pretty uninteresting
>from a programmer's (and this may imply a user's) point of view. They may give
>numbers that are not particularly useful for "our" environment. After all, we
>are not a group of hardware designers or OS writers.
>

Yes.

If I understand right, there are only 62 entries in the TLB that map virtual
addresses to physical page addresses. The page size is only 4 KByte in most OSs,
instead of the possible 4 MB. Does a roughly 1000 times greater page size have
other penalties, or why isn't 4 MB the default page size?


>
>>Another interesting point is to measure not only the average but the maximum and
>>minimum access times (processors performance counter?). Are the accesses  about
>>equal, or are there heavy spikes due to some chaotic TLB behaviour?
>
>Yes. Perhaps I try it (or even you might want to try it:-).


Maybe I will play around with performance counters some time, to get some
impressions.


>Performance counters
>on x86 are not without pitfalls. Read for example
>http://cedar.intel.com/software/idap/media/pdf/rdtscpm1.pdf


Thanks for the link.

>
>I would guess that the access times are, in the big majority, about equal. Of
>course the random access should yield spikes (towards zero - so actually
>"anti-spikes") now and then, because now and then a small offset to the next
>memory access will (and should) happen, and the data can be fetched out of
>cache.
>
>Many years ago, I knew go32 (one DOS extender, available in source code) quite
>well. I fear my memory has faded. But one might have been able to set it up with
>no virtual memory, page tables, etc. Perhaps that was the case for another
>alternative to go32 (pmode?). Perhaps, I am totally wrong ...
>
>Cheers,
>Dieter


What I read about Athlon/Opteron may also be interesting in this latency
discussion:

------------------------------------------------------------------------------
AMD Athlon™ Processor x86 Code Optimization Guide

Page 80

Hardware Prefetch

Some AMD Athlon processors implement a hardware prefetch
mechanism. This feature was implemented beginning with
AMD Athlon processor Model 6. The data is loaded into the L2
cache. The hardware prefetcher works most efficiently when
data is accessed on a cache-line-by-cache-line basis (that is,
without skipping cache lines). Cache lines on current AMD
Athlon processors are 64 bytes, but cache line size is
implementation dependent.
In some cases, using the PREFETCH or PREFETCHW
instruction on processors with hardware prefetch may incur a
reduction in performance. In these cases, the PREFETCH
instruction may need to be removed. The engineer needs to
weigh the measured gains obtained on non-hardware prefetch
enabled processors by using the PREFETCH instruction, versus
any loss in performance on processors with the hardware
prefetcher.

------------------------------------------------------------------------------
Software Optimization
Guide for AMD Athlon™ 64
and AMD Opteron™ Processors

Page 128

Hardware Prefetching

The AMD Athlon 64 and AMD Opteron processors implement a hardware prefetching
mechanism. The prefetched data is loaded into the L2 cache. The hardware
prefetcher works most efficiently when data is accessed on a
cache-line-by-cache-line basis (that is, without skipping cache lines). Cache
lines on current AMD Athlon 64 and AMD Opteron processors are 64 bytes, but
cache-line size is implementation dependent.
In some cases, using prefetch instructions on processors with hardware
prefetching may slightly reduce performance. In these cases, the prefetch
instructions may need to be removed. Weigh the measured gains obtained on
non-hardware-prefetch-enabled processors using the software prefetch instruction
against any loss in performance on processors with the hardware prefetcher. All
of the current AMD Athlon 64 and AMD Opteron processors have hardware
prefetching mechanisms.
The hardware prefetcher prefetches data that is accessed in an ascending order
on a cache-line-by-cache-line basis. When the hardware prefetcher detects an
access to cache line l, and then an access to cache line l + 1, it initiates a
prefetch of cache line l + 3. Accessing data in increments larger than
64 bytes may fail to trigger the hardware prefetcher because cache lines are
skipped. In these cases, software-prefetch instructions should be employed. The
hardware prefetcher also is not triggered when code accesses memory in a
descending order.

--
Gerd

