Computer Chess Club Archives



Subject: Re: Another memory latency test

Author: Gerd Isenberg

Date: 14:57:31 07/18/03



>>>4M pages solves it for at least 250mb worth of RAM.  But then again, _no_ chess
>>>program depends on purely random memory accesses to blow out the TLB.  The only
>>>truly random accesses I do are the regular hashing and pawn hashing, which
>>>both total to significantly less than the total nodes I search.  Which means
>>>the TLB penalty is not even 1% of my total run time.  Probably closer to
>>>.01% - .05%.
>>>
>>>I ignore that.
>>
>>Why do you think it is that low? I get ~20-30% of nodes have hash probes with
>>crafty. If you are getting 1m nodes/sec, then this is a probe every 3-5 usec.
>>With a very large hash table and 4K pages, the large majority of these will
>>cause a TLB miss. At 200 nsec each (a guess), this could be up to 5% of your
>>total run time.
>>
>>[snip]
>
>Also from "AMD Athlon™ Processor x86 Code Optimization Guide"
>
>"The L1 data cache has an associated two-level TLB structure.
>The first-level TLB is fully associative and contains 32 entries
>(24 that map 4-Kbyte pages and eight that map 2-Mbyte or
>4-Mbyte pages). The second-level TLB is four-way set
>associative and contains 256 entries, which can map 4-Kbyte
>pages."
>
>So I don't think that "4M pages solves it for at least 250mb worth of RAM" on
>the Athlons.
>
>It would be interesting to hear what steps, if any, can be taken to minimize
>this problem. Both from a user's and a programmer's perspective.

Another quote, from the
"Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors":

"Software prefetching can help to hide the memory latency."

I'm not sure whether bypassing the L2 cache when probing/storing is a good
idea to reduce L2-cache pollution, but it's worth a try:


   prefetchnta [hashAddress]   ; Load a line into the L1 data cache; mark the
                               ; line so it won’t be evicted to L2.

// about 200-400 ns of processing
// most likely pure register processing (gp/mmx/sse2)
// with a few reads of an often-used source structure
//  and a few writes to an often-used target structure
// ....

// now doing the prefetched hash read, already in the first-level cache
// e.g. eight 128-bit entries
   movdqa xmm0, [hashAddress]
   movdqa xmm1, [hashAddress+16]
   movdqa xmm2, [hashAddress+32]
   movdqa xmm3, [hashAddress+48]
   movdqa xmm4, [hashAddress+64]
   movdqa xmm5, [hashAddress+80]
   movdqa xmm6, [hashAddress+96]
   movdqa xmm7, [hashAddress+112]

// ...
// writing to hash

   movntdq [hashAddress+...], xmm0
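
In C the same pattern might look roughly like this with SSE2 compiler
intrinsics (just a sketch - the HashBucket layout, the probe_and_store
name and the "replace slot 0" policy are made up for illustration):

#include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128 */
#include <xmmintrin.h>   /* SSE:  _mm_prefetch, _MM_HINT_NTA */

/* one 128-byte bucket: eight 16-byte (128-bit) entries, 16-byte aligned */
typedef struct {
    __m128i entry[8];
} HashBucket;

void probe_and_store(HashBucket *bucket, __m128i newEntry)
{
    __m128i e[8];
    int i;

    /* non-temporal prefetch, issued well before the bucket is read,
       so the memory latency overlaps the register-only work below */
    _mm_prefetch((const char *)bucket, _MM_HINT_NTA);

    /* ... 200-400 ns of gp/mmx/sse2 register processing ... */

    /* aligned 128-bit loads of the (hopefully) prefetched bucket */
    for (i = 0; i < 8; i++)
        e[i] = _mm_load_si128(&bucket->entry[i]);
    (void)e;  /* entry selection / probing would happen here */

    /* non-temporal (write-combining) store of the new entry,
       to minimize cache pollution */
    _mm_stream_si128(&bucket->entry[0], newEntry);
}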

A few further AMD comments on MOVNTDQ:

"Nontemporal writes, such as MOVNTQ, MOVNTPS, and MOVNTDQ, should only be used
on data that is not going to be accessed again in the near future."

"MOVNTDQ Move Non-Temporal Double Quadword

Stores a 128-bit (double quadword) XMM register value into a 128-bit memory
location. This instruction indicates to the processor that the data is
non-temporal, and is unlikely to be used again soon. The processor treats the
store as a write-combining (WC) memory write, which minimizes cache pollution.
The exact method by which cache pollution is minimized depends on the hardware
implementation of the instruction. For further information, see “Memory
Optimization” in Volume 1.
MOVNTDQ is weakly-ordered with respect to other instructions that operate on
memory. Software should use an SFENCE instruction to force strong memory
ordering of MOVNTDQ with respect to other stores."
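
So if following stores must not be observed ahead of the non-temporal
write (e.g. when the hash table is read by someone else), the SFENCE
belongs right after the MOVNTDQ. With intrinsics, again only a sketch:

#include <emmintrin.h>   /* SSE2: _mm_stream_si128 */
#include <xmmintrin.h>   /* SSE:  _mm_sfence */

/* store one entry non-temporally, then fence so that following
   ordinary stores are ordered after it */
static void store_entry_fenced(__m128i *slot, __m128i entry)
{
    _mm_stream_si128(slot, entry);  /* weakly-ordered WC store */
    _mm_sfence();                   /* force ordering vs. later stores */
}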


Gerd



