Computer Chess Club Archives




Subject: Re: Another memory latency test

Author: Robert Hyatt

Date: 20:59:20 07/24/03


On July 23, 2003 at 18:32:23, Vincent Diepeveen wrote:

>On July 22, 2003 at 15:58:00, J. Wesley Cleveland wrote:
>>On July 22, 2003 at 14:27:45, Robert Hyatt wrote:
>>>On July 22, 2003 at 12:28:39, J. Wesley Cleveland wrote:
>>>>On July 22, 2003 at 08:07:18, Gerd Isenberg wrote:
>>>>>On July 21, 2003 at 15:35:17, J. Wesley Cleveland wrote:
>>>>>>On July 18, 2003 at 23:45:16, Robert Hyatt wrote:
>>>>>>>On July 18, 2003 at 21:58:18, J. Wesley Cleveland wrote:
>>>>>>>>On July 18, 2003 at 21:17:14, Robert Hyatt wrote:
>>>>>>>>>On July 18, 2003 at 15:21:35, J. Wesley Cleveland wrote:
>>>>>>>>>>On July 17, 2003 at 18:25:51, Robert Hyatt wrote:
>>>>>>>>>>>On July 17, 2003 at 17:35:33, Dieter Buerssner wrote:
>>>>>>>>>>>>I cannot find any randomness in the reads of lm-bench (I downloaded latest
>>>>>>>>>>>>stable source today, not the experimental version, available, too). If it would
>>>>>>>>>>>>do random reads, it would have no way to avoid the problem with the TLBs you
>>>>>>>>>>>4M pages solves it for at least 250mb worth of RAM.  But then again, _no_ chess
>>>>>>>>>>>program depends on purely random memory accesses to blow out the TLB.  The only
>>>>>>>>>>>truly random accesses I do are the regular hashing and pawn hashing, which
>>>>>>>>>>>both total to significantly less than the total nodes I search.  Which means
>>>>>>>>>>>the TLB penalty is not even 1% of my total run time.  Probably closer to
>>>>>>>>>>>.01% - .05%.
>>>>>>>>>>>I ignore that.
>>>>>>>>>>Why do you think it is that low? I get ~20-30% of nodes have hash probes with
>>>>>>>>>Look at the code.
>>>>>>>>I not only looked at the code. I *instrumented it*. I won't have complete
>>>>>>>>results until Monday, but it appears that crafty spends 3-5% of its total time
>>>>>>>>inside hashprobe on my (slow) machine and a prefetch could reduce that by about
>>>>>>>>>Crafty probes memory _once_ for a hash probe.  That
>>>>>>>>>introduces a memory access penalty once per node in the basic search,
>>>>>>>>>less than once per node in the q-search (I only probe phash there and I
>>>>>>>>>don't probe it but about 25% of the q-search nodes I visit).
>>>>>>>>If you had read what I wrote, you would see I said crafty does a hash probe
>>>>>>>>on 20-30% of its total nodes.
>>>>>>>OK.  I clearly mis-read what you meant.  The 20-30% was eye-catching as that
>>>>>>>is a pretty common hash hit percentage as well...
>>>>>>>>>As a result, you get less than one probe per node searched.  A node searched
>>>>>>>>>requires something on the order of 3000-5000 instructions.  What percentage
>>>>>>>>>of that 3K-5K instruction timing is that single hash probe?  Almost zero.
>>>>>>>>Except that a fast machine may do these 3-5K instructions in <1usec. A cache
>>>>>>>>miss + a TLB miss may take 300-400 ns. I would not call 30% almost 0.
>>>>>>>You are missing my point.  In the position(s) you tested, you saw 20-30%
>>>>>>>hash probes.  That means one probe for every 3-5 nodes.  At 1M nodes
>>>>>>>per second, that is 200K-300K probes per second.  If you measure the
>>>>>>>time spent in searching a single node, multiply that by 3-5X, then compare
>>>>>>>that to the hash probe time, the time spent probing the hash table is low.
>>>>>>>Note that your 5% is _not_ the total time used to probe the table.  It is
>>>>>>>the time to probe the table, and do it _twice_ although the second probe
>>>>>>>doesn't have any memory access penalty associated with it in most cases.
>>>>>>>So a big percent of that 5% is doing the actual work done in HashProbe(),
>>>>>>>rather than being all memory access penalty...
>>>>>>I ran some tests on my slow (450 Mhz) machine. Hash was set to 192Mb. The test
>>>>>>was 21 middle-game positions and ran for nearly 1 hour. Crafty got between 125k
>>>>>>and 230k nps. Crafty spent 3.6% of total time in HashProbe. I added the
>>>>>>following code just before the call to RepetitionCheck() in search.c (slightly
>>>>>>modified from the code in hash.c). Note that the code is basically a no-op as
>>>>>>all variables are local.
>>>>>>  static BITBOARD word1;
>>>>>>  BITBOARD temp_hashkey;
>>>>>>  HASH_ENTRY *htable;
>>>>>> ----------------------------------------------------------
>>>>>>|                                                          |
>>>>>>|   first, compute the initial hash address and choose     |
>>>>>>|   which hash table (based on color) to probe.            |
>>>>>>|                                                          |
>>>>>> ----------------------------------------------------------
>>>>>>  temp_hashkey=(wtm) ? HashKey : ~HashKey;
>>>>>>  htable=trans_ref_a+((int) temp_hashkey&hash_maska);
>>>>>>  word1=htable->word1;
>>>>>>Now crafty spends 2.8% of its time in HashProbe.
>>>>>Hi Wesley,
>>>>>that's interesting, it seems that preloading decreases the hash-latency.
>>>>>Maybe prefetching with Athlon's/Opteron's/P4's PREFETCHNTA (bypassing
>>>>>L2-Cache) is even better.
>>>>I'm sure it would be better. My code doesn't make it run any faster, it just
>>>>shows that the delay due to memory access is significant.
>>>Can you tell me how you conclude this?
>>The *only* effect of the code I added is to ensure that the depth-preferred part
>>of the hash table is put into cache, so any speedup in HashProbe is due to not
>>having a cache (and TLB) miss.
>>>IE there are two parts in HashProbe();
>>>1.  probe the "depth-preferred" table.
>>>2.  probe the "always-store" table.
>>>You are assuming that of the total 3.6% spent in HashProbe(), the .8%
>>>difference is pure memory latency saved on the depth-preferred probe, and
>>>the remaining 2.8% is actual work.
>>>I don't think that is the explanation.
>>>Suppose _many_ hits occur in the depth-preferred table.  Then you won't be
>>>probing the always-store table at those positions.  And your .8% assumption
>>>is not so safe to make.  Unless you run huge searches with a small table,
>>>this effect will distort any possible conclusions.
>>>No way a single random access memory read is 3% of the total time spent
>>>doing a node.  There are way too many _other_ random-access reads done in
>>>crafty to make that possible.  The total time would go over 100%.
>>At 1M nodes/sec, the time for 1 node is (obviously) 1 usec. The latency for one
>>cache miss is about 150 nsec. This implies that if you have *one* cache miss
>>every 4 nodes, you will spend 3% on that single random access memory read.
>>Apparently, caching works very well for crafty, except for HashProbe(). ;)
>==> that should be dual
>At bob's dual the latency for a miss into random memory not yet opened is
>about 400ns, not 150ns.
>Best regards,

Again, _NO_ it isn't.  The memory latency is 150 ns.  The memory _access_
time can be one, two, or three cycles, depending on what you miss.  IE it is
possible that most of the time a memory access is two cycles, if you do
random reads to a _large_ block of memory.  And it is even possible that
each memory access requires three reads, two to map virtual to real, and
one for the data, but this is much less likely.

You are _not_ talking about "latency".  You are talking about "access time"
which is a different measure.  Latency is _static_.  Access time can vary
depending on how you use (or don't use) the TLB.

>>Again, my figures are on my slow machine. Your machine is ~6x faster, while your
>>memory latency is not much better, so I suspect the figures will be much worse
>>on your machine. You may want to test this for yourself.
>>>>>  static BITBOARD word1;
>>>>>  BITBOARD temp_hashkey;
>>>>>  HASH_ENTRY *htable;
>>>>> ----------------------------------------------------------
>>>>>|                                                          |
>>>>>|   first, compute the initial hash address and choose     |
>>>>>|   which hash table (based on color) to probe.            |
>>>>>|                                                          |
>>>>> ----------------------------------------------------------
>>>>>  temp_hashkey=(wtm) ? HashKey : ~HashKey;
>>>>>  htable=trans_ref_a+((int) temp_hashkey&hash_maska);
>>>>>#ifdef _DOPREFETCH
>>>>>  __asm mov eax, [htable]; // get the pointer
>>>>>  __asm PREFETCHNTA [eax]; // fetch to L1-cache, bypassing L2-Cache
>>>>>#else
>>>>>  word1=htable->word1;     // plain preload
>>>>>#endif
>>>>>some additional notes from:
>>>>>"AMD Athlon™ Processor x86 Code Optimization Guide"
>>>>>Prefetching versus Preloading
>>>>>In code that uses the block prefetch technique as described in
>>>>>“Optimizing Main Memory Performance for Large Arrays” on page 66, a standard
>>>>>load instruction is the best way to prefetch data. But in other situations, load
>>>>>instructions may be able to mimic the functionality of prefetch instructions,
>>>>>but they do not offer the same performance advantage.Prefetch instructions only
>>>>>update the cache line in the L1/L2 cache and do not update an architectural
>>>>>register. This uses one less register compared to a load instruction. Prefetch
>>>>>instructions also do not cause
>>>>>normal instruction retirement to stall. Another benefit of prefetching versus
>>>>>preloading is that the prefetching instructions can retire even if the load data
>>>>>has not arrived yet. A regular load used for preloading will stall the machine
>>>>>if it gets to the bottom of the fixed-issue reorder buffer (part of the
>>>>>Instruction Control Unit) and the load data has not arrived yet. The load is
>>>>>"blocking" whereas the prefetch is "non-blocking."


Last modified: Thu, 07 Jul 11 08:48:38 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.