Computer Chess Club Archives




Subject: Re: Another memory latency test

Author: Robert Hyatt

Date: 13:07:08 07/22/03

Go up one level in this thread

On July 22, 2003 at 15:58:00, J. Wesley Cleveland wrote:

>On July 22, 2003 at 14:27:45, Robert Hyatt wrote:
>>On July 22, 2003 at 12:28:39, J. Wesley Cleveland wrote:
>>>On July 22, 2003 at 08:07:18, Gerd Isenberg wrote:
>>>>On July 21, 2003 at 15:35:17, J. Wesley Cleveland wrote:
>>>>>On July 18, 2003 at 23:45:16, Robert Hyatt wrote:
>>>>>>On July 18, 2003 at 21:58:18, J. Wesley Cleveland wrote:
>>>>>>>On July 18, 2003 at 21:17:14, Robert Hyatt wrote:
>>>>>>>>On July 18, 2003 at 15:21:35, J. Wesley Cleveland wrote:
>>>>>>>>>On July 17, 2003 at 18:25:51, Robert Hyatt wrote:
>>>>>>>>>>On July 17, 2003 at 17:35:33, Dieter Buerssner wrote:
>>>>>>>>>>>I cannot find any randomness in the reads of lm-bench (I downloaded latest
>>>>>>>>>>>stable source today, not the experimental version, available, too). If it would
>>>>>>>>>>>do random reads, it would have no way to avoid the problem with the TLBs you
>>>>>>>>>>4M pages solves it for at least 250mb worth of RAM.  But then again, _no_ chess
>>>>>>>>>>program depends on purely random memory accesses to blow out the TLB.  The only
>>>>>>>>>>truly random accesses I do are the regular hashing and pawn hashing, which
>>>>>>>>>>both total to significantly less than the total nodes I search.  Which means
>>>>>>>>>>the TLB penalty is not even 1% of my total run time.  Probably closer to
>>>>>>>>>>.01% - .05%.
>>>>>>>>>>I ignore that.
>>>>>>>>>Why do you think it is that low? I get ~20-30% of nodes have hash probes with
>>>>>>>>Look at the code.
>>>>>>>I not only looked at the code. I *instrumented it*. I won't have complete
>>>>>>>results until Monday, but it appears that crafty spends 3-5% of its total time
>>>>>>>inside hashprobe on my (slow) machine and a prefetch could reduce that by about
>>>>>>>>Crafty probes memory _once_ for a hash probe.  That
>>>>>>>>introduces a memory access penalty once per node in the basic search,
>>>>>>>>less than once per node in the q-search (I only probe phash there and I
>>>>>>>>don't probe it but about 25% of the q-search nodes I visit).
>>>>>>>If you had read whai I wrote, you would see I said crafty does a hash probe
>>>>>>>20-30% of its total nodes.
>>>>>>OK.  I clearly mis-read what you meant.  the 20-30% was eye-catching as that
>>>>>>is a pretty common hash hit percentage as well...
>>>>>>>>As a result, you get less than one probe per node searched.  A node searched
>>>>>>>>requires something on the order of 3000-5000 instructions.  What percentage
>>>>>>>>of that 3K-5K instruction timing is that single hash probe?  Almost zero.
>>>>>>>Except that a fast machine may do these 3-5K instructions in <1usec. A cache
>>>>>>>miss + a TLB miss may take 300-400 ns. I would not call 30% almost 0.
>>>>>>You are missing my point.  In the position(s) you tested, you saw 20-30%
>>>>>>hash probes.  That means one probe for every 3-5 nodes.  At 1M nodes
>>>>>>per second, that is 200K-300K probes per second.  If you measure the
>>>>>>time spent in searching a single node, multiply that by 3-5X, then compare
>>>>>>that to the hash probe time, the time spent probing the hash table is low.
>>>>>>Note that your 5% is _not_ the total time used to probe the table.  It is
>>>>>>the time to probe the table, and do it _twice_ although the second probe
>>>>>>doesn't have any memory access penalty associated with it in most cases.
>>>>>>So a big percent of that 5% is doing the actual work done in HashProbe(),
>>>>>>rather than being all memory access penalty...
>>>>>I ran some tests on my slow (450 Mhz) machine. Hash was set to 192Mb. The test
>>>>>was 21 middle-game positions and ran for nearly 1 hour. Crafty got between 125k
>>>>>and 230k nps. Crafty spent 3.6% of total time in HashProbe. I added the
>>>>>following code just before the call to RepetitionCheck() in search.c (slightly
>>>>>modified from the code in hash.c). Note that the code is basically a no-op as
>>>>>all variables are local.
>>>>>  static BITBOARD word1;
>>>>>  BITBOARD temp_hashkey;
>>>>>  HASH_ENTRY *htable;
>>>>> ----------------------------------------------------------
>>>>>|                                                          |
>>>>>|   first, compute the initial hash address and choose     |
>>>>>|   which hash table (based on color) to probe.            |
>>>>>|                                                          |
>>>>> ----------------------------------------------------------
>>>>>  temp_hashkey=(wtm) ? HashKey : ~HashKey;
>>>>>  htable=trans_ref_a+((int) temp_hashkey&hash_maska);
>>>>>  word1=htable->word1;
>>>>>Now crafty spends 2.8% of its time in HashProbe.
>>>>Hi Wesley,
>>>>that's interesting, it seems that preloading decreases the hash-latency.
>>>>May be prefetching with Athlons's/Opteron's/P4's PREFETCHNTA, (bypassing
>>>>L2-Cache) is even better.
>>>I'm sure it would be better. My code doesn't make it run any faster, it just
>>>shows that the delay due to memory access is significant.
>>Can you tell me how you conclude this?
>The *only* effect of the code I added is to ensure that the depth-preferred part
>of the hash table is put into cache, so any speedup in HashProbe is due to not
>having a cache (and ATB) miss.

I'm not sure you can even deduce that.  IE repetition check goes thru a long
sequential list of memory values.  It is possible that they interact with the
cache line with the hash entry.

However, I now understand what you did.  You just replicated the code from
HashProbe() to pre-fetch something that should be used again very quickly.

>>IE there are two parts in HashProbe();
>>1.  probe "depth-preferred table".
>>2.  probe "always-store" table".
>>You are assuming that of the total 3.6% done in HashProbe(), that .8% is
>>done in the always-store code.  Which means that .8% is done in the depth-
>>preferred table, and the remaining time is memory latency.
>>I don't think that is the explanation.
>>Suppose _many_ hits occur in the depth-preferred table.  Then you won't be
>>probing the always-store table at those positions.  And your .8% assumption
>>is not so safe to make.  Unless you run huge searches with a small table,
>>this effect will distort any possible conclusions.
>>No way a single random access memory read is 3% of the total time spent
>>doing a node.  There are way too many _other_ random-access reads done in
>>crafty to make that possible.  The total time would go over 100%.
>At 1M nodes/sec, the time for 1 node is (obviously) 1 usec. The latency for one
>cache miss is about 150 nsec. This implies that if you have *one* cache miss
>every 4 nodes, you will spend 3% on that single random access memory read.
>Apparently, caching works very well for crafty, except for HashProbe( ;).

however, we know more than that.  IE 1usec/node is an _average_.  _every_
node has a call to HashProbe().  Which means _every_ node is likely going to
see a cache miss for that particular memory read.  The entire point of the
Zobrist hashing scheme is that a single change on the chess board produces a
wild change in the hash signature, sending us to a different page of memory
and the resulting hash miss.  If you use 450ns for a miss, you get one of
those _every_ node.  What about the _other_ cache misses that happen?   one
more and you are _done_ as you just exhausted the entire 1usec of time you
have.  Conclusion?  it isn't taking _that_ long.

>Again, my figures are on my slow machine. Your machine is ~6x faster, while your
>memory latency is not much better, so I suspect the figures will be much worse
>on your machine. You may want to test this for yourself.

I'll try to do this when I can.  However, as you see from the above, I
do a HashProbe() at _every_ node.  And it is very likely that _every_ one
is a TLB miss and a cache miss.  So _every_ node is going to incur an
access penalty to translate the virtual to real address and then to fetch
the data.  This means one random access read might turn into two, or worst-
case three random access reads.  If we are losing 450nsec there, out of
1000nsec total per node, then _no_ other memory read can be a cache miss/
TLB miss.  Otherwise we are out of time.

The math just doesn't add up.

The simplest flaw in Vincent's testing is that _all_ he is doing is frying
the memory access data path with page translations and data reads.  With
nothing to interlace among them.  Fortunately, a chess engine doesn't behave
like that.

>>>>  static BITBOARD word1;
>>>>  BITBOARD temp_hashkey;
>>>>  HASH_ENTRY *htable;
>>>> ----------------------------------------------------------
>>>>|                                                          |
>>>>|   first, compute the initial hash address and choose     |
>>>>|   which hash table (based on color) to probe.            |
>>>>|                                                          |
>>>> ----------------------------------------------------------
>>>>  temp_hashkey=(wtm) ? HashKey : ~HashKey;
>>>>  htable=trans_ref_a+((int) temp_hashkey&hash_maska);
>>>>#ifdef _DOPREFETCH
>>>>  __asm mov eax, [htable]; // get the pointer
>>>>  __asm PREFETCHNTA [eax]; // fetch to L1-cache, bypassing L2-Cache
>>>>  word1=htable->word1;
>>>>some additional notes from:
>>>>"AMD Athlon™ Processor x86 Code Optimization Guide"
>>>>Prefetching versus Preloading
>>>>In code that uses the block prefetch technique as described in
>>>>“Optimizing Main Memory Performance for Large Arrays” on page 66, a standard
>>>>load instruction is the best way to prefetch data. But in other situations, load
>>>>instructions may be able to mimic the functionality of prefetch instructions,
>>>>but they do not offer the same performance advantage.Prefetch instructions only
>>>>update the cache line in the L1/L2 cache and do not update an architectural
>>>>register. This uses one less register compared to a load instruction. Prefetch
>>>>instructions also do not cause
>>>>normal instruction retirement to stall. Another benefit of prefetching versus
>>>>preloading is that the prefetching instructions can retire even if the load data
>>>>has not arrived yet. A regular load used for preloading will stall the machine
>>>>if it gets to the bottom of the fixed-issue reorder buffer (part of the
>>>>Instruction Control Unit) and the load data has not arrived yet. The load is
>>>>"blocking" whereas the prefetch is "non-blocking."

This page took 0.02 seconds to execute

Last modified: Thu, 07 Jul 11 08:48:38 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.