Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Another memory latency test

Author: J. Wesley Cleveland

Date: 09:58:56 07/23/03

Go up one level in this thread


On July 22, 2003 at 16:07:08, Robert Hyatt wrote:

>On July 22, 2003 at 15:58:00, J. Wesley Cleveland wrote:
>
>>On July 22, 2003 at 14:27:45, Robert Hyatt wrote:
>>
>>>On July 22, 2003 at 12:28:39, J. Wesley Cleveland wrote:
>>>
>>>>On July 22, 2003 at 08:07:18, Gerd Isenberg wrote:
>>>>
>>>>>On July 21, 2003 at 15:35:17, J. Wesley Cleveland wrote:
>>>>>
>>>>>>On July 18, 2003 at 23:45:16, Robert Hyatt wrote:
>>>>>>
>>>>>>>On July 18, 2003 at 21:58:18, J. Wesley Cleveland wrote:
>>>>>>>
>>>>>>>>On July 18, 2003 at 21:17:14, Robert Hyatt wrote:
>>>>>>>>
>>>>>>>>>On July 18, 2003 at 15:21:35, J. Wesley Cleveland wrote:
>>>>>>>>>
>>>>>>>>>>On July 17, 2003 at 18:25:51, Robert Hyatt wrote:
>>>>>>>>>>
>>>>>>>>>>>On July 17, 2003 at 17:35:33, Dieter Buerssner wrote:
>>>>>>>>>>>
>>>>>>>>>>[snip]
>>>>>>>>>>>>
>>>>>>>>>>>>I cannot find any randomness in the reads of lm-bench (I downloaded latest
>>>>>>>>>>>>stable source today, not the experimental version, available, too). If it would
>>>>>>>>>>>>do random reads, it would have no way to avoid the problem with the TLBs you
>>>>>>>>>>>>explained.
>>>>>>>>>>>
>>>>>>>>>>>4M pages solves it for at least 250mb worth of RAM.  But then again, _no_ chess
>>>>>>>>>>>program depends on purely random memory accesses to blow out the TLB.  The only
>>>>>>>>>>>truly random accesses I do are the regular hashing and pawn hashing, which
>>>>>>>>>>>both total to significantly less than the total nodes I search.  Which means
>>>>>>>>>>>the TLB penalty is not even 1% of my total run time.  Probably closer to
>>>>>>>>>>>.01% - .05%.
>>>>>>>>>>>
>>>>>>>>>>>I ignore that.
>>>>>>>>>>
>>>>>>>>>>Why do you think it is that low? I get ~20-30% of nodes have hash probes with
>>>>>>>>>>crafty.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>Look at the code.
>>>>>>>>I not only looked at the code. I *instrumented it*. I won't have complete
>>>>>>>>results until Monday, but it appears that crafty spends 3-5% of its total time
>>>>>>>>inside hashprobe on my (slow) machine and a prefetch could reduce that by about
>>>>>>>>half.
>>>>>>>>
>>>>>>>>>Crafty probes memory _once_ for a hash probe.  That
>>>>>>>>>introduces a memory access penalty once per node in the basic search,
>>>>>>>>>less than once per node in the q-search (I only probe phash there and I
>>>>>>>>>don't probe it but about 25% of the q-search nodes I visit).
>>>>>>>>
>>>>>>>>If you had read whai I wrote, you would see I said crafty does a hash probe
>>>>>>>>20-30% of its total nodes.
>>>>>>>
>>>>>>>OK.  I clearly mis-read what you meant.  the 20-30% was eye-catching as that
>>>>>>>is a pretty common hash hit percentage as well...
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>As a result, you get less than one probe per node searched.  A node searched
>>>>>>>>>requires something on the order of 3000-5000 instructions.  What percentage
>>>>>>>>>of that 3K-5K instruction timing is that single hash probe?  Almost zero.
>>>>>>>>
>>>>>>>>Except that a fast machine may do these 3-5K instructions in <1usec. A cache
>>>>>>>>miss + a TLB miss may take 300-400 ns. I would not call 30% almost 0.
>>>>>>>
>>>>>>>You are missing my point.  In the position(s) you tested, you saw 20-30%
>>>>>>>hash probes.  That means one probe for every 3-5 nodes.  At 1M nodes
>>>>>>>per second, that is 200K-300K probes per second.  If you measure the
>>>>>>>time spent in searching a single node, multiply that by 3-5X, then compare
>>>>>>>that to the hash probe time, the time spent probing the hash table is low.
>>>>>>>
>>>>>>>Note that your 5% is _not_ the total time used to probe the table.  It is
>>>>>>>the time to probe the table, and do it _twice_ although the second probe
>>>>>>>doesn't have any memory access penalty associated with it in most cases.
>>>>>>>
>>>>>>>So a big percent of that 5% is doing the actual work done in HashProbe(),
>>>>>>>rather than being all memory access penalty...
>>>>>>
>>>>>>I ran some tests on my slow (450 Mhz) machine. Hash was set to 192Mb. The test
>>>>>>was 21 middle-game positions and ran for nearly 1 hour. Crafty got between 125k
>>>>>>and 230k nps. Crafty spent 3.6% of total time in HashProbe. I added the
>>>>>>following code just before the call to RepetitionCheck() in search.c (slightly
>>>>>>modified from the code in hash.c). Note that the code is basically a no-op as
>>>>>>all variables are local.
>>>>>>
>>>>>>{
>>>>>>  static BITBOARD word1;
>>>>>>  BITBOARD temp_hashkey;
>>>>>>  HASH_ENTRY *htable;
>>>>>>/*
>>>>>> ----------------------------------------------------------
>>>>>>|                                                          |
>>>>>>|   first, compute the initial hash address and choose     |
>>>>>>|   which hash table (based on color) to probe.            |
>>>>>>|                                                          |
>>>>>> ----------------------------------------------------------
>>>>>>*/
>>>>>>
>>>>>>  temp_hashkey=(wtm) ? HashKey : ~HashKey;
>>>>>>  htable=trans_ref_a+((int) temp_hashkey&hash_maska);
>>>>>>  word1=htable->word1;
>>>>>>}
>>>>>>
>>>>>>Now crafty spends 2.8% of its time in HashProbe.
>>>>>
>>>>>Hi Wesley,
>>>>>
>>>>>that's interesting, it seems that preloading decreases the hash-latency.
>>>>>May be prefetching with Athlons's/Opteron's/P4's PREFETCHNTA, (bypassing
>>>>>L2-Cache) is even better.
>>>>>
>>>>>Gerd
>>>>
>>>>I'm sure it would be better. My code doesn't make it run any faster, it just
>>>>shows that the delay due to memory access is significant.
>>>>
>>>
>>>Can you tell me how you conclude this?
>>
>>The *only* effect of the code I added is to ensure that the depth-preferred part
>>of the hash table is put into cache, so any speedup in HashProbe is due to not
>>having a cache (and ATB) miss.
>
>I'm not sure you can even deduce that.  IE repetition check goes thru a long
>sequential list of memory values.  It is possible that they interact with the
>cache line with the hash entry.
>
>However, I now understand what you did.  You just replicated the code from
>HashProbe() to pre-fetch something that should be used again very quickly.
>
>>
>>>
>>>IE there are two parts in HashProbe();
>>>
>>>1.  probe "depth-preferred table".
>>>
>>>2.  probe "always-store" table".
>>>
>>>You are assuming that of the total 3.6% done in HashProbe(), that .8% is
>>>done in the always-store code.  Which means that .8% is done in the depth-
>>>preferred table, and the remaining time is memory latency.
>>>
>>>I don't think that is the explanation.
>>>
>>>Suppose _many_ hits occur in the depth-preferred table.  Then you won't be
>>>probing the always-store table at those positions.  And your .8% assumption
>>>is not so safe to make.  Unless you run huge searches with a small table,
>>>this effect will distort any possible conclusions.
>>>
>>>
>>>No way a single random access memory read is 3% of the total time spent
>>>doing a node.  There are way too many _other_ random-access reads done in
>>>crafty to make that possible.  The total time would go over 100%.
>>
>>At 1M nodes/sec, the time for 1 node is (obviously) 1 usec. The latency for one
>>cache miss is about 150 nsec. This implies that if you have *one* cache miss
>>every 4 nodes, you will spend 3% on that single random access memory read.
>>Apparently, caching works very well for crafty, except for HashProbe( ;).
>
>however, we know more than that.  IE 1usec/node is an _average_.  _every_
>node has a call to HashProbe().

We've been here before. Quiesce nodes don't call HashProbe().

>Which means _every_ node is likely going to
>see a cache miss for that particular memory read.  The entire point of the
>Zobrist hashing scheme is that a single change on the chess board produces a
>wild change in the hash signature, sending us to a different page of memory
>and the resulting hash miss.  If you use 450ns for a miss, you get one of
>those _every_ node.  What about the _other_ cache misses that happen?   one
>more and you are _done_ as you just exhausted the entire 1usec of time you
>have.  Conclusion?  it isn't taking _that_ long.
>

A simple experiment is to run bench with the smallest and largest possible hash
sizes and compare the raw node times. The difference should be largely
attrubutable to memory latency. You may want to run it several times to see how
much variability there is between runs.

[snip]



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.