Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Another memory latency test

Author: Vincent Diepeveen

Date: 15:37:03 07/23/03

Go up one level in this thread


On July 23, 2003 at 12:58:56, J. Wesley Cleveland wrote:

>On July 22, 2003 at 16:07:08, Robert Hyatt wrote:
>
>>On July 22, 2003 at 15:58:00, J. Wesley Cleveland wrote:
>>
>>>On July 22, 2003 at 14:27:45, Robert Hyatt wrote:
>>>
>>>>On July 22, 2003 at 12:28:39, J. Wesley Cleveland wrote:
>>>>
>>>>>On July 22, 2003 at 08:07:18, Gerd Isenberg wrote:
>>>>>
>>>>>>On July 21, 2003 at 15:35:17, J. Wesley Cleveland wrote:
>>>>>>
>>>>>>>On July 18, 2003 at 23:45:16, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>On July 18, 2003 at 21:58:18, J. Wesley Cleveland wrote:
>>>>>>>>
>>>>>>>>>On July 18, 2003 at 21:17:14, Robert Hyatt wrote:
>>>>>>>>>
>>>>>>>>>>On July 18, 2003 at 15:21:35, J. Wesley Cleveland wrote:
>>>>>>>>>>
>>>>>>>>>>>On July 17, 2003 at 18:25:51, Robert Hyatt wrote:
>>>>>>>>>>>
>>>>>>>>>>>>On July 17, 2003 at 17:35:33, Dieter Buerssner wrote:
>>>>>>>>>>>>
>>>>>>>>>>>[snip]
>>>>>>>>>>>>>
>>>>>>>>>>>>>I cannot find any randomness in the reads of lm-bench (I downloaded latest
>>>>>>>>>>>>>stable source today, not the experimental version, available, too). If it would
>>>>>>>>>>>>>do random reads, it would have no way to avoid the problem with the TLBs you
>>>>>>>>>>>>>explained.
>>>>>>>>>>>>
>>>>>>>>>>>>4M pages solves it for at least 250mb worth of RAM.  But then again, _no_ chess
>>>>>>>>>>>>program depends on purely random memory accesses to blow out the TLB.  The only
>>>>>>>>>>>>truly random accesses I do are the regular hashing and pawn hashing, which
>>>>>>>>>>>>both total to significantly less than the total nodes I search.  Which means
>>>>>>>>>>>>the TLB penalty is not even 1% of my total run time.  Probably closer to
>>>>>>>>>>>>.01% - .05%.
>>>>>>>>>>>>
>>>>>>>>>>>>I ignore that.
>>>>>>>>>>>
>>>>>>>>>>>Why do you think it is that low? I get ~20-30% of nodes have hash probes with
>>>>>>>>>>>crafty.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>Look at the code.
>>>>>>>>>I not only looked at the code. I *instrumented it*. I won't have complete
>>>>>>>>>results until Monday, but it appears that crafty spends 3-5% of its total time
>>>>>>>>>inside hashprobe on my (slow) machine and a prefetch could reduce that by about
>>>>>>>>>half.
>>>>>>>>>
>>>>>>>>>>Crafty probes memory _once_ for a hash probe.  That
>>>>>>>>>>introduces a memory access penalty once per node in the basic search,
>>>>>>>>>>less than once per node in the q-search (I only probe phash there and I
>>>>>>>>>>don't probe it but about 25% of the q-search nodes I visit).
>>>>>>>>>
>>>>>>>>>If you had read whai I wrote, you would see I said crafty does a hash probe
>>>>>>>>>20-30% of its total nodes.
>>>>>>>>
>>>>>>>>OK.  I clearly mis-read what you meant.  the 20-30% was eye-catching as that
>>>>>>>>is a pretty common hash hit percentage as well...
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>As a result, you get less than one probe per node searched.  A node searched
>>>>>>>>>>requires something on the order of 3000-5000 instructions.  What percentage
>>>>>>>>>>of that 3K-5K instruction timing is that single hash probe?  Almost zero.
>>>>>>>>>
>>>>>>>>>Except that a fast machine may do these 3-5K instructions in <1usec. A cache
>>>>>>>>>miss + a TLB miss may take 300-400 ns. I would not call 30% almost 0.
>>>>>>>>
>>>>>>>>You are missing my point.  In the position(s) you tested, you saw 20-30%
>>>>>>>>hash probes.  That means one probe for every 3-5 nodes.  At 1M nodes
>>>>>>>>per second, that is 200K-300K probes per second.  If you measure the
>>>>>>>>time spent in searching a single node, multiply that by 3-5X, then compare
>>>>>>>>that to the hash probe time, the time spent probing the hash table is low.
>>>>>>>>
>>>>>>>>Note that your 5% is _not_ the total time used to probe the table.  It is
>>>>>>>>the time to probe the table, and do it _twice_ although the second probe
>>>>>>>>doesn't have any memory access penalty associated with it in most cases.
>>>>>>>>
>>>>>>>>So a big percent of that 5% is doing the actual work done in HashProbe(),
>>>>>>>>rather than being all memory access penalty...
>>>>>>>
>>>>>>>I ran some tests on my slow (450 Mhz) machine. Hash was set to 192Mb. The test
>>>>>>>was 21 middle-game positions and ran for nearly 1 hour. Crafty got between 125k
>>>>>>>and 230k nps. Crafty spent 3.6% of total time in HashProbe. I added the
>>>>>>>following code just before the call to RepetitionCheck() in search.c (slightly
>>>>>>>modified from the code in hash.c). Note that the code is basically a no-op as
>>>>>>>all variables are local.
>>>>>>>
>>>>>>>{
>>>>>>>  static BITBOARD word1;
>>>>>>>  BITBOARD temp_hashkey;
>>>>>>>  HASH_ENTRY *htable;
>>>>>>>/*
>>>>>>> ----------------------------------------------------------
>>>>>>>|                                                          |
>>>>>>>|   first, compute the initial hash address and choose     |
>>>>>>>|   which hash table (based on color) to probe.            |
>>>>>>>|                                                          |
>>>>>>> ----------------------------------------------------------
>>>>>>>*/
>>>>>>>
>>>>>>>  temp_hashkey=(wtm) ? HashKey : ~HashKey;
>>>>>>>  htable=trans_ref_a+((int) temp_hashkey&hash_maska);
>>>>>>>  word1=htable->word1;
>>>>>>>}
>>>>>>>
>>>>>>>Now crafty spends 2.8% of its time in HashProbe.
>>>>>>
>>>>>>Hi Wesley,
>>>>>>
>>>>>>that's interesting, it seems that preloading decreases the hash-latency.
>>>>>>May be prefetching with Athlons's/Opteron's/P4's PREFETCHNTA, (bypassing
>>>>>>L2-Cache) is even better.
>>>>>>
>>>>>>Gerd
>>>>>
>>>>>I'm sure it would be better. My code doesn't make it run any faster, it just
>>>>>shows that the delay due to memory access is significant.
>>>>>
>>>>
>>>>Can you tell me how you conclude this?
>>>
>>>The *only* effect of the code I added is to ensure that the depth-preferred part
>>>of the hash table is put into cache, so any speedup in HashProbe is due to not
>>>having a cache (and ATB) miss.
>>
>>I'm not sure you can even deduce that.  IE repetition check goes thru a long
>>sequential list of memory values.  It is possible that they interact with the
>>cache line with the hash entry.
>>
>>However, I now understand what you did.  You just replicated the code from
>>HashProbe() to pre-fetch something that should be used again very quickly.
>>
>>>
>>>>
>>>>IE there are two parts in HashProbe();
>>>>
>>>>1.  probe "depth-preferred table".
>>>>
>>>>2.  probe "always-store" table".
>>>>
>>>>You are assuming that of the total 3.6% done in HashProbe(), that .8% is
>>>>done in the always-store code.  Which means that .8% is done in the depth-
>>>>preferred table, and the remaining time is memory latency.
>>>>
>>>>I don't think that is the explanation.
>>>>
>>>>Suppose _many_ hits occur in the depth-preferred table.  Then you won't be
>>>>probing the always-store table at those positions.  And your .8% assumption
>>>>is not so safe to make.  Unless you run huge searches with a small table,
>>>>this effect will distort any possible conclusions.
>>>>
>>>>
>>>>No way a single random access memory read is 3% of the total time spent
>>>>doing a node.  There are way too many _other_ random-access reads done in
>>>>crafty to make that possible.  The total time would go over 100%.
>>>
>>>At 1M nodes/sec, the time for 1 node is (obviously) 1 usec. The latency for one
>>>cache miss is about 150 nsec. This implies that if you have *one* cache miss
>>>every 4 nodes, you will spend 3% on that single random access memory read.
>>>Apparently, caching works very well for crafty, except for HashProbe( ;).
>>
>>however, we know more than that.  IE 1usec/node is an _average_.  _every_
>>node has a call to HashProbe().
>
>We've been here before. Quiesce nodes don't call HashProbe().

They do call pawnhashtable at every evaluated leaf and 40% of the nodes is not
qsearch.

For 40% of the nodes you do both a READ and a WRITE.

A write is MORE expensive than a READ when done to a random location.

Because after each node in normal search you already first go do a lot of stuff
for in qsearch, you sure can't assume that this slot is still opened to memory.

So the WRITE is in 90% of the cases very expensive.

So the faster the cpu the higher the % you lose to latency of RAM.

We're talking about 10% for sure at your dual machine.

When measured for specint at a single cpu machine it was already nearly 10%
of system time. As we know latency of a dual machine is quite something slower.

Not comparable with quads.

Best regards,
Vincent

>>Which means _every_ node is likely going to
>>see a cache miss for that particular memory read.  The entire point of the
>>Zobrist hashing scheme is that a single change on the chess board produces a
>>wild change in the hash signature, sending us to a different page of memory
>>and the resulting hash miss.  If you use 450ns for a miss, you get one of
>>those _every_ node.  What about the _other_ cache misses that happen?   one
>>more and you are _done_ as you just exhausted the entire 1usec of time you
>>have.  Conclusion?  it isn't taking _that_ long.
>>
>
>A simple experiment is to run bench with the smallest and largest possible hash
>sizes and compare the raw node times. The difference should be largely
>attrubutable to memory latency. You may want to run it several times to see how
>much variability there is between runs.
>
>[snip]



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.