Computer Chess Club Archives



Subject: Re: Another memory latency test

Author: Robert Hyatt

Date: 21:19:33 07/23/03



On July 23, 2003 at 18:37:03, Vincent Diepeveen wrote:

>On July 23, 2003 at 12:58:56, J. Wesley Cleveland wrote:
>
>>On July 22, 2003 at 16:07:08, Robert Hyatt wrote:
>>
>>>On July 22, 2003 at 15:58:00, J. Wesley Cleveland wrote:
>>>
>>>>On July 22, 2003 at 14:27:45, Robert Hyatt wrote:
>>>>
>>>>>On July 22, 2003 at 12:28:39, J. Wesley Cleveland wrote:
>>>>>
>>>>>>On July 22, 2003 at 08:07:18, Gerd Isenberg wrote:
>>>>>>
>>>>>>>On July 21, 2003 at 15:35:17, J. Wesley Cleveland wrote:
>>>>>>>
>>>>>>>>On July 18, 2003 at 23:45:16, Robert Hyatt wrote:
>>>>>>>>
>>>>>>>>>On July 18, 2003 at 21:58:18, J. Wesley Cleveland wrote:
>>>>>>>>>
>>>>>>>>>>On July 18, 2003 at 21:17:14, Robert Hyatt wrote:
>>>>>>>>>>
>>>>>>>>>>>On July 18, 2003 at 15:21:35, J. Wesley Cleveland wrote:
>>>>>>>>>>>
>>>>>>>>>>>>On July 17, 2003 at 18:25:51, Robert Hyatt wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>On July 17, 2003 at 17:35:33, Dieter Buerssner wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>[snip]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>I cannot find any randomness in the reads of lm-bench (I downloaded latest
>>>>>>>>>>>>>>stable source today, not the experimental version, available, too). If it would
>>>>>>>>>>>>>>do random reads, it would have no way to avoid the problem with the TLBs you
>>>>>>>>>>>>>>explained.
>>>>>>>>>>>>>
>>>>>>>>>>>>>4M pages solves it for at least 250MB worth of RAM.  But then again, _no_ chess
>>>>>>>>>>>>>program depends on purely random memory accesses to blow out the TLB.  The only
>>>>>>>>>>>>>truly random accesses I do are the regular hashing and pawn hashing, which
>>>>>>>>>>>>>both total to significantly less than the total nodes I search.  Which means
>>>>>>>>>>>>>the TLB penalty is not even 1% of my total run time.  Probably closer to
>>>>>>>>>>>>>.01% - .05%.
>>>>>>>>>>>>>
>>>>>>>>>>>>>I ignore that.
>>>>>>>>>>>>
>>>>>>>>>>>>Why do you think it is that low? I get ~20-30% of nodes have hash probes with
>>>>>>>>>>>>crafty.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>Look at the code.
>>>>>>>>>>I not only looked at the code. I *instrumented it*. I won't have complete
>>>>>>>>>>results until Monday, but it appears that crafty spends 3-5% of its total time
>>>>>>>>>>inside hashprobe on my (slow) machine and a prefetch could reduce that by about
>>>>>>>>>>half.
>>>>>>>>>>
>>>>>>>>>>>Crafty probes memory _once_ for a hash probe.  That
>>>>>>>>>>>introduces a memory access penalty once per node in the basic search,
>>>>>>>>>>>less than once per node in the q-search (I only probe phash there and I
>>>>>>>>>>>don't probe it but about 25% of the q-search nodes I visit).
>>>>>>>>>>
>>>>>>>>>>If you had read what I wrote, you would see I said crafty does a hash probe
>>>>>>>>>>on 20-30% of its total nodes.
>>>>>>>>>
>>>>>>>>>OK.  I clearly misread what you meant.  The 20-30% was eye-catching, as that
>>>>>>>>>is a pretty common hash hit percentage as well...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>As a result, you get less than one probe per node searched.  A node searched
>>>>>>>>>>>requires something on the order of 3000-5000 instructions.  What percentage
>>>>>>>>>>>of that 3K-5K instruction timing is that single hash probe?  Almost zero.
>>>>>>>>>>
>>>>>>>>>>Except that a fast machine may do these 3-5K instructions in <1usec. A cache
>>>>>>>>>>miss + a TLB miss may take 300-400 ns. I would not call 30% almost 0.
>>>>>>>>>
>>>>>>>>>You are missing my point.  In the position(s) you tested, you saw 20-30%
>>>>>>>>>hash probes.  That means one probe for every 3-5 nodes.  At 1M nodes
>>>>>>>>>per second, that is 200K-300K probes per second.  If you measure the
>>>>>>>>>time spent in searching a single node, multiply that by 3-5X, then compare
>>>>>>>>>that to the hash probe time, the time spent probing the hash table is low.
>>>>>>>>>
>>>>>>>>>Note that your 5% is _not_ the total time used to probe the table.  It is
>>>>>>>>>the time to probe the table, and do it _twice_ although the second probe
>>>>>>>>>doesn't have any memory access penalty associated with it in most cases.
>>>>>>>>>
>>>>>>>>>So a big percent of that 5% is doing the actual work done in HashProbe(),
>>>>>>>>>rather than being all memory access penalty...
>>>>>>>>
>>>>>>>>I ran some tests on my slow (450 MHz) machine. Hash was set to 192MB. The test
>>>>>>>>was 21 middle-game positions and ran for nearly 1 hour. Crafty got between 125k
>>>>>>>>and 230k nps. Crafty spent 3.6% of total time in HashProbe. I added the
>>>>>>>>following code just before the call to RepetitionCheck() in search.c (slightly
>>>>>>>>modified from the code in hash.c). Note that the code is basically a no-op as
>>>>>>>>all variables are local.
>>>>>>>>
>>>>>>>>{
>>>>>>>>  static BITBOARD word1;
>>>>>>>>  BITBOARD temp_hashkey;
>>>>>>>>  HASH_ENTRY *htable;
>>>>>>>>/*
>>>>>>>> ----------------------------------------------------------
>>>>>>>>|                                                          |
>>>>>>>>|   first, compute the initial hash address and choose     |
>>>>>>>>|   which hash table (based on color) to probe.            |
>>>>>>>>|                                                          |
>>>>>>>> ----------------------------------------------------------
>>>>>>>>*/
>>>>>>>>
>>>>>>>>  temp_hashkey=(wtm) ? HashKey : ~HashKey;
>>>>>>>>  htable=trans_ref_a+((int) temp_hashkey&hash_maska);
>>>>>>>>  word1=htable->word1;
>>>>>>>>}
>>>>>>>>
>>>>>>>>Now crafty spends 2.8% of its time in HashProbe.
>>>>>>>
>>>>>>>Hi Wesley,
>>>>>>>
>>>>>>>that's interesting, it seems that preloading decreases the hash-latency.
>>>>>>>May be prefetching with Athlons's/Opteron's/P4's PREFETCHNTA, (bypassing
>>>>>>>L2-Cache) is even better.
>>>>>>>
>>>>>>>Gerd
>>>>>>
>>>>>>I'm sure it would be better. My code doesn't make it run any faster, it just
>>>>>>shows that the delay due to memory access is significant.
>>>>>>
>>>>>
>>>>>Can you tell me how you conclude this?
>>>>
>>>>The *only* effect of the code I added is to ensure that the depth-preferred part
>>>>of the hash table is put into cache, so any speedup in HashProbe is due to not
>>>>having a cache (and TLB) miss.
>>>
>>>I'm not sure you can even deduce that.  IE repetition check goes thru a long
>>>sequential list of memory values.  It is possible that they interact with the
>>>cache line with the hash entry.
>>>
>>>However, I now understand what you did.  You just replicated the code from
>>>HashProbe() to pre-fetch something that should be used again very quickly.
>>>
>>>>
>>>>>
>>>>>IE there are two parts in HashProbe();
>>>>>
>>>>>1.  probe "depth-preferred table".
>>>>>
>>>>>2.  probe "always-store" table".
>>>>>
>>>>>You are assuming that of the total 3.6% done in HashProbe(), that .8% is
>>>>>done in the always-store code.  Which means that .8% is done in the depth-
>>>>>preferred table, and the remaining time is memory latency.
>>>>>
>>>>>I don't think that is the explanation.
>>>>>
>>>>>Suppose _many_ hits occur in the depth-preferred table.  Then you won't be
>>>>>probing the always-store table at those positions.  And your .8% assumption
>>>>>is not so safe to make.  Unless you run huge searches with a small table,
>>>>>this effect will distort any possible conclusions.
>>>>>
>>>>>
>>>>>No way a single random access memory read is 3% of the total time spent
>>>>>doing a node.  There are way too many _other_ random-access reads done in
>>>>>crafty to make that possible.  The total time would go over 100%.
>>>>
>>>>At 1M nodes/sec, the time for 1 node is (obviously) 1 usec. The latency for one
>>>>cache miss is about 150 nsec. This implies that if you have *one* cache miss
>>>>every 4 nodes, you will spend 3% on that single random access memory read.
>>>>Apparently, caching works very well for crafty, except for HashProbe() ;).
>>>
>>>however, we know more than that.  IE 1usec/node is an _average_.  _every_
>>>node has a call to HashProbe().
>>
>>We've been here before. Quiesce nodes don't call HashProbe().
>
>They do probe the pawn hash table at every evaluated leaf, and 40% of the nodes
>are not qsearch.

I don't.  It is very likely that two consecutive nodes searched in the q-search
have the _same_ pawn hash signature.  I don't probe the table a second time for
that, I just use the previous values already found...  But how often this
happens is _not_ easy to guesstimate.


>
>For 40% of the nodes you do both a READ and a WRITE.
>
>A write is MORE expensive than a READ when done to a random location.

Why is that?


>
>Because after each node in the normal search you first go do a lot of work in the
>qsearch, you sure can't assume that this slot is still open in memory.
>
>So the WRITE is in 90% of the cases very expensive.
>
>So the faster the cpu the higher the % you lose to latency of RAM.
>
>We're talking about 10% for sure at your dual machine.

Not a _chance_.  HashProbe() in _total_ is not 10% of my search.  About 3% is
what most profile runs show...

Here is one short run:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 1.74     43.17     0.88  1576944     0.00     0.00  HashProbe

1.74% of _total_ time spent in HashProbe.  Not 10% for memory accessing for
hash probe, but 1.74% _total_ including the memory accessing as well as the
stuff HashProbe does internally.  This was with a hash table using 384M.

>
>When measured for SPECint on a single-cpu machine it was already nearly 10%
>of system time. As we know, the latency of a dual machine is quite a bit worse.
>

HashProbe() is _not_ 10% of SpecInt for Crafty.  I have the source for that
version handy.  I ran it above.  That was pretty old code (prior to my newer
single table approach).  It did the two-table approach that I used when the
SPEC stuff was done.

>Not comparable with quads.

Latency is latency.  Bandwidth is bandwidth.  Quads have the same latency as
duals and singles.  They usually have higher bandwidth due to 4-way
interleaving.

>
>Best regards,
>Vincent
>
>>>Which means _every_ node is likely going to
>>>see a cache miss for that particular memory read.  The entire point of the
>>>Zobrist hashing scheme is that a single change on the chess board produces a
>>>wild change in the hash signature, sending us to a different page of memory
>>>and the resulting hash miss.  If you use 450ns for a miss, you get one of
>>>those _every_ node.  What about the _other_ cache misses that happen?   one
>>>more and you are _done_ as you just exhausted the entire 1usec of time you
>>>have.  Conclusion?  it isn't taking _that_ long.
>>>
>>
>>A simple experiment is to run bench with the smallest and largest possible hash
>>sizes and compare the raw node times. The difference should be largely
>>attributable to memory latency. You may want to run it several times to see how
>>much variability there is between runs.
>>
>>[snip]




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.