Author: Vincent Diepeveen
Date: 15:37:03 07/23/03
On July 23, 2003 at 12:58:56, J. Wesley Cleveland wrote:
>On July 22, 2003 at 16:07:08, Robert Hyatt wrote:
>
>>On July 22, 2003 at 15:58:00, J. Wesley Cleveland wrote:
>>
>>>On July 22, 2003 at 14:27:45, Robert Hyatt wrote:
>>>
>>>>On July 22, 2003 at 12:28:39, J. Wesley Cleveland wrote:
>>>>
>>>>>On July 22, 2003 at 08:07:18, Gerd Isenberg wrote:
>>>>>
>>>>>>On July 21, 2003 at 15:35:17, J. Wesley Cleveland wrote:
>>>>>>
>>>>>>>On July 18, 2003 at 23:45:16, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>On July 18, 2003 at 21:58:18, J. Wesley Cleveland wrote:
>>>>>>>>
>>>>>>>>>On July 18, 2003 at 21:17:14, Robert Hyatt wrote:
>>>>>>>>>
>>>>>>>>>>On July 18, 2003 at 15:21:35, J. Wesley Cleveland wrote:
>>>>>>>>>>
>>>>>>>>>>>On July 17, 2003 at 18:25:51, Robert Hyatt wrote:
>>>>>>>>>>>
>>>>>>>>>>>>On July 17, 2003 at 17:35:33, Dieter Buerssner wrote:
>>>>>>>>>>>>
>>>>>>>>>>>[snip]
>>>>>>>>>>>>>
>>>>>>>>>>>>>I cannot find any randomness in the reads of lm-bench (I downloaded latest
>>>>>>>>>>>>>stable source today, not the experimental version, available, too). If it would
>>>>>>>>>>>>>do random reads, it would have no way to avoid the problem with the TLBs you
>>>>>>>>>>>>>explained.
>>>>>>>>>>>>
>>>>>>>>>>>>4M pages solves it for at least 250mb worth of RAM. But then again, _no_ chess
>>>>>>>>>>>>program depends on purely random memory accesses to blow out the TLB. The only
>>>>>>>>>>>>truly random accesses I do are the regular hashing and pawn hashing, which
>>>>>>>>>>>>both total to significantly less than the total nodes I search. Which means
>>>>>>>>>>>>the TLB penalty is not even 1% of my total run time. Probably closer to
>>>>>>>>>>>>.01% - .05%.
>>>>>>>>>>>>
>>>>>>>>>>>>I ignore that.
>>>>>>>>>>>
>>>>>>>>>>>Why do you think it is that low? I see hash probes at ~20-30% of nodes with
>>>>>>>>>>>crafty.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>Look at the code.
>>>>>>>>>I not only looked at the code. I *instrumented it*. I won't have complete
>>>>>>>>>results until Monday, but it appears that crafty spends 3-5% of its total time
>>>>>>>>>inside hashprobe on my (slow) machine and a prefetch could reduce that by about
>>>>>>>>>half.
>>>>>>>>>
>>>>>>>>>>Crafty probes memory _once_ for a hash probe. That
>>>>>>>>>>introduces a memory access penalty once per node in the basic search,
>>>>>>>>>>less than once per node in the q-search (I only probe phash there and I
>>>>>>>>>>don't probe it but about 25% of the q-search nodes I visit).
>>>>>>>>>
>>>>>>>>>If you had read what I wrote, you would see I said crafty does a hash probe
>>>>>>>>>20-30% of its total nodes.
>>>>>>>>
>>>>>>>>OK. I clearly misread what you meant. The 20-30% was eye-catching as that
>>>>>>>>is a pretty common hash hit percentage as well...
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>As a result, you get less than one probe per node searched. A node searched
>>>>>>>>>>requires something on the order of 3000-5000 instructions. What percentage
>>>>>>>>>>of that 3K-5K instruction timing is that single hash probe? Almost zero.
>>>>>>>>>
>>>>>>>>>Except that a fast machine may do these 3-5K instructions in <1usec. A cache
>>>>>>>>>miss + a TLB miss may take 300-400 ns. I would not call 30% almost 0.
>>>>>>>>
>>>>>>>>You are missing my point. In the position(s) you tested, you saw 20-30%
>>>>>>>>hash probes. That means one probe for every 3-5 nodes. At 1M nodes
>>>>>>>>per second, that is 200K-300K probes per second. If you measure the
>>>>>>>>time spent in searching a single node, multiply that by 3-5X, then compare
>>>>>>>>that to the hash probe time, the time spent probing the hash table is low.
>>>>>>>>
>>>>>>>>Note that your 5% is _not_ the total time used to probe the table. It is
>>>>>>>>the time to probe the table, and do it _twice_ although the second probe
>>>>>>>>doesn't have any memory access penalty associated with it in most cases.
>>>>>>>>
>>>>>>>>So a big percent of that 5% is doing the actual work done in HashProbe(),
>>>>>>>>rather than being all memory access penalty...
>>>>>>>
>>>>>>>I ran some tests on my slow (450 Mhz) machine. Hash was set to 192Mb. The test
>>>>>>>was 21 middle-game positions and ran for nearly 1 hour. Crafty got between 125k
>>>>>>>and 230k nps. Crafty spent 3.6% of total time in HashProbe. I added the
>>>>>>>following code just before the call to RepetitionCheck() in search.c (slightly
>>>>>>>modified from the code in hash.c). Note that the code is basically a no-op as
>>>>>>>all variables are local.
>>>>>>>
>>>>>>>{
>>>>>>> static BITBOARD word1;
>>>>>>> BITBOARD temp_hashkey;
>>>>>>> HASH_ENTRY *htable;
>>>>>>>/*
>>>>>>> ----------------------------------------------------------
>>>>>>>| |
>>>>>>>| first, compute the initial hash address and choose |
>>>>>>>| which hash table (based on color) to probe. |
>>>>>>>| |
>>>>>>> ----------------------------------------------------------
>>>>>>>*/
>>>>>>>
>>>>>>> temp_hashkey=(wtm) ? HashKey : ~HashKey;
>>>>>>> htable=trans_ref_a+((int) temp_hashkey&hash_maska);
>>>>>>> word1=htable->word1;
>>>>>>>}
>>>>>>>
>>>>>>>Now crafty spends 2.8% of its time in HashProbe.
>>>>>>
>>>>>>Hi Wesley,
>>>>>>
>>>>>>that's interesting; it seems that preloading decreases the hash latency.
>>>>>>Maybe prefetching with the Athlon's/Opteron's/P4's PREFETCHNTA (bypassing the
>>>>>>L2 cache) is even better.
>>>>>>
>>>>>>Gerd
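
(For reference, a minimal sketch of what such a prefetch could look like with the
SSE intrinsic. The PrefetchHashEntry name and the HASH_ENTRY layout here are my
own assumptions; trans_ref_a and hash_maska mirror the snippet quoted above.)

#include <xmmintrin.h>            /* _mm_prefetch, _MM_HINT_NTA */

typedef unsigned long long BITBOARD;                   /* as in Crafty   */
typedef struct { BITBOARD word1, word2; } HASH_ENTRY;  /* layout assumed */

extern HASH_ENTRY *trans_ref_a;   /* depth-preferred table (see snippet) */
extern int hash_maska;            /* its index mask (see snippet)        */

/* Touch the depth-preferred entry early, so the cache line is already on
   its way by the time HashProbe() reads it.  PREFETCHNTA uses the
   non-temporal hint to limit cache pollution. */
static void PrefetchHashEntry(BITBOARD HashKey, int wtm)
{
  BITBOARD temp_hashkey = wtm ? HashKey : ~HashKey;
  HASH_ENTRY *htable = trans_ref_a + ((int) temp_hashkey & hash_maska);

  _mm_prefetch((const char *) htable, _MM_HINT_NTA);
}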
>>>>>
>>>>>I'm sure it would be better. My code doesn't make it run any faster, it just
>>>>>shows that the delay due to memory access is significant.
>>>>>
>>>>
>>>>Can you tell me how you conclude this?
>>>
>>>The *only* effect of the code I added is to ensure that the depth-preferred part
>>>of the hash table is put into cache, so any speedup in HashProbe is due to not
>>>having a cache (and TLB) miss.
>>
>>I'm not sure you can even deduce that. IE repetition check goes thru a long
>>sequential list of memory values. It is possible that they interact with the
>>cache line with the hash entry.
>>
>>However, I now understand what you did. You just replicated the code from
>>HashProbe() to pre-fetch something that should be used again very quickly.
>>
>>>
>>>>
>>>>IE there are two parts in HashProbe();
>>>>
>>>>1. probe "depth-preferred table".
>>>>
>>>>2. probe "always-store" table".
>>>>
>>>>You are assuming that of the total 3.6% done in HashProbe(), that .8% is
>>>>done in the always-store code. Which means that .8% is done in the depth-
>>>>preferred table, and the remaining time is memory latency.
>>>>
>>>>I don't think that is the explanation.
>>>>
>>>>Suppose _many_ hits occur in the depth-preferred table. Then you won't be
>>>>probing the always-store table at those positions. And your .8% assumption
>>>>is not so safe to make. Unless you run huge searches with a small table,
>>>>this effect will distort any possible conclusions.
>>>>
>>>>
>>>>No way a single random access memory read is 3% of the total time spent
>>>>doing a node. There are way too many _other_ random-access reads done in
>>>>crafty to make that possible. The total time would go over 100%.
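
(As an aside, a rough sketch of the two-tier probe described above; the names and
entry layout are illustrative, not Crafty's source. The point is that the
always-store table is only read when the depth-preferred table misses, so its
share of the measured time depends on the tier-1 hit rate.)

typedef unsigned long long BITBOARD;
typedef struct { BITBOARD word1, word2; } HASH_ENTRY;

extern HASH_ENTRY *depth_preferred;   /* tier 1: replaced only by deeper drafts */
extern HASH_ENTRY *always_store;      /* tier 2: always overwritten             */
extern unsigned int mask_a, mask_b;   /* table sizes minus one (powers of two)  */

/* Returns the matching entry, or NULL if neither table has this position. */
HASH_ENTRY *Probe(BITBOARD key)
{
  HASH_ENTRY *h = &depth_preferred[(unsigned int) key & mask_a];
  if (h->word2 == key)
    return h;                         /* hit in tier 1: tier 2 never touched */
  h = &always_store[(unsigned int) key & mask_b];
  if (h->word2 == key)
    return h;                         /* hit in tier 2                       */
  return NULL;                        /* miss in both                        */
}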
>>>
>>>At 1M nodes/sec, the time for 1 node is (obviously) 1 usec. The latency for one
>>>cache miss is about 150 nsec. This implies that if you have *one* cache miss
>>>every 4 nodes, you will spend 3% on that single random access memory read.
>>>Apparently, caching works very well for crafty, except for HashProbe() ;)
>>
>>however, we know more than that. IE 1usec/node is an _average_. _every_
>>node has a call to HashProbe().
>
>We've been here before. Quiesce nodes don't call HashProbe().
They do probe the pawn hash table at every evaluated leaf, and about 40% of the
nodes are not qsearch nodes.
For those 40% of the nodes you do both a READ and a WRITE.
A write is MORE expensive than a READ when it goes to a random location.
Because after each node in the normal search you first go off and do a lot of
work in the qsearch, you certainly cannot assume that the memory page for this
slot is still open.
So in about 90% of the cases the WRITE is very expensive.
So the faster the cpu, the higher the percentage you lose to RAM latency.
We're talking about 10% for sure on your dual machine.
Measured against SPECint on a single-cpu machine it was already nearly 10% of
system time, and as we know the latency of a dual machine is quite a bit worse.
Not comparable with quads.
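
To make the READ/WRITE point concrete, here is a rough probe-then-store sketch
(my own illustration, not Crafty's actual code): the read happens when the node
is entered and the write when the node is done, with the whole qsearch in
between, so the line is usually gone from cache again by the time of the store.

typedef unsigned long long BITBOARD;

typedef struct {
  BITBOARD word1;                 /* packed draft / bound / score / move */
  BITBOARD word2;                 /* stored key for the hit test         */
} HASH_ENTRY;

extern HASH_ENTRY *trans_ref;     /* transposition table                 */
extern unsigned int hash_mask;    /* table size minus one (power of 2)   */

int ProbeEntry(BITBOARD key, HASH_ENTRY *out)
{
  HASH_ENTRY *h = &trans_ref[(unsigned int) key & hash_mask];
  *out = *h;                      /* READ: first random access           */
  return (out->word2 == key);     /* hit only if the stored key matches  */
}

void StoreEntry(BITBOARD key, BITBOARD packed_result)
{
  HASH_ENTRY *h = &trans_ref[(unsigned int) key & hash_mask];
  h->word1 = packed_result;       /* WRITE: same random line, much later */
  h->word2 = key;
}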
Best regards,
Vincent
>>Which means _every_ node is likely going to
>>see a cache miss for that particular memory read. The entire point of the
>>Zobrist hashing scheme is that a single change on the chess board produces a
>>wild change in the hash signature, sending us to a different page of memory
>>and the resulting hash miss. If you use 450ns for a miss, you get one of
>>those _every_ node. What about the _other_ cache misses that happen? one
>>more and you are _done_ as you just exhausted the entire 1usec of time you
>>have. Conclusion? it isn't taking _that_ long.
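
(A tiny Zobrist sketch of why consecutive probes land on unrelated pages: the
signature is the XOR of one random 64-bit key per piece and square, so moving a
single piece flips roughly half the bits. The table name and dimensions here
are illustrative, not Crafty's.)

#include <stdint.h>

extern uint64_t zobrist[12][64];      /* one random key per piece per square */

uint64_t UpdateSignature(uint64_t sig, int piece, int from, int to)
{
  sig ^= zobrist[piece][from];        /* remove the piece from its square */
  sig ^= zobrist[piece][to];          /* put it on the destination square */
  return sig;                         /* next probe indexes an unrelated line */
}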
>>
>
>A simple experiment is to run bench with the smallest and largest possible hash
>sizes and compare the raw node times. The difference should be largely
>attributable to memory latency. You may want to run it several times to see how
>much variability there is between runs.
>
>[snip]
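
For what it's worth, here is a stand-alone sketch of that measurement idea: time
random reads from a table that fits in cache versus one that does not, and
attribute the difference to memory/TLB latency. Nothing here is Crafty-specific;
the table sizes and the xorshift generator are arbitrary choices.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef unsigned long long u64;

/* Time a fixed number of random reads from a table of 'entries' u64s
   (entries must be a power of two). */
static double time_random_reads(const u64 *table, size_t entries, long probes)
{
  u64 key = 0x9E3779B97F4A7C15ULL, sink = 0;
  clock_t start = clock();

  for (long i = 0; i < probes; i++) {
    key ^= key << 13; key ^= key >> 7; key ^= key << 17;   /* xorshift64   */
    sink += table[(size_t) key & (entries - 1)];           /* random probe */
  }
  if (sink == 42)
    puts("unlikely");              /* keep the loop from being optimized out */
  return (double) (clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
  const long probes = 20 * 1000 * 1000;
  size_t small_entries = 1 << 16;   /* 512 KB of u64s: mostly cache hits   */
  size_t large_entries = 1 << 24;   /* 128 MB of u64s: mostly cache misses */
  u64 *t_small = calloc(small_entries, sizeof(u64));
  u64 *t_large = calloc(large_entries, sizeof(u64));

  if (!t_small || !t_large)
    return 1;
  printf("small table: %.2f s\n", time_random_reads(t_small, small_entries, probes));
  printf("large table: %.2f s\n", time_random_reads(t_large, large_entries, probes));
  free(t_small);
  free(t_large);
  return 0;
}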