Author: Robert Hyatt
Date: 21:19:33 07/23/03
On July 23, 2003 at 18:37:03, Vincent Diepeveen wrote:

>On July 23, 2003 at 12:58:56, J. Wesley Cleveland wrote:
>
>>On July 22, 2003 at 16:07:08, Robert Hyatt wrote:
>>
>>>On July 22, 2003 at 15:58:00, J. Wesley Cleveland wrote:
>>>
>>>>On July 22, 2003 at 14:27:45, Robert Hyatt wrote:
>>>>
>>>>>On July 22, 2003 at 12:28:39, J. Wesley Cleveland wrote:
>>>>>
>>>>>>On July 22, 2003 at 08:07:18, Gerd Isenberg wrote:
>>>>>>
>>>>>>>On July 21, 2003 at 15:35:17, J. Wesley Cleveland wrote:
>>>>>>>
>>>>>>>>On July 18, 2003 at 23:45:16, Robert Hyatt wrote:
>>>>>>>>
>>>>>>>>>On July 18, 2003 at 21:58:18, J. Wesley Cleveland wrote:
>>>>>>>>>
>>>>>>>>>>On July 18, 2003 at 21:17:14, Robert Hyatt wrote:
>>>>>>>>>>
>>>>>>>>>>>On July 18, 2003 at 15:21:35, J. Wesley Cleveland wrote:
>>>>>>>>>>>
>>>>>>>>>>>>On July 17, 2003 at 18:25:51, Robert Hyatt wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>On July 17, 2003 at 17:35:33, Dieter Buerssner wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>[snip]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>I cannot find any randomness in the reads of lmbench (I downloaded the latest
>>>>>>>>>>>>>>stable source today, not the experimental version, which is available too). If it
>>>>>>>>>>>>>>did random reads, it would have no way to avoid the problem with the TLBs you
>>>>>>>>>>>>>>explained.
>>>>>>>>>>>>>
>>>>>>>>>>>>>4M pages solve it for at least 250MB worth of RAM. But then again, _no_ chess
>>>>>>>>>>>>>program depends on purely random memory accesses to blow out the TLB. The only
>>>>>>>>>>>>>truly random accesses I do are the regular hashing and pawn hashing, which
>>>>>>>>>>>>>both total significantly less than the total nodes I search. Which means
>>>>>>>>>>>>>the TLB penalty is not even 1% of my total run time. Probably closer to
>>>>>>>>>>>>>.01% - .05%.
>>>>>>>>>>>>>
>>>>>>>>>>>>>I ignore that.
>>>>>>>>>>>>
>>>>>>>>>>>>Why do you think it is that low? I get hash probes at ~20-30% of nodes with
>>>>>>>>>>>>crafty.
>>>>>>>>>>>
>>>>>>>>>>>Look at the code.
>>>>>>>>>>
>>>>>>>>>>I not only looked at the code. I *instrumented* it. I won't have complete
>>>>>>>>>>results until Monday, but it appears that crafty spends 3-5% of its total time
>>>>>>>>>>inside HashProbe on my (slow) machine, and a prefetch could reduce that by about
>>>>>>>>>>half.
>>>>>>>>>>
>>>>>>>>>>>Crafty probes memory _once_ for a hash probe. That
>>>>>>>>>>>introduces a memory access penalty once per node in the basic search,
>>>>>>>>>>>less than once per node in the q-search (I only probe phash there, and
>>>>>>>>>>>only at about 25% of the q-search nodes I visit).
>>>>>>>>>>
>>>>>>>>>>If you had read what I wrote, you would see I said crafty does a hash probe
>>>>>>>>>>at 20-30% of its total nodes.
>>>>>>>>>
>>>>>>>>>OK. I clearly mis-read what you meant. The 20-30% was eye-catching, as that
>>>>>>>>>is a pretty common hash hit percentage as well...
>>>>>>>>>
>>>>>>>>>>>As a result, you get less than one probe per node searched. A node searched
>>>>>>>>>>>requires something on the order of 3000-5000 instructions. What percentage
>>>>>>>>>>>of that 3K-5K instruction timing is that single hash probe? Almost zero.
>>>>>>>>>>
>>>>>>>>>>Except that a fast machine may do these 3-5K instructions in <1 usec. A cache
>>>>>>>>>>miss + a TLB miss may take 300-400 ns. I would not call 30% almost 0.
>>>>>>>>>
>>>>>>>>>You are missing my point. In the position(s) you tested, you saw 20-30%
>>>>>>>>>hash probes. That means one probe for every 3-5 nodes. At 1M nodes
>>>>>>>>>per second, that is 200K-300K probes per second. If you measure the
>>>>>>>>>time spent in searching a single node, multiply that by 3-5X, and then compare
>>>>>>>>>that to the hash probe time, the time spent probing the hash table is low.
>>>>>>>>>
>>>>>>>>>Note that your 5% is _not_ the total time used to probe the table. It is
>>>>>>>>>the time to probe the table, and do it _twice_, although the second probe
>>>>>>>>>doesn't have any memory access penalty associated with it in most cases.
>>>>>>>>>
>>>>>>>>>So a big percent of that 5% is the actual work done in HashProbe(),
>>>>>>>>>rather than being all memory access penalty...
>>>>>>>>
>>>>>>>>I ran some tests on my slow (450 MHz) machine. Hash was set to 192MB. The test
>>>>>>>>was 21 middle-game positions and ran for nearly 1 hour. Crafty got between 125k
>>>>>>>>and 230k nps. Crafty spent 3.6% of total time in HashProbe. I added the
>>>>>>>>following code just before the call to RepetitionCheck() in search.c (slightly
>>>>>>>>modified from the code in hash.c). Note that the code is basically a no-op, as
>>>>>>>>all variables are local.
>>>>>>>>
>>>>>>>>{
>>>>>>>>  static BITBOARD word1;
>>>>>>>>  BITBOARD temp_hashkey;
>>>>>>>>  HASH_ENTRY *htable;
>>>>>>>>/*
>>>>>>>> ----------------------------------------------------------
>>>>>>>>|                                                          |
>>>>>>>>|   first, compute the initial hash address and choose     |
>>>>>>>>|   which hash table (based on color) to probe.            |
>>>>>>>>|                                                          |
>>>>>>>> ----------------------------------------------------------
>>>>>>>>*/
>>>>>>>>  temp_hashkey=(wtm) ? HashKey : ~HashKey;
>>>>>>>>  htable=trans_ref_a+((int) temp_hashkey&hash_maska);
>>>>>>>>  word1=htable->word1;
>>>>>>>>}
>>>>>>>>
>>>>>>>>Now crafty spends 2.8% of its time in HashProbe.
>>>>>>>
>>>>>>>Hi Wesley,
>>>>>>>
>>>>>>>That's interesting; it seems that preloading decreases the hash latency.
>>>>>>>Maybe prefetching with the Athlon's/Opteron's/P4's PREFETCHNTA (bypassing
>>>>>>>the L2 cache) is even better.
>>>>>>>
>>>>>>>Gerd
>>>>>>
>>>>>>I'm sure it would be better. My code doesn't make it run any faster; it just
>>>>>>shows that the delay due to memory access is significant.
>>>>>>
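For what it's worth, here is a minimal sketch of the PREFETCHNTA idea Gerd
mentions, assuming the SSE _mm_prefetch intrinsic and stub declarations for the
Crafty-style names in the quoted snippet. This is an illustration, not Crafty's
actual code:

  #include <xmmintrin.h>          /* _mm_prefetch (SSE) */

  typedef unsigned long long BITBOARD;
  typedef struct { BITBOARD word1, word2; } HASH_ENTRY;

  /* stubs standing in for the real Crafty globals */
  extern BITBOARD HashKey;
  extern HASH_ENTRY *trans_ref_a;
  extern int hash_maska;

  void prefetch_hash_entry(int wtm) {
    BITBOARD temp_hashkey = wtm ? HashKey : ~HashKey;
    HASH_ENTRY *htable = trans_ref_a + ((int) temp_hashkey & hash_maska);

    /* PREFETCHNTA pulls the line toward the core while minimizing cache
       pollution; a transposition-table entry is typically touched once. */
    _mm_prefetch((const char *) htable, _MM_HINT_NTA);

    /* The search then does its other per-node work (repetition check,
       move generation); by the time HashProbe() runs, the line should
       already be in flight, hiding most of the miss latency. */
  }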
>>>>>
>>>>>Can you tell me how you conclude this?
>>>>
>>>>The *only* effect of the code I added is to ensure that the depth-preferred part
>>>>of the hash table is put into cache, so any speedup in HashProbe is due to not
>>>>having a cache (and TLB) miss.
>>>
>>>I'm not sure you can even deduce that. IE repetition check goes through a long
>>>sequential list of memory values. It is possible that they interact with the
>>>cache line holding the hash entry.
>>>
>>>However, I now understand what you did. You just replicated the code from
>>>HashProbe() to pre-fetch something that should be used again very quickly.
>>>
>>>>>IE there are two parts in HashProbe():
>>>>>
>>>>>1. probe the "depth-preferred" table.
>>>>>
>>>>>2. probe the "always-store" table.
>>>>>
>>>>>You are assuming that of the total 3.6% spent in HashProbe(), .8% is
>>>>>done in the always-store code. Which means that .8% is done in the depth-
>>>>>preferred table, and the remaining time is memory latency.
>>>>>
>>>>>I don't think that is the explanation.
>>>>>
>>>>>Suppose _many_ hits occur in the depth-preferred table. Then you won't be
>>>>>probing the always-store table at those positions. And your .8% assumption
>>>>>is not so safe to make. Unless you run huge searches with a small table,
>>>>>this effect will distort any possible conclusions.
>>>>>
>>>>>No way a single random-access memory read is 3% of the total time spent
>>>>>doing a node. There are way too many _other_ random-access reads done in
>>>>>crafty to make that possible. The total time would go over 100%.
>>>>
>>>>At 1M nodes/sec, the time for 1 node is (obviously) 1 usec. The latency for one
>>>>cache miss is about 150 nsec. This implies that if you have *one* cache miss
>>>>every 4 nodes, you will spend 3% on that single random-access memory read.
>>>>Apparently, caching works very well for crafty, except for HashProbe() ;).
>>>
>>>However, we know more than that. IE 1 usec/node is an _average_. _Every_
>>>node has a call to HashProbe().
>>
>>We've been here before. Quiesce nodes don't call HashProbe().
>
>They do probe the pawn hash table at every evaluated leaf, and 40% of the
>nodes are not q-search.

I don't. It is very likely that two consecutive nodes searched in the q-search
have the _same_ pawn hash signature. I don't probe the table a second time for
that; I just use the previous values already found... But how often this
happens is _not_ easy to guesstimate.
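A minimal sketch of that reuse, with hypothetical names rather than Crafty's
actual code: remember the last pawn hash key and score, and skip the memory
access entirely whenever the signature has not changed:

  #include <limits.h>

  typedef unsigned long long BITBOARD;

  typedef struct {
    BITBOARD key;
    int score;                        /* cached pawn-structure evaluation */
  } PAWN_HASH_ENTRY;

  /* hypothetical stand-ins for the real pawn hash table */
  extern PAWN_HASH_ENTRY *pawn_hash_table;
  extern int pawn_hash_mask;

  static BITBOARD last_pawn_key = (BITBOARD) -1;  /* sentinel: nothing cached */
  static int last_pawn_score;

  int probe_pawn_hash(BITBOARD pawn_key) {
    PAWN_HASH_ENTRY *entry;

    /* same pawn signature as the previous node: no memory access at all */
    if (pawn_key == last_pawn_key)
      return last_pawn_score;

    /* otherwise, one random-access read into the table */
    entry = pawn_hash_table + ((int) pawn_key & pawn_hash_mask);
    if (entry->key == pawn_key) {
      last_pawn_key = pawn_key;
      last_pawn_score = entry->score;
      return last_pawn_score;
    }
    return INT_MIN;                   /* miss: caller evaluates and stores */
  }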
>
>For 40% of the nodes you do both a READ and a WRITE.
>
>A write is MORE expensive than a READ when done to a random location.

Why is that?

>Because after each node in the normal search you first go do a lot of stuff
>in the q-search, so you sure can't assume that this slot is still open in
>memory.
>
>So the WRITE is very expensive in 90% of the cases.
>
>So the faster the CPU, the higher the % you lose to RAM latency.
>
>We're talking about 10% for sure on your dual machine.

Not a _chance_. HashProbe() in _total_ is not 10% of my search. About 3% is
what most profile runs show... Here is one short run:

  1.74   43.17   0.88   1576944   0.00   0.00   HashProbe

1.74% of _total_ time spent in HashProbe. Not 10% for memory accesses for the
hash probe, but 1.74% _total_, including the memory accesses as well as the
stuff HashProbe does internally. This was with a hash table using 384M.

>
>When measured for SPECint on a single-cpu machine it was already nearly 10%
>of system time. As we know, the latency of a dual machine is quite a bit
>slower.
>

HashProbe() is _not_ 10% of SPECint for Crafty. I have the source for that
version handy. I ran it above. That was pretty old code (prior to my newer
single-table approach). It did the two-table approach that I used when the
SPEC stuff was done.

>Not comparable with quads.

Latency is latency. Bandwidth is bandwidth. Quads have the same latency as
duals and singles. They usually have higher bandwidth due to 4-way
interleaving.

>
>Best regards,
>Vincent
>
>>>Which means _every_ node is likely going to
>>>see a cache miss for that particular memory read. The entire point of the
>>>Zobrist hashing scheme is that a single change on the chess board produces a
>>>wild change in the hash signature, sending us to a different page of memory
>>>and the resulting cache miss. If you use 450 ns for a miss, you get one of
>>>those _every_ node. What about the _other_ cache misses that happen? One
>>>more and you are _done_, as you just exhausted the entire 1 usec of time you
>>>have. Conclusion? It isn't taking _that_ long.
>>
>>A simple experiment is to run bench with the smallest and largest possible hash
>>sizes and compare the raw node times. The difference should be largely
>>attributable to memory latency. You may want to run it several times to see how
>>much variability there is between runs.
>>
>>[snip]
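A minimal sketch of the Zobrist scheme described in the quote above, with
hypothetical names: one random 64-bit number per (piece, square), the position
key being the XOR of the numbers for all pieces on the board. Making or
unmaking one move XORs only the numbers that changed, yet flips roughly half
the bits of the key, which is exactly why consecutive probes land on unrelated
cache lines and pages:

  #include <stdint.h>
  #include <stdlib.h>

  #define PIECES  12                  /* 6 piece types x 2 colors */
  #define SQUARES 64

  static uint64_t zobrist[PIECES][SQUARES];

  /* crude portable 64-bit random source, good enough for a sketch */
  static uint64_t rand64(void) {
    uint64_t r = 0;
    int i;
    for (i = 0; i < 4; i++)
      r = (r << 16) ^ (uint64_t) (rand() & 0xffff);
    return r;
  }

  void init_zobrist(void) {
    int p, s;
    for (p = 0; p < PIECES; p++)
      for (s = 0; s < SQUARES; s++)
        zobrist[p][s] = rand64();
  }

  /* Incremental update for a simple (non-capture) move: XOR out the old
     square, XOR in the new one.  Roughly half the key's bits flip, so the
     next probe lands on an unrelated table address (and usually page). */
  uint64_t update_key(uint64_t key, int piece, int from, int to) {
    return key ^ zobrist[piece][from] ^ zobrist[piece][to];
  }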