Author: Vincent Diepeveen
Date: 15:37:03 07/23/03
On July 23, 2003 at 12:58:56, J. Wesley Cleveland wrote:

>On July 22, 2003 at 16:07:08, Robert Hyatt wrote:
>
>>On July 22, 2003 at 15:58:00, J. Wesley Cleveland wrote:
>>
>>>On July 22, 2003 at 14:27:45, Robert Hyatt wrote:
>>>
>>>>On July 22, 2003 at 12:28:39, J. Wesley Cleveland wrote:
>>>>
>>>>>On July 22, 2003 at 08:07:18, Gerd Isenberg wrote:
>>>>>
>>>>>>On July 21, 2003 at 15:35:17, J. Wesley Cleveland wrote:
>>>>>>
>>>>>>>On July 18, 2003 at 23:45:16, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>On July 18, 2003 at 21:58:18, J. Wesley Cleveland wrote:
>>>>>>>>
>>>>>>>>>On July 18, 2003 at 21:17:14, Robert Hyatt wrote:
>>>>>>>>>
>>>>>>>>>>On July 18, 2003 at 15:21:35, J. Wesley Cleveland wrote:
>>>>>>>>>>
>>>>>>>>>>>On July 17, 2003 at 18:25:51, Robert Hyatt wrote:
>>>>>>>>>>>
>>>>>>>>>>>>On July 17, 2003 at 17:35:33, Dieter Buerssner wrote:
>>>>>>>>>>>>
>>>>>>>>>>>[snip]
>>>>>>>>>>>>>
>>>>>>>>>>>>>I cannot find any randomness in the reads of lm-bench (I downloaded the latest stable source today, not the experimental version, which is available too). If it did random reads, it would have no way to avoid the problem with the TLBs you explained.
>>>>>>>>>>>>
>>>>>>>>>>>>4M pages solve it for at least 250MB worth of RAM. But then again, _no_ chess program depends on purely random memory accesses to blow out the TLB. The only truly random accesses I do are the regular hashing and pawn hashing, which together total significantly less than the total nodes I search. Which means the TLB penalty is not even 1% of my total run time. Probably closer to .01%-.05%.
>>>>>>>>>>>>
>>>>>>>>>>>>I ignore that.
>>>>>>>>>>>
>>>>>>>>>>>Why do you think it is that low? I get hash probes on ~20-30% of nodes with crafty.
>>>>>>>>>>
>>>>>>>>>>Look at the code.
>>>>>>>>>
>>>>>>>>>I not only looked at the code. I *instrumented* it. I won't have complete results until Monday, but it appears that crafty spends 3-5% of its total time inside HashProbe on my (slow) machine, and a prefetch could reduce that by about half.
>>>>>>>>>
>>>>>>>>>>Crafty probes memory _once_ for a hash probe. That introduces a memory access penalty once per node in the basic search, less than once per node in the q-search (I only probe phash there, and I don't probe it but about 25% of the q-search nodes I visit).
>>>>>>>>>
>>>>>>>>>If you had read what I wrote, you would see I said crafty does a hash probe on 20-30% of its total nodes.
>>>>>>>>
>>>>>>>>OK. I clearly mis-read what you meant. The 20-30% was eye-catching, as that is a pretty common hash hit percentage as well...
>>>>>>>>
>>>>>>>>>>As a result, you get less than one probe per node searched. A node searched requires something on the order of 3000-5000 instructions. What percentage of that 3K-5K instruction timing is that single hash probe? Almost zero.
>>>>>>>>>
>>>>>>>>>Except that a fast machine may do these 3-5K instructions in <1 usec. A cache miss + a TLB miss may take 300-400 ns. I would not call 30% almost 0.
>>>>>>>>
>>>>>>>>You are missing my point. In the position(s) you tested, you saw 20-30% hash probes. That means one probe for every 3-5 nodes. At 1M nodes per second, that is 200K-300K probes per second. If you measure the time spent in searching a single node, multiply that by 3-5X, then compare that to the hash probe time, the time spent probing the hash table is low.
>>>>>>>>
>>>>>>>>Note that your 5% is _not_ the total time used to probe the table. It is the time to probe the table, and do it _twice_, although the second probe doesn't have any memory access penalty associated with it in most cases.
>>>>>>>>
>>>>>>>>So a big percent of that 5% is doing the actual work done in HashProbe(), rather than being all memory access penalty...
>>>>>>>
>>>>>>>I ran some tests on my slow (450 MHz) machine. Hash was set to 192MB. The test was 21 middle-game positions and ran for nearly 1 hour. Crafty got between 125k and 230k nps. Crafty spent 3.6% of total time in HashProbe. I added the following code just before the call to RepetitionCheck() in search.c (slightly modified from the code in hash.c). Note that the code is basically a no-op, as all variables are local.
>>>>>>>
>>>>>>>{
>>>>>>>  static BITBOARD word1;
>>>>>>>  BITBOARD temp_hashkey;
>>>>>>>  HASH_ENTRY *htable;
>>>>>>>/*
>>>>>>> ----------------------------------------------------------
>>>>>>>|                                                          |
>>>>>>>|   first, compute the initial hash address and choose     |
>>>>>>>|   which hash table (based on color) to probe.            |
>>>>>>>|                                                          |
>>>>>>> ----------------------------------------------------------
>>>>>>>*/
>>>>>>>  temp_hashkey=(wtm) ? HashKey : ~HashKey;
>>>>>>>  htable=trans_ref_a+((int) temp_hashkey&hash_maska);
>>>>>>>  word1=htable->word1;
>>>>>>>}
>>>>>>>
>>>>>>>Now crafty spends 2.8% of its time in HashProbe.
>>>>>>
>>>>>>Hi Wesley,
>>>>>>
>>>>>>that's interesting; it seems that preloading decreases the hash latency. Maybe prefetching with the Athlon's/Opteron's/P4's PREFETCHNTA (bypassing the L2 cache) is even better.
>>>>>>
>>>>>>Gerd
>>>>>
>>>>>I'm sure it would be better. My code doesn't make it run any faster; it just shows that the delay due to memory access is significant.
>>>>
>>>>Can you tell me how you conclude this?
>>>
>>>The *only* effect of the code I added is to ensure that the depth-preferred part of the hash table is put into cache, so any speedup in HashProbe is due to not having a cache (and TLB) miss.
>>
>>I'm not sure you can even deduce that. IE repetition check goes thru a long sequential list of memory values.
>>It is possible that they interact with the cache line with the hash entry.
>>
>>However, I now understand what you did. You just replicated the code from HashProbe() to pre-fetch something that should be used again very quickly.
>>
>>>
>>>>
>>>>IE there are two parts in HashProbe():
>>>>
>>>>1. probe the "depth-preferred" table.
>>>>
>>>>2. probe the "always-store" table.
>>>>
>>>>You are assuming that of the total 3.6% done in HashProbe(), .8% is done in the always-store code. Which means that .8% is done in the depth-preferred table, and the remaining time is memory latency.
>>>>
>>>>I don't think that is the explanation.
>>>>
>>>>Suppose _many_ hits occur in the depth-preferred table. Then you won't be probing the always-store table at those positions. And your .8% assumption is not so safe to make. Unless you run huge searches with a small table, this effect will distort any possible conclusions.
>>>>
>>>>No way a single random-access memory read is 3% of the total time spent doing a node. There are way too many _other_ random-access reads done in crafty to make that possible. The total time would go over 100%.
>>>
>>>At 1M nodes/sec, the time for 1 node is (obviously) 1 usec. The latency for one cache miss is about 150 nsec. This implies that if you have *one* cache miss every 4 nodes, you will spend 3% on that single random-access memory read. Apparently, caching works very well for crafty, except for HashProbe ;).
>>
>>however, we know more than that. IE 1 usec/node is an _average_. _every_ node has a call to HashProbe().
>
>We've been here before. Quiesce nodes don't call HashProbe().

They do call pawnhashtable at every evaluated leaf, and 40% of the nodes are not qsearch. For 40% of the nodes you do both a READ and a WRITE, and a write is MORE expensive than a READ when done to a random location. Because after each node in normal search you first go do a lot of stuff in qsearch, you sure can't assume that this slot is still open to memory. So the WRITE is in 90% of the cases very expensive. So the faster the cpu, the higher the % you lose to latency of RAM. We're talking about 10% for sure at your dual machine. When measured for SPECint at a single-cpu machine it was already nearly 10% of system time. As we know, latency of a dual machine is quite a bit worse. Not comparable with quads.

Best regards,
Vincent

>>Which means _every_ node is likely going to see a cache miss for that particular memory read. The entire point of the Zobrist hashing scheme is that a single change on the chess board produces a wild change in the hash signature, sending us to a different page of memory and the resulting cache miss. If you use 450ns for a miss, you get one of those _every_ node. What about the _other_ cache misses that happen? One more and you are _done_, as you just exhausted the entire 1 usec of time you have. Conclusion? It isn't taking _that_ long.
>
>A simple experiment is to run bench with the smallest and largest possible hash sizes and compare the raw node times. The difference should be largely attributable to memory latency. You may want to run it several times to see how much variability there is between runs.
>
>[snip]
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.