Author: Vincent Diepeveen
Date: 15:32:23 07/23/03
Go up one level in this thread
On July 22, 2003 at 15:58:00, J. Wesley Cleveland wrote: >On July 22, 2003 at 14:27:45, Robert Hyatt wrote: > >>On July 22, 2003 at 12:28:39, J. Wesley Cleveland wrote: >> >>>On July 22, 2003 at 08:07:18, Gerd Isenberg wrote: >>> >>>>On July 21, 2003 at 15:35:17, J. Wesley Cleveland wrote: >>>> >>>>>On July 18, 2003 at 23:45:16, Robert Hyatt wrote: >>>>> >>>>>>On July 18, 2003 at 21:58:18, J. Wesley Cleveland wrote: >>>>>> >>>>>>>On July 18, 2003 at 21:17:14, Robert Hyatt wrote: >>>>>>> >>>>>>>>On July 18, 2003 at 15:21:35, J. Wesley Cleveland wrote: >>>>>>>> >>>>>>>>>On July 17, 2003 at 18:25:51, Robert Hyatt wrote: >>>>>>>>> >>>>>>>>>>On July 17, 2003 at 17:35:33, Dieter Buerssner wrote: >>>>>>>>>> >>>>>>>>>[snip] >>>>>>>>>>> >>>>>>>>>>>I cannot find any randomness in the reads of lm-bench (I downloaded latest >>>>>>>>>>>stable source today, not the experimental version, available, too). If it would >>>>>>>>>>>do random reads, it would have no way to avoid the problem with the TLBs you >>>>>>>>>>>explained. >>>>>>>>>> >>>>>>>>>>4M pages solves it for at least 250mb worth of RAM. But then again, _no_ chess >>>>>>>>>>program depends on purely random memory accesses to blow out the TLB. The only >>>>>>>>>>truly random accesses I do are the regular hashing and pawn hashing, which >>>>>>>>>>both total to significantly less than the total nodes I search. Which means >>>>>>>>>>the TLB penalty is not even 1% of my total run time. Probably closer to >>>>>>>>>>.01% - .05%. >>>>>>>>>> >>>>>>>>>>I ignore that. >>>>>>>>> >>>>>>>>>Why do you think it is that low? I get ~20-30% of nodes have hash probes with >>>>>>>>>crafty. >>>>>>>> >>>>>>>> >>>>>>>>Look at the code. >>>>>>>I not only looked at the code. I *instrumented it*. I won't have complete >>>>>>>results until Monday, but it appears that crafty spends 3-5% of its total time >>>>>>>inside hashprobe on my (slow) machine and a prefetch could reduce that by about >>>>>>>half. >>>>>>> >>>>>>>>Crafty probes memory _once_ for a hash probe. That >>>>>>>>introduces a memory access penalty once per node in the basic search, >>>>>>>>less than once per node in the q-search (I only probe phash there and I >>>>>>>>don't probe it but about 25% of the q-search nodes I visit). >>>>>>> >>>>>>>If you had read whai I wrote, you would see I said crafty does a hash probe >>>>>>>20-30% of its total nodes. >>>>>> >>>>>>OK. I clearly mis-read what you meant. the 20-30% was eye-catching as that >>>>>>is a pretty common hash hit percentage as well... >>>>>> >>>>>> >>>>>>> >>>>>>>>As a result, you get less than one probe per node searched. A node searched >>>>>>>>requires something on the order of 3000-5000 instructions. What percentage >>>>>>>>of that 3K-5K instruction timing is that single hash probe? Almost zero. >>>>>>> >>>>>>>Except that a fast machine may do these 3-5K instructions in <1usec. A cache >>>>>>>miss + a TLB miss may take 300-400 ns. I would not call 30% almost 0. >>>>>> >>>>>>You are missing my point. In the position(s) you tested, you saw 20-30% >>>>>>hash probes. That means one probe for every 3-5 nodes. At 1M nodes >>>>>>per second, that is 200K-300K probes per second. If you measure the >>>>>>time spent in searching a single node, multiply that by 3-5X, then compare >>>>>>that to the hash probe time, the time spent probing the hash table is low. >>>>>> >>>>>>Note that your 5% is _not_ the total time used to probe the table. It is >>>>>>the time to probe the table, and do it _twice_ although the second probe >>>>>>doesn't have any memory access penalty associated with it in most cases. >>>>>> >>>>>>So a big percent of that 5% is doing the actual work done in HashProbe(), >>>>>>rather than being all memory access penalty... >>>>> >>>>>I ran some tests on my slow (450 Mhz) machine. Hash was set to 192Mb. The test >>>>>was 21 middle-game positions and ran for nearly 1 hour. Crafty got between 125k >>>>>and 230k nps. Crafty spent 3.6% of total time in HashProbe. I added the >>>>>following code just before the call to RepetitionCheck() in search.c (slightly >>>>>modified from the code in hash.c). Note that the code is basically a no-op as >>>>>all variables are local. >>>>> >>>>>{ >>>>> static BITBOARD word1; >>>>> BITBOARD temp_hashkey; >>>>> HASH_ENTRY *htable; >>>>>/* >>>>> ---------------------------------------------------------- >>>>>| | >>>>>| first, compute the initial hash address and choose | >>>>>| which hash table (based on color) to probe. | >>>>>| | >>>>> ---------------------------------------------------------- >>>>>*/ >>>>> >>>>> temp_hashkey=(wtm) ? HashKey : ~HashKey; >>>>> htable=trans_ref_a+((int) temp_hashkey&hash_maska); >>>>> word1=htable->word1; >>>>>} >>>>> >>>>>Now crafty spends 2.8% of its time in HashProbe. >>>> >>>>Hi Wesley, >>>> >>>>that's interesting, it seems that preloading decreases the hash-latency. >>>>May be prefetching with Athlons's/Opteron's/P4's PREFETCHNTA, (bypassing >>>>L2-Cache) is even better. >>>> >>>>Gerd >>> >>>I'm sure it would be better. My code doesn't make it run any faster, it just >>>shows that the delay due to memory access is significant. >>> >> >>Can you tell me how you conclude this? > >The *only* effect of the code I added is to ensure that the depth-preferred part >of the hash table is put into cache, so any speedup in HashProbe is due to not >having a cache (and ATB) miss. > >> >>IE there are two parts in HashProbe(); >> >>1. probe "depth-preferred table". >> >>2. probe "always-store" table". >> >>You are assuming that of the total 3.6% done in HashProbe(), that .8% is >>done in the always-store code. Which means that .8% is done in the depth- >>preferred table, and the remaining time is memory latency. >> >>I don't think that is the explanation. >> >>Suppose _many_ hits occur in the depth-preferred table. Then you won't be >>probing the always-store table at those positions. And your .8% assumption >>is not so safe to make. Unless you run huge searches with a small table, >>this effect will distort any possible conclusions. >> >> >>No way a single random access memory read is 3% of the total time spent >>doing a node. There are way too many _other_ random-access reads done in >>crafty to make that possible. The total time would go over 100%. > >At 1M nodes/sec, the time for 1 node is (obviously) 1 usec. The latency for one >cache miss is about 150 nsec. This implies that if you have *one* cache miss >every 4 nodes, you will spend 3% on that single random access memory read. >Apparently, caching works very well for crafty, except for HashProbe( ;). ==> that should be dual At bob's dual the latency for a miss into random memory not yet opened is about 400ns, not 150ns. Best regards, Vincent >Again, my figures are on my slow machine. Your machine is ~6x faster, while your >memory latency is not much better, so I suspect the figures will be much worse >on your machine. You may want to test this for yourself. > >> >>>> >>>> >>>>{ >>>> static BITBOARD word1; >>>> BITBOARD temp_hashkey; >>>> HASH_ENTRY *htable; >>>>/* >>>> ---------------------------------------------------------- >>>>| | >>>>| first, compute the initial hash address and choose | >>>>| which hash table (based on color) to probe. | >>>>| | >>>> ---------------------------------------------------------- >>>>*/ >>>> >>>> temp_hashkey=(wtm) ? HashKey : ~HashKey; >>>> htable=trans_ref_a+((int) temp_hashkey&hash_maska); >>>>#ifdef _DOPREFETCH >>>> __asm mov eax, [htable]; // get the pointer >>>> __asm PREFETCHNTA [eax]; // fetch to L1-cache, bypassing L2-Cache >>>>#else >>>> word1=htable->word1; >>>>#endif >>>>} >>>> >>>>some additional notes from: >>>> >>>>"AMD Athlon™ Processor x86 Code Optimization Guide" >>>> >>>>Prefetching versus Preloading >>>> >>>>In code that uses the block prefetch technique as described in >>>>“Optimizing Main Memory Performance for Large Arrays” on page 66, a standard >>>>load instruction is the best way to prefetch data. But in other situations, load >>>>instructions may be able to mimic the functionality of prefetch instructions, >>>>but they do not offer the same performance advantage.Prefetch instructions only >>>>update the cache line in the L1/L2 cache and do not update an architectural >>>>register. This uses one less register compared to a load instruction. Prefetch >>>>instructions also do not cause >>>>normal instruction retirement to stall. Another benefit of prefetching versus >>>>preloading is that the prefetching instructions can retire even if the load data >>>>has not arrived yet. A regular load used for preloading will stall the machine >>>>if it gets to the bottom of the fixed-issue reorder buffer (part of the >>>>Instruction Control Unit) and the load data has not arrived yet. The load is >>>>"blocking" whereas the prefetch is "non-blocking."
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.