Author: J. Wesley Cleveland
Date: 09:28:39 07/22/03
Go up one level in this thread
On July 22, 2003 at 08:07:18, Gerd Isenberg wrote: >On July 21, 2003 at 15:35:17, J. Wesley Cleveland wrote: > >>On July 18, 2003 at 23:45:16, Robert Hyatt wrote: >> >>>On July 18, 2003 at 21:58:18, J. Wesley Cleveland wrote: >>> >>>>On July 18, 2003 at 21:17:14, Robert Hyatt wrote: >>>> >>>>>On July 18, 2003 at 15:21:35, J. Wesley Cleveland wrote: >>>>> >>>>>>On July 17, 2003 at 18:25:51, Robert Hyatt wrote: >>>>>> >>>>>>>On July 17, 2003 at 17:35:33, Dieter Buerssner wrote: >>>>>>> >>>>>>[snip] >>>>>>>> >>>>>>>>I cannot find any randomness in the reads of lm-bench (I downloaded latest >>>>>>>>stable source today, not the experimental version, available, too). If it would >>>>>>>>do random reads, it would have no way to avoid the problem with the TLBs you >>>>>>>>explained. >>>>>>> >>>>>>>4M pages solves it for at least 250mb worth of RAM. But then again, _no_ chess >>>>>>>program depends on purely random memory accesses to blow out the TLB. The only >>>>>>>truly random accesses I do are the regular hashing and pawn hashing, which >>>>>>>both total to significantly less than the total nodes I search. Which means >>>>>>>the TLB penalty is not even 1% of my total run time. Probably closer to >>>>>>>.01% - .05%. >>>>>>> >>>>>>>I ignore that. >>>>>> >>>>>>Why do you think it is that low? I get ~20-30% of nodes have hash probes with >>>>>>crafty. >>>>> >>>>> >>>>>Look at the code. >>>>I not only looked at the code. I *instrumented it*. I won't have complete >>>>results until Monday, but it appears that crafty spends 3-5% of its total time >>>>inside hashprobe on my (slow) machine and a prefetch could reduce that by about >>>>half. >>>> >>>>>Crafty probes memory _once_ for a hash probe. That >>>>>introduces a memory access penalty once per node in the basic search, >>>>>less than once per node in the q-search (I only probe phash there and I >>>>>don't probe it but about 25% of the q-search nodes I visit). >>>> >>>>If you had read whai I wrote, you would see I said crafty does a hash probe >>>>20-30% of its total nodes. >>> >>>OK. I clearly mis-read what you meant. the 20-30% was eye-catching as that >>>is a pretty common hash hit percentage as well... >>> >>> >>>> >>>>>As a result, you get less than one probe per node searched. A node searched >>>>>requires something on the order of 3000-5000 instructions. What percentage >>>>>of that 3K-5K instruction timing is that single hash probe? Almost zero. >>>> >>>>Except that a fast machine may do these 3-5K instructions in <1usec. A cache >>>>miss + a TLB miss may take 300-400 ns. I would not call 30% almost 0. >>> >>>You are missing my point. In the position(s) you tested, you saw 20-30% >>>hash probes. That means one probe for every 3-5 nodes. At 1M nodes >>>per second, that is 200K-300K probes per second. If you measure the >>>time spent in searching a single node, multiply that by 3-5X, then compare >>>that to the hash probe time, the time spent probing the hash table is low. >>> >>>Note that your 5% is _not_ the total time used to probe the table. It is >>>the time to probe the table, and do it _twice_ although the second probe >>>doesn't have any memory access penalty associated with it in most cases. >>> >>>So a big percent of that 5% is doing the actual work done in HashProbe(), >>>rather than being all memory access penalty... >> >>I ran some tests on my slow (450 Mhz) machine. Hash was set to 192Mb. The test >>was 21 middle-game positions and ran for nearly 1 hour. Crafty got between 125k >>and 230k nps. Crafty spent 3.6% of total time in HashProbe. I added the >>following code just before the call to RepetitionCheck() in search.c (slightly >>modified from the code in hash.c). Note that the code is basically a no-op as >>all variables are local. >> >>{ >> static BITBOARD word1; >> BITBOARD temp_hashkey; >> HASH_ENTRY *htable; >>/* >> ---------------------------------------------------------- >>| | >>| first, compute the initial hash address and choose | >>| which hash table (based on color) to probe. | >>| | >> ---------------------------------------------------------- >>*/ >> >> temp_hashkey=(wtm) ? HashKey : ~HashKey; >> htable=trans_ref_a+((int) temp_hashkey&hash_maska); >> word1=htable->word1; >>} >> >>Now crafty spends 2.8% of its time in HashProbe. > >Hi Wesley, > >that's interesting, it seems that preloading decreases the hash-latency. >May be prefetching with Athlons's/Opteron's/P4's PREFETCHNTA, (bypassing >L2-Cache) is even better. > >Gerd I'm sure it would be better. My code doesn't make it run any faster, it just shows that the delay due to memory access is significant. > > >{ > static BITBOARD word1; > BITBOARD temp_hashkey; > HASH_ENTRY *htable; >/* > ---------------------------------------------------------- >| | >| first, compute the initial hash address and choose | >| which hash table (based on color) to probe. | >| | > ---------------------------------------------------------- >*/ > > temp_hashkey=(wtm) ? HashKey : ~HashKey; > htable=trans_ref_a+((int) temp_hashkey&hash_maska); >#ifdef _DOPREFETCH > __asm mov eax, [htable]; // get the pointer > __asm PREFETCHNTA [eax]; // fetch to L1-cache, bypassing L2-Cache >#else > word1=htable->word1; >#endif >} > >some additional notes from: > >"AMD Athlon™ Processor x86 Code Optimization Guide" > >Prefetching versus Preloading > >In code that uses the block prefetch technique as described in >“Optimizing Main Memory Performance for Large Arrays” on page 66, a standard >load instruction is the best way to prefetch data. But in other situations, load >instructions may be able to mimic the functionality of prefetch instructions, >but they do not offer the same performance advantage.Prefetch instructions only >update the cache line in the L1/L2 cache and do not update an architectural >register. This uses one less register compared to a load instruction. Prefetch >instructions also do not cause >normal instruction retirement to stall. Another benefit of prefetching versus >preloading is that the prefetching instructions can retire even if the load data >has not arrived yet. A regular load used for preloading will stall the machine >if it gets to the bottom of the fixed-issue reorder buffer (part of the >Instruction Control Unit) and the load data has not arrived yet. The load is >"blocking" whereas the prefetch is "non-blocking."
This page took 0.01 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.