Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: Another memory latency test

Author: Vincent Diepeveen
Date: 15:31:55 07/23/03
On July 22, 2003 at 15:58:00, J. Wesley Cleveland wrote:

>On July 22, 2003 at 14:27:45, Robert Hyatt wrote:
>
>>On July 22, 2003 at 12:28:39, J. Wesley Cleveland wrote:
>>
>>>On July 22, 2003 at 08:07:18, Gerd Isenberg wrote:
>>>
>>>>On July 21, 2003 at 15:35:17, J. Wesley Cleveland wrote:
>>>>
>>>>>On July 18, 2003 at 23:45:16, Robert Hyatt wrote:
>>>>>
>>>>>>On July 18, 2003 at 21:58:18, J. Wesley Cleveland wrote:
>>>>>>
>>>>>>>On July 18, 2003 at 21:17:14, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>On July 18, 2003 at 15:21:35, J. Wesley Cleveland wrote:
>>>>>>>>
>>>>>>>>>On July 17, 2003 at 18:25:51, Robert Hyatt wrote:
>>>>>>>>>
>>>>>>>>>>On July 17, 2003 at 17:35:33, Dieter Buerssner wrote:
>>>>>>>>>>
>>>>>>>>>[snip]
>>>>>>>>>>>
>>>>>>>>>>>I cannot find any randomness in the reads of lm-bench (I downloaded latest
>>>>>>>>>>>stable source today, not the experimental version, available, too). If it would
>>>>>>>>>>>do random reads, it would have no way to avoid the problem with the TLBs you
>>>>>>>>>>>explained.
>>>>>>>>>>
>>>>>>>>>>4M pages solves it for at least 250mb worth of RAM.  But then again, _no_ chess
>>>>>>>>>>program depends on purely random memory accesses to blow out the TLB.  The only
>>>>>>>>>>truly random accesses I do are the regular hashing and pawn hashing, which
>>>>>>>>>>both total to significantly less than the total nodes I search.  Which means
>>>>>>>>>>the TLB penalty is not even 1% of my total run time.  Probably closer to
>>>>>>>>>>.01% - .05%.
>>>>>>>>>>
>>>>>>>>>>I ignore that.
>>>>>>>>>
>>>>>>>>>Why do you think it is that low? I get ~20-30% of nodes have hash probes with
>>>>>>>>>crafty.
>>>>>>>>
>>>>>>>>
>>>>>>>>Look at the code.
>>>>>>>I not only looked at the code. I *instrumented it*. I won't have complete
>>>>>>>results until Monday, but it appears that crafty spends 3-5% of its total time
>>>>>>>inside hashprobe on my (slow) machine and a prefetch could reduce that by about
>>>>>>>half.
>>>>>>>
>>>>>>>>Crafty probes memory _once_ for a hash probe.  That
>>>>>>>>introduces a memory access penalty once per node in the basic search,
>>>>>>>>less than once per node in the q-search (I only probe phash there and I
>>>>>>>>don't probe it but about 25% of the q-search nodes I visit).
>>>>>>>
>>>>>>>If you had read whai I wrote, you would see I said crafty does a hash probe
>>>>>>>20-30% of its total nodes.
>>>>>>
>>>>>>OK.  I clearly mis-read what you meant.  the 20-30% was eye-catching as that
>>>>>>is a pretty common hash hit percentage as well...
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>As a result, you get less than one probe per node searched.  A node searched
>>>>>>>>requires something on the order of 3000-5000 instructions.  What percentage
>>>>>>>>of that 3K-5K instruction timing is that single hash probe?  Almost zero.
>>>>>>>
>>>>>>>Except that a fast machine may do these 3-5K instructions in <1usec. A cache
>>>>>>>miss + a TLB miss may take 300-400 ns. I would not call 30% almost 0.
>>>>>>
>>>>>>You are missing my point.  In the position(s) you tested, you saw 20-30%
>>>>>>hash probes.  That means one probe for every 3-5 nodes.  At 1M nodes
>>>>>>per second, that is 200K-300K probes per second.  If you measure the
>>>>>>time spent in searching a single node, multiply that by 3-5X, then compare
>>>>>>that to the hash probe time, the time spent probing the hash table is low.
>>>>>>
>>>>>>Note that your 5% is _not_ the total time used to probe the table.  It is
>>>>>>the time to probe the table, and do it _twice_ although the second probe
>>>>>>doesn't have any memory access penalty associated with it in most cases.
>>>>>>
>>>>>>So a big percent of that 5% is doing the actual work done in HashProbe(),
>>>>>>rather than being all memory access penalty...
>>>>>
>>>>>I ran some tests on my slow (450 Mhz) machine. Hash was set to 192Mb. The test
>>>>>was 21 middle-game positions and ran for nearly 1 hour. Crafty got between 125k
>>>>>and 230k nps. Crafty spent 3.6% of total time in HashProbe. I added the
>>>>>following code just before the call to RepetitionCheck() in search.c (slightly
>>>>>modified from the code in hash.c). Note that the code is basically a no-op as
>>>>>all variables are local.
>>>>>
>>>>>{
>>>>>  static BITBOARD word1;
>>>>>  BITBOARD temp_hashkey;
>>>>>  HASH_ENTRY *htable;
>>>>>/*
>>>>> ----------------------------------------------------------
>>>>>|                                                          |
>>>>>|   first, compute the initial hash address and choose     |
>>>>>|   which hash table (based on color) to probe.            |
>>>>>|                                                          |
>>>>> ----------------------------------------------------------
>>>>>*/
>>>>>
>>>>>  temp_hashkey=(wtm) ? HashKey : ~HashKey;
>>>>>  htable=trans_ref_a+((int) temp_hashkey&hash_maska);
>>>>>  word1=htable->word1;
>>>>>}
>>>>>
>>>>>Now crafty spends 2.8% of its time in HashProbe.
>>>>
>>>>Hi Wesley,
>>>>
>>>>that's interesting, it seems that preloading decreases the hash-latency.
>>>>May be prefetching with Athlons's/Opteron's/P4's PREFETCHNTA, (bypassing
>>>>L2-Cache) is even better.
>>>>
>>>>Gerd
>>>
>>>I'm sure it would be better. My code doesn't make it run any faster, it just
>>>shows that the delay due to memory access is significant.
>>>
>>
>>Can you tell me how you conclude this?
>
>The *only* effect of the code I added is to ensure that the depth-preferred part
>of the hash table is put into cache, so any speedup in HashProbe is due to not
>having a cache (and ATB) miss.
>
>>
>>IE there are two parts in HashProbe();
>>
>>1.  probe "depth-preferred table".
>>
>>2.  probe "always-store" table".
>>
>>You are assuming that of the total 3.6% done in HashProbe(), that .8% is
>>done in the always-store code.  Which means that .8% is done in the depth-
>>preferred table, and the remaining time is memory latency.
>>
>>I don't think that is the explanation.
>>
>>Suppose _many_ hits occur in the depth-preferred table.  Then you won't be
>>probing the always-store table at those positions.  And your .8% assumption
>>is not so safe to make.  Unless you run huge searches with a small table,
>>this effect will distort any possible conclusions.
>>
>>
>>No way a single random access memory read is 3% of the total time spent
>>doing a node.  There are way too many _other_ random-access reads done in
>>crafty to make that possible.  The total time would go over 100%.
>
>At 1M nodes/sec, the time for 1 node is (obviously) 1 usec. The latency for one
>cache miss is about 150 nsec. This implies that if you have *one* cache miss
>every 4 nodes, you will spend 3% on that single random access memory read.
>Apparently, caching works very well for crafty, except for HashProbe( ;).

At bob's quad the latency for a miss into random memory not yet opened is
about 400ns, not 150ns.

Best regards,
Vincent

>Again, my figures are on my slow machine. Your machine is ~6x faster, while your
>memory latency is not much better, so I suspect the figures will be much worse
>on your machine. You may want to test this for yourself.
>
>>
>>>>
>>>>
>>>>{
>>>>  static BITBOARD word1;
>>>>  BITBOARD temp_hashkey;
>>>>  HASH_ENTRY *htable;
>>>>/*
>>>> ----------------------------------------------------------
>>>>|                                                          |
>>>>|   first, compute the initial hash address and choose     |
>>>>|   which hash table (based on color) to probe.            |
>>>>|                                                          |
>>>> ----------------------------------------------------------
>>>>*/
>>>>
>>>>  temp_hashkey=(wtm) ? HashKey : ~HashKey;
>>>>  htable=trans_ref_a+((int) temp_hashkey&hash_maska);
>>>>#ifdef _DOPREFETCH
>>>>  __asm mov eax, [htable]; // get the pointer
>>>>  __asm PREFETCHNTA [eax]; // fetch to L1-cache, bypassing L2-Cache
>>>>#else
>>>>  word1=htable->word1;
>>>>#endif
>>>>}
>>>>
>>>>some additional notes from:
>>>>
>>>>"AMD Athlon™ Processor x86 Code Optimization Guide"
>>>>
>>>>Prefetching versus Preloading
>>>>
>>>>In code that uses the block prefetch technique as described in
>>>>“Optimizing Main Memory Performance for Large Arrays” on page 66, a standard
>>>>load instruction is the best way to prefetch data. But in other situations, load
>>>>instructions may be able to mimic the functionality of prefetch instructions,
>>>>but they do not offer the same performance advantage.Prefetch instructions only
>>>>update the cache line in the L1/L2 cache and do not update an architectural
>>>>register. This uses one less register compared to a load instruction. Prefetch
>>>>instructions also do not cause
>>>>normal instruction retirement to stall. Another benefit of prefetching versus
>>>>preloading is that the prefetching instructions can retire even if the load data
>>>>has not arrived yet. A regular load used for preloading will stall the machine
>>>>if it gets to the bottom of the fixed-issue reorder buffer (part of the
>>>>Instruction Control Unit) and the load data has not arrived yet. The load is
>>>>"blocking" whereas the prefetch is "non-blocking."
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.