Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Another memory latency test

Author: Gerd Isenberg

Date: 05:07:18 07/22/03

Go up one level in this thread


On July 21, 2003 at 15:35:17, J. Wesley Cleveland wrote:

>On July 18, 2003 at 23:45:16, Robert Hyatt wrote:
>
>>On July 18, 2003 at 21:58:18, J. Wesley Cleveland wrote:
>>
>>>On July 18, 2003 at 21:17:14, Robert Hyatt wrote:
>>>
>>>>On July 18, 2003 at 15:21:35, J. Wesley Cleveland wrote:
>>>>
>>>>>On July 17, 2003 at 18:25:51, Robert Hyatt wrote:
>>>>>
>>>>>>On July 17, 2003 at 17:35:33, Dieter Buerssner wrote:
>>>>>>
>>>>>[snip]
>>>>>>>
>>>>>>>I cannot find any randomness in the reads of lm-bench (I downloaded latest
>>>>>>>stable source today, not the experimental version, available, too). If it would
>>>>>>>do random reads, it would have no way to avoid the problem with the TLBs you
>>>>>>>explained.
>>>>>>
>>>>>>4M pages solves it for at least 250mb worth of RAM.  But then again, _no_ chess
>>>>>>program depends on purely random memory accesses to blow out the TLB.  The only
>>>>>>truly random accesses I do are the regular hashing and pawn hashing, which
>>>>>>both total to significantly less than the total nodes I search.  Which means
>>>>>>the TLB penalty is not even 1% of my total run time.  Probably closer to
>>>>>>.01% - .05%.
>>>>>>
>>>>>>I ignore that.
>>>>>
>>>>>Why do you think it is that low? I get ~20-30% of nodes have hash probes with
>>>>>crafty.
>>>>
>>>>
>>>>Look at the code.
>>>I not only looked at the code. I *instrumented it*. I won't have complete
>>>results until Monday, but it appears that crafty spends 3-5% of its total time
>>>inside hashprobe on my (slow) machine and a prefetch could reduce that by about
>>>half.
>>>
>>>>Crafty probes memory _once_ for a hash probe.  That
>>>>introduces a memory access penalty once per node in the basic search,
>>>>less than once per node in the q-search (I only probe phash there and I
>>>>don't probe it but about 25% of the q-search nodes I visit).
>>>
>>>If you had read whai I wrote, you would see I said crafty does a hash probe
>>>20-30% of its total nodes.
>>
>>OK.  I clearly mis-read what you meant.  the 20-30% was eye-catching as that
>>is a pretty common hash hit percentage as well...
>>
>>
>>>
>>>>As a result, you get less than one probe per node searched.  A node searched
>>>>requires something on the order of 3000-5000 instructions.  What percentage
>>>>of that 3K-5K instruction timing is that single hash probe?  Almost zero.
>>>
>>>Except that a fast machine may do these 3-5K instructions in <1usec. A cache
>>>miss + a TLB miss may take 300-400 ns. I would not call 30% almost 0.
>>
>>You are missing my point.  In the position(s) you tested, you saw 20-30%
>>hash probes.  That means one probe for every 3-5 nodes.  At 1M nodes
>>per second, that is 200K-300K probes per second.  If you measure the
>>time spent in searching a single node, multiply that by 3-5X, then compare
>>that to the hash probe time, the time spent probing the hash table is low.
>>
>>Note that your 5% is _not_ the total time used to probe the table.  It is
>>the time to probe the table, and do it _twice_ although the second probe
>>doesn't have any memory access penalty associated with it in most cases.
>>
>>So a big percent of that 5% is doing the actual work done in HashProbe(),
>>rather than being all memory access penalty...
>
>I ran some tests on my slow (450 Mhz) machine. Hash was set to 192Mb. The test
>was 21 middle-game positions and ran for nearly 1 hour. Crafty got between 125k
>and 230k nps. Crafty spent 3.6% of total time in HashProbe. I added the
>following code just before the call to RepetitionCheck() in search.c (slightly
>modified from the code in hash.c). Note that the code is basically a no-op as
>all variables are local.
>
>{
>  static BITBOARD word1;
>  BITBOARD temp_hashkey;
>  HASH_ENTRY *htable;
>/*
> ----------------------------------------------------------
>|                                                          |
>|   first, compute the initial hash address and choose     |
>|   which hash table (based on color) to probe.            |
>|                                                          |
> ----------------------------------------------------------
>*/
>
>  temp_hashkey=(wtm) ? HashKey : ~HashKey;
>  htable=trans_ref_a+((int) temp_hashkey&hash_maska);
>  word1=htable->word1;
>}
>
>Now crafty spends 2.8% of its time in HashProbe.

Hi Wesley,

that's interesting, it seems that preloading decreases the hash-latency.
May be prefetching with Athlons's/Opteron's/P4's PREFETCHNTA, (bypassing
L2-Cache) is even better.

Gerd


{
  static BITBOARD word1;
  BITBOARD temp_hashkey;
  HASH_ENTRY *htable;
/*
 ----------------------------------------------------------
|                                                          |
|   first, compute the initial hash address and choose     |
|   which hash table (based on color) to probe.            |
|                                                          |
 ----------------------------------------------------------
*/

  temp_hashkey=(wtm) ? HashKey : ~HashKey;
  htable=trans_ref_a+((int) temp_hashkey&hash_maska);
#ifdef _DOPREFETCH
  __asm mov eax, [htable]; // get the pointer
  __asm PREFETCHNTA [eax]; // fetch to L1-cache, bypassing L2-Cache
#else
  word1=htable->word1;
#endif
}

some additional notes from:

"AMD Athlon™ Processor x86 Code Optimization Guide"

Prefetching versus Preloading

In code that uses the block prefetch technique as described in
“Optimizing Main Memory Performance for Large Arrays” on page 66, a standard
load instruction is the best way to prefetch data. But in other situations, load
instructions may be able to mimic the functionality of prefetch instructions,
but they do not offer the same performance advantage.Prefetch instructions only
update the cache line in the L1/L2 cache and do not update an architectural
register. This uses one less register compared to a load instruction. Prefetch
instructions also do not cause
normal instruction retirement to stall. Another benefit of prefetching versus
preloading is that the prefetching instructions can retire even if the load data
has not arrived yet. A regular load used for preloading will stall the machine
if it gets to the bottom of the fixed-issue reorder buffer (part of the
Instruction Control Unit) and the load data has not arrived yet. The load is
"blocking" whereas the prefetch is "non-blocking."



This page took 0.01 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.