Computer Chess Club Archives




Subject: Re: Source code to measure it - there is something wrong

Author: Gerd Isenberg

Date: 05:04:49 07/16/03

Go up one level in this thread

On July 15, 2003 at 20:26:28, Vincent Diepeveen wrote:

>On July 15, 2003 at 17:58:01, Gerd Isenberg wrote:
>>Ok, i think there is one problem with Vincent's cache benchmark.
>>There are two similar functions DoNrng and DoNreads. DoNrng is used to mesure
>>the time without hashread. But the instructions has the potential of faster
>>execution due to less dependencies and stalls. It may execute parts of two loop
>>bodies of DoNrng interlaced or simultaniesly - that is not possible in DoNreads.
>>Therefore the time for N DoNrng is not the time used inside the N DoNrng loop,
>>and maybe much faster.
>If you look to the generated assembly code you will see that nothing is wrong.
>If something goes wrong in the pairing of the instructions that might at most
>make a difference of 1 to 2 clockcycles.
>At 2Ghz that's 0.5 nanoseconds.
>It is cool if it measures accurate in 0.5 ns of course, but that was never the
>intention of the test. It was intended to measure latencies at big
>supercomputers when n processors read in the memory of just 1 poor node.
>the other test latencyC.c is doing a criss cross reference. A 0.5 ns error in
>the measurement there (assuming there is no other system stuff disturbing which
>there is because actually other processes may also use RAM at your node, which
>can give major differences, but still doesn't give a bug in the test) is not
>really relevant.
>If you care for 0.5 ns difference then shoot me, but don't ever say it's 130ns
>like bob claims for 133Mhz DDR ram.
>You're at new fresh i bet cl2 150Mhz DDR ram if not faster.
>the 280 ns was measured for P4 with 533 bus or 133Mhz DDR ram and the 400 ns as
>you see is from the duals. For me 380 ns goes to 400 ns or something when using
>bigger hashtables. I see no significant difference between 380 and 400 so to
>speak. It's simply *huge* random latency.

Hi Vincent,

i tried this one ( AllocateHash before) in DoNrng doing some cyclic hash reads
with index 0..3:

int DoNrng(BITBOARD n) {
  BITBOARD i=1,dummyres,nents;
  int t1,t2;

  nents = nentries; /* hopefully this gets into a register */
  dummyres = globaldummy;

  t1 = GetClock();
  do {
    BITBOARD index = RanrotA()%nents;
    dummyres ^= index ^ hashtable[index&3];
  } while( i++ < n );
  t2 = GetClock();

  globaldummy = dummyres;

Here the latency is about 25ns less on a P4 with 256MB only ;-(
180ns instead of 205ns.

Paul Hsieh's comment on Oliver's first approach may also address your approach:
The 64-bit mod is very heavy, maybe it's better to use power of two tables and
use "and"  2^N - 1.

Anyway, i found your test very instructive.


Paul Hsieh's comment:

You have a couple of problems:
1) RDTSC often doesn't have enough accuracy/resolution to measure such
a small number of clock cycles (indeed it varied from architecture to

2) In an out-of-order architecture, the time of apparent execution of a
single instruction cannot be measured in isolation.


Usually the way to measure memory latency is build a linked list, and
simply traverse that list, and measure the number of "traversals per
second" or the number of clocks it takes to traverse a linked list of
length n.

The reason this is a good measure of memory latency is because when
traversing a linked list, there is no way to proceed to the next link
without first getting the result of the previous link traversal.
Out-of-order execution is not possible, thus leading to a completely
serialized execution stream.

Usually you heavily unroll the loop and write it in assembly language to
make sure there is nothing else affecting the timing.

This page took 0.04 seconds to execute

Last modified: Thu, 07 Jul 11 08:48:38 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.