Computer Chess Club Archives




Subject: Re: Source code to measure it - there is something wrong

Author: Vincent Diepeveen

Date: 05:12:43 07/16/03


On July 16, 2003 at 08:04:49, Gerd Isenberg wrote:

>On July 15, 2003 at 20:26:28, Vincent Diepeveen wrote:
>>On July 15, 2003 at 17:58:01, Gerd Isenberg wrote:
>>>Ok, I think there is one problem with Vincent's cache benchmark.
>>>There are two similar functions, DoNrng and DoNreads. DoNrng is used to
>>>measure the time without the hash read. But its instructions have the
>>>potential for faster execution due to fewer dependencies and stalls. The CPU
>>>may execute parts of two loop bodies of DoNrng interleaved or simultaneously -
>>>that is not possible in DoNreads. Therefore the time for N DoNrng is not the
>>>time used inside the N DoNrng loop, and may be much faster.
>>If you look at the generated assembly code you will see that nothing is wrong.
>>If something goes wrong in the pairing of the instructions, that might at most
>>make a difference of 1 to 2 clock cycles.
>>At 2GHz that's 0.5 nanoseconds.
>>It is cool if it measures accurately to 0.5 ns of course, but that was never
>>the intention of the test. It was intended to measure latencies at big
>>supercomputers when n processors read from the memory of just 1 poor node.
>>The other test, latencyC.c, does a criss-cross reference. A 0.5 ns error in
>>the measurement there is not really relevant (assuming no other system
>>activity disturbs it - which there is, because other processes may also use
>>RAM on your node, and that can give major differences, but it still isn't a
>>bug in the test).
>>If you care about a 0.5 ns difference then shoot me, but don't ever say it's
>>130 ns like Bob claims for 133MHz DDR RAM.
>>I bet you're on fresh new CL2 150MHz DDR RAM, if not faster.
>>The 280 ns was measured for a P4 with a 533 bus or 133MHz DDR RAM, and the
>>400 ns, as you see, is from the duals. For me 380 ns goes to 400 ns or so
>>when using bigger hashtables. I see no significant difference between 380 and
>>400, so to speak. It's simply *huge* random latency.
>Hi Vincent,
>I tried this one (AllocateHash before) in DoNrng, doing some cyclic hash reads
>with index 0..3:
>int DoNrng(BITBOARD n) {
>  BITBOARD i=1,dummyres,nents;
>  int t1,t2;
>  nents = nentries; /* hopefully this gets into a register */
>  dummyres = globaldummy;
>  t1 = GetClock();
>  do {
>    BITBOARD index = RanrotA()%nents;
>    dummyres ^= index ^ hashtable[index&3];
>  } while( i++ < n );
>  t2 = GetClock();
>  globaldummy = dummyres;
>  return(t2-t1);
>}
>Here the latency is about 25 ns less on a P4 with only 256MB ;-(
>180 ns instead of 205 ns.
>Paul Hsieh's comment on Oliver's first approach may also apply to your
>approach: the 64-bit mod is very heavy; maybe it's better to use power-of-two
>tables and "and" with 2^N - 1.

Remember this test is intended for 64-bit platforms, not for outdated IA32.
On my K7 I also saw that the 64-bit mod was very slow.

I do not know the number of cycles the 64-bit mod costs, but it is obvious
that it will cost a different number of clock cycles on each CPU. The loop is
even more expensive; up to 65 ns is no exception there. However, that overhead
gets measured first, and after that the latency gets *reduced* by the cost of
that 64-bit mod.

That is a perfectly correct way of doing it.

The bad thing about the test is the behaviour of the RNG. I deliberately took
an RNG and didn't measure in a serial way, like x+128 bytes each time, because
whatever stride algorithm you use, it is easy for the processor to predict.

Paul Hsieh's comments regarding the linked list also apply to this test,
because the value gets xored in. The address at which the processor is going
to read the value is not known in advance, because it still has to be
generated by the RNG; that is what makes this test good at measuring latency.

What this test doesn't do is use a RNG where tests have been done using the

Note that everyone criticizes the MOD operation, but they all miss that the
time this mod operation takes gets subtracted from the measured time.

The only real weak link is the RNG operation, which sometimes might be big fun
for the L2 cache.

>Anyway, i found your test very instructive.

>Paul Hsieh's comment:
>You have a couple of problems:
>1) RDTSC often doesn't have enough accuracy/resolution to measure such
>a small number of clock cycles (indeed it varied from architecture to
>architecture).
>2) In an out-of-order architecture, the time of apparent execution of a
>single instruction cannot be measured in isolation.
>Usually the way to measure memory latency is build a linked list, and
>simply traverse that list, and measure the number of "traversals per
>second" or the number of clocks it takes to traverse a linked list of
>length n.
>The reason this is a good measure of memory latency is because when
>traversing a linked list, there is no way to proceed to the next link
>without first getting the result of the previous link traversal.
>Out-of-order execution is not possible, thus leading to a completely
>serialized execution stream.
>Usually you heavily unroll the loop and write it in assembly language to
>make sure there is nothing else affecting the timing.
