Computer Chess Club Archives



Subject: Re: Another memory latency test

Author: Robert Hyatt

Date: 15:25:51 07/17/03



On July 17, 2003 at 17:35:33, Dieter Buerssner wrote:

>On July 17, 2003 at 17:05:00, Robert Hyatt wrote:
>
>>On July 17, 2003 at 09:16:10, Dieter Buerssner wrote:
>>
>>>I use an inner loop that just translates to a stream of memory-to-register
>>>move instructions (one for each access).  Here are some results (source at
>>>the end of the posting, not well tested, please report the errors/flaws ...)
>>>
>>
>>The main flaw is you are not testing "memory latency" here.  If you look at
>>how the X86 does virtual memory, it is a two-level memory lookup.  To avoid
>>this penalty, the TLB holds recent virtual-to-real address mappings, but the
>>TLB is not huge.  On my dual xeon, lm-bench reports that the TLB holds the
>>most recent 62 virtual-to-real translations.  What you are measuring is
>>at least _two_ memory latency cycles, one or two to do the virtual to real
>>address translation, then another to actually fetch the data.
>
>Please see also my answer to Gerd in this thread.  And you might have
>recognized (from the earlier thread) that I was somewhat aware of this (I was
>the first one to mention virtual memory at all).
>
>From your former posts:
>---
>Bob:
>No.  lm-bench does _random_ reads and computes the _random-access_
>latency.

That is correct.  The "sequential read latency" is what happens when you
read successive bytes.  So long as you don't hit cache, which a stride of
128 bytes guarantees, you are getting random-access latencies.  That was the
point.
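
To make that concrete, here is a minimal sketch of the kind of strided read
loop we are talking about (my own reconstruction for illustration, not
Dieter's actual source; sizes and names are made up).  Note that independent
reads like these can overlap on an out-of-order CPU, so a dependent pointer
chase would be a stricter test:

/* Sketch only: strided reads over a buffer much larger than cache.
   Each read lands on a new cache line, so the caches don't help; any
   page-table walks the TLB cannot cover get added on top of the raw
   latency, which is the point being argued in this thread.           */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define STRIDE   128                  /* bytes between successive reads */
#define BUFSIZE  (64UL*1024*1024)     /* 64 MB, far beyond L2           */
#define REPEAT   16

int main(void)
{
    unsigned char *buf = malloc(BUFSIZE);
    volatile long sum = 0;
    size_t i;
    int r;
    clock_t t0, t1;

    if (!buf) return 1;
    memset(buf, 1, BUFSIZE);          /* fault all the pages in first   */
    t0 = clock();
    for (r = 0; r < REPEAT; r++)
        for (i = 0; i < BUFSIZE; i += STRIDE)
            sum += *(long *)(buf + i);
    t1 = clock();
    printf("%.1f ns per read (sum=%ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 /
           ((double)REPEAT * (BUFSIZE / STRIDE)), (long)sum);
    free(buf);
    return 0;
}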

There are several ways to bypass the TLB problem.  I noticed your reference
to "virtual memory", but it seemed to be in the context of "disk paging" only.
My reference was to the penalty associated with doing virtual-to-real address
translations.  But nobody calls that "memory latency" when you _know_ you have
to do two or three memory accesses in such cases, on those machines that have
this problem (not all do).
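
For anyone following along, the "two or three accesses" come straight from how
a 32-bit X86 virtual address is decoded with 4K pages.  A tiny illustration
(the address is arbitrary, just to show the split):

#include <stdio.h>

int main(void)
{
    /* On a TLB miss with 4K pages, the hardware reads a page-directory
       entry and then a page-table entry before it can do the read you
       actually asked for -- the extra one or two memory accesses.      */
    unsigned long va  = 0x12345678UL;           /* arbitrary example    */
    unsigned long pd  = (va >> 22) & 0x3ff;     /* top 10 bits          */
    unsigned long pt  = (va >> 12) & 0x3ff;     /* next 10 bits         */
    unsigned long off =  va        & 0xfff;     /* low 12 bits          */

    printf("page dir %lu, page table %lu, offset %lu\n", pd, pt, off);
    return 0;
}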

You can bypass the TLB problem by not running on that kind of machine.  I
recommend any Cray in the Cray-1 family, all the way through the T90.  That
is a solution.

Your O/S can switch the X86 to a 4MB page size.  With any luck, the TLB misses
then drop to near zero, you end up with one memory access per memory
reference, and you see the raw latency that lm-bench reports.  _Everybody_
should know about virtual memory hardware, page tables, the MMU, the TLB,
etc., when talking about architecture.  And _nobody_ throws that noise into
the discussion when discussing "memory latency".
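
On a current Linux kernel, one way to ask for large pages explicitly is
mmap() with MAP_HUGETLB.  A rough sketch only (it assumes the administrator
has reserved huge pages, and the table size is just an example):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define TABLE_BYTES (256UL * 1024 * 1024)       /* example size only */

int main(void)
{
    /* With huge pages, each 2M/4M page needs one TLB entry instead of
       hundreds of 4K entries, so hash probes stop thrashing the TLB.  */
    void *ht = mmap(NULL, TABLE_BYTES, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (ht == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");            /* fall back to 4K pages */
        ht = mmap(NULL, TABLE_BYTES, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    printf("hash table at %p\n", ht);
    /* ... use ht as the transposition table ... */
    return 0;
}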

>---
>
>I cannot find any randomness in the reads of lm-bench (I downloaded the latest
>stable source today, not the experimental version, which is also available).
>If it did random reads, it would have no way to avoid the problem with the
>TLBs you explained.

4MB pages solve it for at least 250MB worth of RAM.  But then again, _no_ chess
program does enough purely random memory accesses to blow out the TLB.  The
only truly random accesses I do are the regular hash and pawn hash probes,
which together total significantly fewer accesses than the total nodes I
search.  Which means the TLB penalty is not even 1% of my total run time.
Probably closer to 0.01% - 0.05%.

I ignore that.




>
>And (Bob is the uncited one):
>---
>>>Host                 OS   Mhz   L1 $   L2 $    Main mem    Guesses
>>>--------- -------------   ---   ----   ----    --------    -------
>>>scrappy    Linux 2.4.20   744 4.0370 9.4300       130.2
>>>
>>>>In the lmbench paper they have a nice graph like this.
>>>
>>>
>>>Is the above what you want?
>>
>>I think that it's as close as you're going to get. The most important thing is
>>that 130 [ns] is the largest number. And wouldn't that be a little bit
>>pessimistic even for chess hash tables?
>
>
>I don't think so, although, in the case of Crafty, the actual latency is
>about 1/3 of that, since I read three positions and you would amortize the
>latency over those three positions rather than just over one.
>---
>
>This also seems to imply that you set latency equal to the value Vincent and I
>measured.  In your current answer to my post, you seem to switch the context.

No.  I use the term "latency" to mean "latency", not "the amount of time
needed to both translate the virtual address (which is unknown at the time of
the reference, and which is also _variable_) and then fetch the data."  I've
been consistent there.  And I _still_ measure latency that way.  Most hardware
books mention this very clearly and typically say "a random memory read can
require from zero to three memory cycles, depending on whether it is in cache
(zero) all the way to the TLB having nothing to help (three)."  However, those
zero to three memory cycles are zero to three memory latency cycles, not one
latency that can range from zero to three cycles.  That would be worthless.

Vincent is the one that is using the term "memory latency" in the wrong way.

>
>Also, you may remember that I suggested the scheme you are using in Crafty
>now to you (the three contiguous cells in the hash table instead of two
>tables, one double the size of the other).

Sure.

>
>>To compute a _real_ raw memory latency number, you have to avoid overwriting
>>the TLB too badly.  Otherwise the latency is inflated by MMU overhead
>>that a "normal application" doesn't actually hit that badly.
>
>Sure, I don't doubt that it is not the "real" latency that my code measures.
>But that number seems rather uninteresting from a (chess engine) programmer's
>point of view.  I guess many database applications use hashing schemes and
>have random access latencies similar to chess engines.  Of course I can also
>imagine many applications where this won't play a role (say a numerical
>calculation, where you typically access vectors in order; other numerical
>applications may have big jumps, too).
>
>Regards,
>Dieter


Databases don't worry about latency.  They read so much data that they really
measure cache-line fill time more than anything else.  Getting to a single
word is useless; they want to get to a specific block, and there the TLB
misses don't count at all.
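
To see why, compare the per-word cost when whole blocks are streamed: only the
first word of each line pays the full latency, and the rest follows almost for
free.  A little sketch (my own sizes, purely illustrative; compare its output
with the per-read number from the strided test earlier in this post):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BLOCK   4096                    /* read whole 4K blocks        */
#define NBLOCKS 16384                   /* 64 MB total                 */

int main(void)
{
    unsigned char *buf = malloc((size_t)BLOCK * NBLOCKS);
    volatile unsigned long sum = 0;
    size_t b, i;
    clock_t t0, t1;

    if (!buf) return 1;
    memset(buf, 1, (size_t)BLOCK * NBLOCKS);    /* fault the pages in  */
    t0 = clock();
    for (b = 0; b < NBLOCKS; b++)               /* sequential blocks   */
        for (i = 0; i < BLOCK; i += sizeof(unsigned long))
            sum += *(unsigned long *)(buf + b * BLOCK + i);
    t1 = clock();
    printf("%.2f ns per word when streaming blocks (sum=%lu)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 /
           ((double)NBLOCKS * (BLOCK / sizeof(unsigned long))),
           (unsigned long)sum);
    free(buf);
    return 0;
}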


