Computer Chess Club Archives



Subject: Re: Real data on cache working set

Author: Robert Hyatt

Date: 20:26:28 08/23/04



On August 23, 2004 at 22:10:46, Tom Kerrigan wrote:

>On August 23, 2004 at 17:06:25, Robert Hyatt wrote:
>
>>I hate to cloud all the disinformation here with real data, but sometimes it
>>does tend to shed light on a topic that gets talked about with no supporting
>>data of any kind.
>
>Your data may be real but it's still crap. Come on, 16MB of L1 cache with 16
>byte lines and apparently no set associativity? It's like you went out of your
>way to simulate something as far removed from reality as possible.

I chose 16-byte lines to make the cache as efficient as possible.  Longer
lines make it _less_ efficient except for pre-fetching, which is _not_ what
this is about.  Set associativity was set to 16 for no reason other than that
is what my PIV has in L2.  Short lines are best for cache line utilization;
long lines favor serial reads, where pre-fetching helps.  But my goal was to
get to the point of _minimum_ cache misses, where pre-fetching becomes a moot
point.  I wanted to know how much L1I cache was needed to drop instruction
fetches to near zero.  That brackets the working set size from above and
below.  Ditto for data.

The point is _not_ the raw cache hit/miss counts.  It is what size of cache
is needed to minimize cache misses, period, as _that_ gives a very good
estimate of the working set size...
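
To make the methodology concrete, here is a toy sketch of the kind of
experiment described above: a small set-associative LRU cache simulator in C,
sweeping the total size with 16-byte lines and 16-way sets.  This is my own
illustration, _not_ the actual cachegrind-based tool; the names and the
address stream are made up.  The point to watch is where the miss count stops
falling, because that cache size bounds the working set.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* One cache line: tag, LRU timestamp, valid bit. */
typedef struct {
    uint64_t tag;
    uint64_t lru;            /* time of last use; 0 = never used */
    int      valid;
} Line;

typedef struct {
    Line    *lines;          /* sets * ways entries */
    int      sets, ways, line_bytes;
    uint64_t clock, hits, misses;
} Cache;

static Cache *cache_new(int total_bytes, int line_bytes, int ways) {
    Cache *c = malloc(sizeof *c);
    c->line_bytes = line_bytes;
    c->ways = ways;
    c->sets = total_bytes / (line_bytes * ways);  /* assumes powers of two */
    c->lines = calloc((size_t)c->sets * ways, sizeof(Line));
    c->clock = c->hits = c->misses = 0;
    return c;
}

static void cache_access(Cache *c, uint64_t addr) {
    uint64_t lineno = addr / (uint64_t)c->line_bytes;
    Line *set = &c->lines[(size_t)(lineno % c->sets) * c->ways];
    int victim = 0;
    c->clock++;
    for (int w = 0; w < c->ways; w++) {
        if (set[w].valid && set[w].tag == lineno) {  /* hit */
            set[w].lru = c->clock;
            c->hits++;
            return;
        }
        if (!set[w].valid || set[w].lru < set[victim].lru)
            victim = w;                              /* empty or least recent */
    }
    set[victim].valid = 1;                           /* miss: fill the LRU way */
    set[victim].tag = lineno;
    set[victim].lru = c->clock;
    c->misses++;
}

int main(void) {
    /* The address stream is a stand-in for a real instruction or data
     * trace: a 256KB region touched repeatedly.  Misses flatten out once
     * the cache is big enough to hold all of it. */
    for (int kb = 32; kb <= 1024; kb *= 2) {
        Cache *c = cache_new(kb * 1024, 16, 16);
        for (int rep = 0; rep < 4; rep++)
            for (uint64_t a = 0; a < 256 * 1024; a += 8)
                cache_access(c, a);
        printf("%5d KB cache: %8llu misses\n", kb,
               (unsigned long long)c->misses);
        free(c->lines);
        free(c);
    }
    return 0;
}

In this toy run the misses stop falling at 256KB, exactly the size of the
region the stream touches; that flattening point is the working-set estimate.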




>
>First, no set associativity means that you thrash more, esp. because of random
>hash probes. And you could be tallying up thousands of cache misses just because
>your program frequently accesses two ints of data that are at unlucky memory
>addresses (more likely with smaller cache sizes). So does that add 8 bytes or
>hundreds of bytes to your working set? Impossible to tell from your "real data."


Where do you get "no set associativity"?  I didn't even mention it.  But it
_was_ 16-way.  If you think about what I ran, and the "point" I picked to
estimate the WSS, you'd realize the nit-pick above is moot.




>
>Second, by setting the cache lines so small, there's no way for us to tell if
>all those cache misses are Crafty accessing its working set at random or Crafty
>doing hash table read/writes.


With 64K of hash _total_?  Give me a break...  It _isn't_ random hash
reads/writes, obviously.  That is why I ran with the minimum hash size: to
limit the hash table's cache usage.  Again, I ran with 64K bytes of _total_
hash space.  64K.  Not 64M.  That is how I drove the miss rate to near
nothing, except for the initial cache fills.






>
>The proper way to do this experiment is to set the simulator to something
>realistic, like 256/512/1024 KB of unified 16-way set associative (LRU eviction)
>64 byte per line cache, like you find in AMD processors.
>


The simulator won't do a unified L1.  I specifically wanted to find the
instruction and data footprints separately, and this did exactly that, pretty
reasonably.  I even ran _fully associative_, but it made no significant
difference, hence my lack of interest.  Once you stop cache misses, which I
did, associativity is meaningless.



>Run the simulation for 1 minute, then 2 minutes. Take delta cache misses, divide
>by delta nodes searched, and that gives you misses per node after the cache is
>warmed. When the misses stay relatively constant between cache sizes, you've
>found the working set size.

What are you talking about now?  What I reported was this:

I started Crafty, set up the position, terminated the program, and got the
cache stats.  I then did the same thing, except that I ran each position for
10 seconds, 60 seconds total.  I reported the difference between the second
run and the first run, to eliminate the major initialization.  To clarify:
there is a mandatory number of cache misses, because everything has to be
filled in once even if the cache is 16X too big.  Those counts are included
in the total.  There is also a bunch of cache misses for the initialization
code that I factored out, although I am not sure that was reasonable; it was
a small number anyway, and everything I initialize I use later (mainly attack
tables, etc.).  There is some start-of-search work that I can't eliminate,
which is why I said part of the data has some initialization in it.  I.e.,
root.c runs only once and is not part of the actual search, but it is
included.  Opening files, initializing hash, etc. were _not_ included, as the
first run accounted for that and I subtracted it from the total run.  Could I
have done any better, now that you know how I ran it?  I don't see how.
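
To spell that subtraction out with hypothetical counts (these numbers are
made up purely to show the arithmetic, not actual measurements):

/* Differencing the two runs cancels the one-time initialization misses
 * (opening files, initializing hash), leaving the search's own misses.
 * The counts below are hypothetical. */
#include <stdio.h>

int main(void) {
    long setup_only  =  120000;   /* run 1: start, set position, quit       */
    long with_search = 9120000;   /* run 2: same, plus 60 seconds of search */
    printf("misses attributable to the search: %ld\n",
           with_search - setup_only);
    return 0;
}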

Of course it is _always_ easy to criticize or to guess.  I tried to offer
_real_ data.  This little program allows a lot of tuning, although I wanted
to set the line size to 8, which is what actually gets read/written from
memory, but it won't go below 16 for some oddball reason.  Set associativity
can range from 1 (direct mapped) to N (where N = the number of lines, making
it fully associative).  You pick the parameters; I'll run it, if that will
make you feel better...
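
In terms of the toy simulator sketched earlier in this post (again my
illustration, not the real tool), those two extremes are just different
values of the ways argument:

/* Reusing Cache/cache_new from the sketch above.  64KB total, 16-byte lines. */
Cache *direct = cache_new(64 * 1024, 16, 1);              /* direct mapped     */
Cache *full   = cache_new(64 * 1024, 16, 64 * 1024 / 16); /* fully associative */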

There is also a unified L2 cache, but its numbers perfectly matched the
L1D + L1I totals, and I thought it more interesting to have separate numbers
for code and data...

Add 'em up if you prefer a unified number.

I'm not planning on writing a paper on this, so I'm not going to waste days
on the data.  I came up with some numbers that at _least_ are based on some
sort of reasonable approach to measuring cache usage.  And it is a third data
point that directly supports my 512K/1024K/2048K data and Eugene's 1.5M/3.0M
data.  We both show speed improvements as cache gets bigger, beyond 512KB.
So did the cachegrind program above.  Three out of four showing improvement
is fairly convincing.  One out of four seems to be an anomaly of some sort,
as yet unexplained.

Note that inclusive vs. exclusive caching is not an issue, since I ignored L2
completely.  (I don't believe the cachegrind program understands exclusive
caches, but I did not look at the source to check.)

It could be done better.  But it is _definitely_ far better than what we had
24 hours ago to discuss: current Crafty, known hash size, known test set,
etc...


>
>-Tom


