Author: Robert Hyatt
Date: 20:26:28 08/23/04
On August 23, 2004 at 22:10:46, Tom Kerrigan wrote:

>On August 23, 2004 at 17:06:25, Robert Hyatt wrote:
>
>>I hate to cloud all the disinformation here with real data, but sometimes it
>>does tend to shed light on a topic that gets talked about with no supporting
>>data of any kind.
>
>Your data may be real but it's still crap. Come on, 16MB of L1 cache with 16
>byte lines and apparently no set associativity? It's like you went out of your
>way to simulate something as far removed from reality as possible.

I made the lines 16 bytes to make the cache as efficient as possible. Longer lines make it _less_ efficient except for pre-fetching, which is _not_ what this is about. Set associativity was set to 16 for no good reason other than that is what my PIV has in L2.

Short lines are best for cache line utilization. Long lines favor serial reads, where pre-fetching helps. But my goal was to get to the point of _minimum_ cache misses, where pre-fetching becomes a moot point. I wanted to know how much L1I cache was needed to drop instruction fetch misses to near zero. That gives an upper/lower bound on working set size. Ditto for data. The point is _not_ the cache hits/misses. It is what size of cache is needed to minimize cache misses, period. _That_ gives a very good estimate of the working set size...

>First, no set associativity means that you thrash more, esp. because of random
>hash probes. And you could be tallying up thousands of cache misses just because
>your program frequently accesses two ints of data that are at unlucky memory
>addresses (more likely with smaller cache sizes). So does that add 8 bytes or
>hundreds of bytes to your working set? Impossible to tell from your "real data."

Where do you get "no set associativity"? I didn't even mention it. But it _was_ 16-way. If you think about what I ran, and the "point" I picked to estimate working set size, you'd realize your nit-pick above is moot.
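The estimation idea described above can be sketched in a few lines: grow a simulated set-associative LRU cache until its miss count falls to the compulsory floor (one fill per distinct line touched); the smallest cache that reaches that floor bounds the working set size. This is a hypothetical illustration, not cachegrind's actual implementation; the trace, 16-byte line size, and function names are illustrative.

```python
# Hypothetical sketch of working-set estimation by finding the cache size
# at which misses drop to the compulsory floor.  Not cachegrind's code.
from collections import OrderedDict

LINE = 16  # short lines, as in the post, to measure utilization rather than prefetch

def misses(trace, n_sets, assoc, line=LINE):
    """Count misses of a set-associative LRU cache over a list of byte addresses."""
    sets = [OrderedDict() for _ in range(n_sets)]
    n = 0
    for addr in trace:
        tag = addr // line
        s = sets[tag % n_sets]
        if tag in s:
            s.move_to_end(tag)           # hit: refresh LRU position
        else:
            n += 1                       # miss: fill, evicting the LRU line if full
            if len(s) >= assoc:
                s.popitem(last=False)
            s[tag] = True
    return n

def wss_bound(trace, line=LINE):
    """Smallest simulated cache (in bytes) whose misses equal the compulsory floor."""
    floor = len({a // line for a in trace})  # one mandatory fill per distinct line
    n_sets = 1
    while misses(trace, n_sets, assoc=16) > floor:
        n_sets *= 2                          # double the cache until misses stop
    return n_sets * 16 * line
```

A trace that cycles repeatedly over 4KB of data thrashes a tiny cache but, once the cache covers the working set, incurs only the initial fills, and `wss_bound` reports 4KB.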
>Second, by setting the cache lines so small, there's no way for us to tell if
>all those cache misses are Crafty accessing its working set at random or Crafty
>doing hash table read/writes.

With 64K of hash _total_? Give me a break... It _isn't_ random hash reads/writes, obviously. That is why I ran with the minimum hash size, to limit hash table cache usage. Again, I ran with 64K bytes of _total_ hash space. 64K. Not 64M. That is how I drove the miss rate to near nothing, except for the initial cache fills.

>The proper way to do this experiment is to set the simulator to something
>realistic, like 256/512/1024 KB of unified 16-way set associative (LRU eviction)
>64 byte per line cache, like you find in AMD processors.

It won't do unified in L1. I wanted specifically to find instruction size and data size, and this did exactly that, pretty reasonably. I even ran with a _fully associative_ cache, but it made no significant difference, hence my lack of interest. Once you stop cache misses, which I did, associativity is meaningless.

>Run the simulation for 1 minute, then 2 minutes. Take delta cache misses, divide
>by delta nodes searched, and that gives you misses per node after the cache is
>warmed. When the misses stay relatively constant between cache sizes, you've
>found the working set size.

What are you talking about now? What I reported was this:

1. I started Crafty, set up the position, terminated the program, and got the cache stats.
2. I then did the same thing, except that I ran each position for 10 seconds, 60 seconds total.

I reported the difference between the second run and the first run, to eliminate the major initialization. To clarify, there is a mandatory number of cache misses, because stuff has to be filled in once even if the cache is 16X too big. Those counts are included in the total. There are a bunch of cache misses for the initialization stuff that I factored out, although I am not sure that was reasonable.
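Both measurement schemes discussed above reduce to simple deltas between two runs: the subtraction of an initialization-only run from a full run (the method actually used), and the suggested delta-misses-per-node after warm-up. A minimal sketch, with invented field names, since no real stats format is given in the post:

```python
# Hypothetical sketch of the two delta methods discussed in the post.
# Field names ("L1I", "misses", "nodes") are illustrative.

def search_only_misses(init_run, full_run):
    """Two-run subtraction: run 1 starts the program, sets up the position,
    and exits; run 2 also searches for 10 seconds per position.  Subtracting
    run 1's counts removes the one-time initialization misses."""
    return {cache: full_run[cache] - init_run[cache] for cache in full_run}

def misses_per_node(snap_a, snap_b):
    """The suggested alternative: delta misses over delta nodes between two
    snapshots taken after the cache is warm."""
    dm = snap_b["misses"] - snap_a["misses"]
    dn = snap_b["nodes"] - snap_a["nodes"]
    return dm / dn
```

Either way, the compulsory fills and startup costs cancel out of the difference, which is the point of taking deltas at all.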
But it was a small number anyway, and everything that I initialize I use later anyway (mainly attack tables, etc.). There is some start-of-search stuff that I can't eliminate, which is why I said part of the data has some initialization in it. I.e., root.c is only executed once and is not part of the actual search, but it is included. Opening files, initializing hash, etc. were _not_ included, as the first run accounted for that and I subtracted it from the total run to exclude it.

Could I have done any better, now that you know how I ran it? I don't see how. Of course it is _always_ easy to either criticize or guess. I tried to offer _real_ data.

This little program allows a lot of tuning, although I wanted to set the line size to 8, which is actually what gets read/written from memory, but it won't go below 16 for some oddball reason. Set associativity can range from 1 (direct mapped) to N (where N = the number of lines, making it fully associative). You pick the settings and I'll run it, if that will make you feel better...

There is also a unified L2 cache, but it perfectly matched the L1D + L1I numbers, and I thought it more interesting to have separate numbers for code and data... add 'em up if you prefer a unified number.

I'm not planning on writing a paper on this, so I'm not going to waste days on the data. I came up with some numbers that at _least_ are based on some sort of reasonable approach to measuring cache usage. And it is a third data point that directly supports my 512K/1024K/2048K data and Eugene's 1.5M/3.0M data. We both show speed improvements as cache gets bigger, beyond 512KB. So did the cachegrind program above. Three out of four showing improvement is fairly convincing; one out of four seems to be an anomaly of some sort, as yet unexplained.

Note that inclusive vs. exclusive is not an issue, since I ignored L2 completely (I don't believe the cachegrind program understands exclusive caches, but I did not look at the source to check). It could certainly be done better.
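The "add 'em up if you prefer a unified number" remark is just a sum over the split caches; a tiny sketch, with invented field names, for anyone combining split L1I/L1D figures into a unified-equivalent miss rate:

```python
# Hypothetical sketch: combining separate instruction/data cache statistics
# into one unified view, as the post suggests.  Field names are invented.

def unified(l1i, l1d):
    """Sum split I/D cache stats and derive the combined miss rate."""
    refs = l1i["refs"] + l1d["refs"]
    miss = l1i["misses"] + l1d["misses"]
    return {"refs": refs, "misses": miss, "miss_rate": miss / refs}
```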
But it is _definitely_ something far better than what we had 24 hours ago to discuss. Current Crafty. Known hash size. Known test set. Etc...

>-Tom