Computer Chess Club Archives



Subject: Re: Sempron vs. Athlon 64: Proof that Crafty's working set is < 256k

Author: Robert Hyatt

Date: 08:10:33 08/22/04


On August 22, 2004 at 03:03:08, Tom Kerrigan wrote:

>On August 21, 2004 at 00:15:55, Robert Hyatt wrote:
>
>>I had no "strawman" argument.  You said, and I quote, "this proves that the
>>working set of crafty is < 256K".  I said, and I quote "this proves _no_ such
>>thing."
>>
>>That has been the limit of my argument.  Your test "proved" nothing.  Your
>>conclusion might well be right for all I know.  But your test certainly doesn't
>>prove that it is.
>>
>>All the other faulty logic is irrelevant.  I provided data a long while back on
>
>Just because you call my logic faulty and irrelevant doesn't make it so.

Yes it does.  You said "I ran crafty on a 256K and 512K L2 processor, and the
speed was the same.  That proves that the working set is < 256K."

That is wrong.  It is wrong for obvious reasons.  And it will _always_ be wrong.

Your conclusion may be perfectly accurate for all I know.  That isn't the point.
Your experiment was flawed and showed _nothing_, except strange behavior, because
two other similar experiments (on different processors) produced significant
speedups as the cache got bigger.


>
>The only analysis of the situation that you've given (in the posts to Sune
>below) is BLATANTLY faulty. Saying that you access your attack bitboard arrays
>at random is ridiculous. For that to be true, every position you visit and
>evaluate must be completely different from the next. How is that possible when
>the rules of chess only allow you to move 1-2 pieces at a time? And a randomly
>generated position is most likely not even legal?


Simple.  I access a batch of random attack entries.  I then access a _lot_ of
other stuff before I come back to the attack entries.  Your 256K cache has 4K
lines.  I don't know what the set associativity is, as AMD has had lots of
options there in recent history.  But that further reduces the number of
"buckets" to stuff things into.  My Xeon claims 16-way set associativity, with
128-byte lines.  That turns into a paltry 256 sets.  It is very easy to get a
bad physical memory layout where you don't even use all of those sets, and
where some sets get badly overloaded.

The problem is that one random access eats up 128 bytes of my L2, or 64 bytes
of the AMD L2.  That is why the "perft" test is interesting but has nothing to
do with the speed of playing chess: it does tend to sit in cache completely,
since all it does is simple stuff.  But fetching 64- or 128-byte chunks of
stuff from RAM changes things when only 8 of those bytes are used.  That is
the point.

>
>But let's say for the moment that your implication is true and Crafty evaluates
>completely random positions one after the other. That means increasing L2 cache
>would STILL improve Crafty's performance because you'd be randomly accessing a
>bigger random subset of your working set.
>
>The only way that increasing L2 cache wouldn't affect Crafty's performance
>(assuming a large working set) is if you accessed your arrays in a carefully
>bizarre manner. Do you have code in Crafty to move a piece in such a way that it
>causes a cache miss every time you evaluate it?


Do you understand how cache gets filled on misses?  It doesn't take totally
random access to cause that.  On the AMD, with 64-byte lines, each miss flushes
64 bytes out, after they were brought in and only 8 were used.

Again, I am not going to argue about what my working set is as I don't care.  I
am simply saying your test was flawed, and that it produced data that _clearly_
contradicted two other tests that were run.

I give students a programming assignment to figure out L1/L2 cache size, line
size, set associativity, and access times.  The one thing they can't control is
physical memory placement, since they are running under an O/S.  That makes
things more complicated.  If you have tried, or do try to do this, you'll see
the problem.  And you will see why running such a test program with just two
different suspected cache sizes, and then concluding something because both ran
at the same speed, is simply meaningless for drawing any conclusion.  If you
don't see that, there's little more I can say about it.



>
>The only remotely plausible explanation for not seeing an increase in chess
>program* performance going from 256k to 512k cache is that virtually all of the
>data it needs fits in 256k of cache.

Or that all the data it needs requires 2MB of cache.  _Either_ is equally
plausible.


>
>* Not some arbitrary hypothetical contrived program
>
>(Actually, if you want even more data to support my conclusion, I just ran
>Crafty on a dual proc with a shared memory bus. When I run two copies
>simultaneously, each copy is exactly as fast as when I only run one copy.
>Meaning that memory accesses are so sparse that there might as well be no
>contention.)
>
>-Tom


That is data I can't reproduce, which leads me to believe you have something
seriously wrong.  Every dual/quad I have slows down when I run two copies:
something in the range of 7% for my dual, 10% for my older quad.



