Computer Chess Club Archives


Subject: Re: Sempron vs. Athlon 64: Proof that Crafty's working set is < 256k

Author: Robert Hyatt

Date: 21:15:55 08/20/04


On August 20, 2004 at 22:19:11, Tom Kerrigan wrote:

>On August 20, 2004 at 21:28:20, Robert Hyatt wrote:
>
>>On August 20, 2004 at 17:52:54, Tom Kerrigan wrote:
>>
>>>On August 20, 2004 at 17:36:51, Robert Hyatt wrote:
>>>
>>>...
>>>
>>>>As I said, I don't know.  But clearly testing 256K vs 512K doesn't provide much
>>>>actual data to draw conclusions from.  Obviously the 2048K chip was not 5x
>>>
>>>What is it that you don't know? If a program's working set doesn't fit into
>>>cache, then adding more cache will always increase performance, assuming a
>>>completely random access pattern.
>>
>>Why do you get to make such an assumption?  I +specifically+ try to do lots of
>>sequential accesses to take advantage of cache line fills that pre-fetch data...
>
>I used the word "assuming" here to indicate the condition that my statement
>applies to. I obviously don't think that all programs have completely random
>memory access patterns; what kind of idiot would think that? Yet you use it as a
>strawman argument for the rest of your post.

I had no "strawman" argument.  You said, and I quote, "this proves that the
working set of Crafty is < 256K".  I said, and I quote, "this proves _no_ such
thing."

That has been the limit of my argument.  Your test "proved" nothing.  Your
conclusion might well be right for all I know.  But your test certainly doesn't
prove that it is.

All the other faulty logic is irrelevant.  I provided data a long while back on
cache usage.  Eugene ran his 1.5 MB vs. 3 MB test a while back.  We both saw
improvement...  You didn't.  What to conclude?  Certainly _not_ that the working
set is < 256K.
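
As a toy illustration of why "no speedup from 256K to 512K" cannot by itself bound the working set, here is a minimal sketch of a cache model.  Everything in it is an assumption made for illustration: the 64-byte line size, the fully associative LRU policy, the 1024K buffer, and the access pattern (repeated front-to-back sweeps, one access per line).  It says nothing about Crafty's real memory behavior; it only shows that a non-random pattern over a working set larger than both caches can look identical at 256K and 512K.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed cache line size for this toy model


def miss_rate(cache_bytes, addresses):
    """Miss rate of a fully associative LRU cache over a byte-address trace."""
    capacity = cache_bytes // LINE_BYTES  # number of lines the cache can hold
    cache = OrderedDict()                 # keys are line numbers, oldest first
    misses = 0
    total = 0
    for addr in addresses:
        total += 1
        line = addr // LINE_BYTES
        if line in cache:
            cache.move_to_end(line)       # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return misses / total


def cyclic_sweep(working_set_bytes, passes):
    """Walk a buffer front to back several times, one access per cache line."""
    trace = []
    for _ in range(passes):
        trace.extend(range(0, working_set_bytes, LINE_BYTES))
    return trace


K = 1024
trace = cyclic_sweep(1024 * K, passes=4)  # hypothetical 1024K working set
for size_k in (256, 512, 2048):
    print(f"{size_k:5d}K cache: miss rate {miss_rate(size_k * K, trace):.2f}")
```

In this model the 256K and 512K caches both miss on every access, and only the 2048K cache, which holds the whole buffer, shows an improvement.  A trace-driven simulator fed with real addresses would be needed to say anything about an actual program.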


>
>You're right that there's usually a lot of variation between systems with
>different sized L2 caches. That's a good explanation for why you and Eugene saw
>speedups with more cache. In my case, my numbers are from systems that are
>identical except for L2 cache size.
>
>There are two other ways that I can think of to approach this question:
>
>1) If Crafty is constantly hammering main memory, it would scale very poorly
>with processor clock speed. Is this the case? I've seen posts that indicate that
>Crafty scales perfectly with processor clock speed.
>


I can't measure that.  I have never had two boxes with the same everything
except for raw clock speed.  I have had access to machines that were identical
in all respects but one: bus speed, for example, or L2 cache size.  But I can't
answer about the clock speed.  For example, my 400 vs 550 MHz Xeons had a
different bus speed.  The only personal cache data I have measured came from
400 MHz PII Xeons with 512K, 1024K and 2048K.  But I cannot claim that the
processors were identical in all other respects (set associativity, for
example).  I'd think so, as they were all produced and were current at the same
point in time, but I am not sure.  I was trying to answer the question someone
posed: "are the pricier CPUs worth the cost with Crafty?"

The cache simulator mentioned earlier ought to be able to answer this very
specifically.  I have it on my list of things to fiddle with this weekend if all
goes quietly...




>2) If Crafty is constantly hammering main memory, you would get a very poor
>speedup running several threads on a shared memory bus machine (like a quad
>Xeon). What kind of speedups do you see? 1.1x? 1.2x? Or closer to 3.5x-4.0x?

That is wrong.  My quads all have 4-way memory interleaving, which provides 4x
the bandwidth of a normal 1-way, 1-CPU box.  However, on an 8-way Xeon I saw
horrible performance because they still rely on 4-way interleaving.  That is why
I have never run on an 8-way box.  One day of benchmarking showed they were not
so good, i.e. well under 1.5x faster than a 4-way box, and that is not factoring
in the SMP overhead in the tree search.

I have also posted some SMP numbers in the past that show that on my old quad
700, running 4 Crafty processes was not a pure 4x faster.  Each additional
Crafty process runs about 7% slower, if I recall correctly.  So the 4-way
interleaving is not a perfect solution, but it isn't bad.
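
To put the recalled 7% figure in perspective, here is a rough arithmetic sketch.  The 7% is the from-memory number above, and how it accumulates is not stated, so the function shows two hypothetical readings: each extra process costs every process a flat 7% (linear), or the 7% compounds per extra process.  Neither is claimed to be the measured behavior.

```python
def aggregate_speedup(n_procs, slowdown_per_extra=0.07, compounding=False):
    """Combined throughput of n_procs independent processes, relative to one.

    slowdown_per_extra is the assumed per-process slowdown caused by each
    additional process; both accumulation rules below are assumptions.
    """
    extra = n_procs - 1
    if compounding:
        per_proc = (1.0 - slowdown_per_extra) ** extra  # 7% compounds
    else:
        per_proc = 1.0 - slowdown_per_extra * extra     # flat 7% per extra process
    return n_procs * per_proc


for n in range(1, 5):
    linear = aggregate_speedup(n)
    compound = aggregate_speedup(n, compounding=True)
    print(f"{n} processes: {linear:.2f}x (linear), {compound:.2f}x (compounding)")
```

Under either reading, four processes land a little above 3x combined throughput rather than 4x.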

And for heaven's sake let's not rehash the latency vs bandwidth stuff again.
4-way interleaving quadruples bandwidth, but leaves latency unchanged.
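
The bandwidth/latency distinction can be sketched with a toy bank model.  The assumptions here are all illustrative: requests go to consecutive lines, request i maps to bank i % n_banks, at most one request is issued per cycle, and each access keeps its bank busy for a fixed, made-up number of cycles.  It is not a model of any particular chipset.

```python
def total_cycles(n_requests, n_banks, bank_busy_cycles):
    """Cycles needed to finish a stream of sequential requests.

    Request i is sent to bank i % n_banks.  Requests issue in order, at most
    one per cycle, and a bank cannot start a new access until its previous
    access has finished.
    """
    bank_free_at = [0] * n_banks
    clock = 0   # earliest cycle at which the next request may issue
    finish = 0  # cycle at which the last access completes
    for i in range(n_requests):
        bank = i % n_banks
        start = max(clock, bank_free_at[bank])
        done = start + bank_busy_cycles
        bank_free_at[bank] = done
        finish = max(finish, done)
        clock = start + 1
    return finish


BUSY = 8  # assumed cycles per access (hypothetical value)
N = 1000

for banks in (1, 4):
    stream = total_cycles(N, banks, BUSY)
    single = total_cycles(1, banks, BUSY)  # latency of one isolated access
    print(f"{banks} bank(s): {stream} cycles for {N} requests, "
          f"single-access latency {single} cycles")
```

With these numbers the 4-bank configuration finishes the stream in roughly a quarter of the time, while a single isolated access takes the same 8 cycles either way.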



>
>-Tom



Last modified: Thu, 15 Apr 21 08:11:13 -0700
