Computer Chess Club Archives

Subject: Re: Full circle

Author: Vincent Diepeveen
Date: 06:29:52 09/03/03
On September 02, 2003 at 22:56:43, Robert Hyatt wrote:

So Crafty will never run very well on 8 of those Opterons then, because latency
will be > 10 us.

>On September 02, 2003 at 17:24:58, Vincent Diepeveen wrote:
>
>>On September 01, 2003 at 23:50:09, Robert Hyatt wrote:
>>
>>>On August 29, 2003 at 12:46:04, Vincent Diepeveen wrote:
>>>
>>>>On August 29, 2003 at 08:53:50, Robert Hyatt wrote:
>>>>
>>>>>On August 29, 2003 at 02:34:42, Johan de Koning wrote:
>>>>>
>>>>>>On August 28, 2003 at 11:45:35, Robert Hyatt wrote:
>>>>>>
>>>>>>>On August 28, 2003 at 01:50:46, Johan de Koning wrote:
>>>>>>>
>>>>>>>>On August 27, 2003 at 12:25:51, Robert Hyatt wrote:
>>>>>>>>
>>>>>>>>>On August 26, 2003 at 21:12:45, Johan de Koning wrote:
>>>>>>>>
>>>>>>>>[snip]
>>>>>>>>
>>>>>>>>It seems we're finally back to where you and me started off.
>>>>>>>>
>>>>>>>>>>You keep saying that copy/make causes problems with cache-to-memory traffic.
>>>>>>>>>>Here I was just saying it doesn't, if cache is plenty.
>>>>>>>>>
>>>>>>>>>Here is the problem:
>>>>>>>>>
>>>>>>>>>When you write to a line of cache, you _guarantee_ that entire line of cache
>>>>>>>>>is going to be written back to memory.  There are absolutely no exceptions to
>>>>>>>>>that.  So copying from one cache line to another means that "another line" is
>>>>>>>>>going to generate memory traffic.
>>>>>>>>
>>>>>>>>Here is the solution: write-through caches were abandoned a long time ago.
>>>>>>>
>>>>>>>I'm not talking about write-through.
>>>>>>
>>>>>>I'm glad you aren't. :-)
>>>>>>
>>>>>>>  I am talking about write-back.  Once
>>>>>>>you modify a line of cache, that line of cache _is_ going to be written back
>>>>>>>to memory.  When is hard to predict, but before it is replaced by another cache
>>>>>>>line, it _will_ be written back.  So you write one byte to cache on a PIV, you
>>>>>>>are going to dump 128 bytes back to memory at some point.  With only 4096 lines
>>>>>>>of cache, it won't be long before that happens...  And there is no way to
>>>>>>>prevent it.
>>>>>>
>>>>>>Sure, every dirty cache line will be written back at *some* point. But you're
>>>>>>allowed to use or update it a million times before it is flushed only once.
>>>>>>Number of cache lines has nothing to do with it. On a lean and empty system
>>>>>>some lines might even survive until after program termination.
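
[Editor's sketch] The disagreement above turns on how often a dirty line is actually flushed. The toy model below is a rough illustration only: it assumes a fully associative LRU cache, which real L2 caches are not, and borrows the 4096-line, 128-byte figures quoted in the thread. `WriteBackCache`, `hot_line_writebacks` and `streaming_writebacks` are hypothetical names invented for this example.

```python
from collections import OrderedDict


class WriteBackCache:
    """Toy fully associative LRU write-back cache (an assumption, not real hardware)."""

    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines
        self.line_size = line_size
        self.lines = OrderedDict()  # line tag -> dirty flag, oldest entry first
        self.writebacks = 0         # dirty lines flushed to memory on eviction

    def _touch(self, addr, dirty):
        tag = addr // self.line_size
        if tag in self.lines:
            # Hit: keep the line dirty if it was already dirty, move it to MRU.
            dirty = self.lines.pop(tag) or dirty
        elif len(self.lines) >= self.num_lines:
            # Miss with a full cache: evict the least recently used line.
            _, was_dirty = self.lines.popitem(last=False)
            if was_dirty:
                self.writebacks += 1
        self.lines[tag] = dirty

    def read(self, addr):
        self._touch(addr, False)

    def write(self, addr):
        self._touch(addr, True)


def hot_line_writebacks(writes=100_000):
    """Write the same address repeatedly: the line is dirty but never evicted."""
    cache = WriteBackCache(num_lines=4096, line_size=128)
    for _ in range(writes):
        cache.write(0)
    return cache.writebacks


def streaming_writebacks(distinct_lines=8192):
    """Write once to each of many distinct lines: evictions flush dirty lines."""
    cache = WriteBackCache(num_lines=4096, line_size=128)
    for i in range(distinct_lines):
        cache.write(i * 128)
    return cache.writebacks


if __name__ == "__main__":
    print("hot line:", hot_line_writebacks())
    print("streaming:", streaming_writebacks())
```

In this model, repeated writes to a hot line cost no write-backs until the line is evicted, while a working set larger than the cache forces one write-back per evicted dirty line. Which of the two regimes a given chess program is in is exactly what the posters disagree about.
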
>>>>>
>>>>>Number of cache lines has everything to do with it.  If you can keep 4K
>>>>>chunks of a program in memory, and the program is _way_ beyond 4K chunks
>>>>>in size of the "working set", then cache is going to thrash pretty badly.
>>>>>I've already reported that I've tested on 512K, 1024K and 2048K processors,
>>>>>and that I have seen an improvement every time L2 gets bigger.
>>>>>
>>>>>As I said initially, my comments were _directly_ related to Crafty.  Not to
>>>>>other mythical programs nor mythical processor architectures.  But for Crafty,
>>>>>copy/make was slower on an architecture that is _very_ close to the PIV of
>>>>>today, albeit with 1/2 the L2 cache, and a much shorter pipeline.
>>>>>
>>>>>
>>>>>>
>>>>>>>>And for good reason, think of the frequency at which data is written (eg just
>>>>>>>>stack frame). Once CPU speed / RAM speed hits 10 or so, write-through cache will
>>>>>>>>cause almost any program to run RAM bound.
>>>>>>>
>>>>>>>Sure, but that wasn't what I was talking about.  Once a line is "dirty" it is
>>>>>>>going back to memory when it is time to replace it.  With just 4K lines of
>>>>>>>cache, they get recycled very quickly.
>>>>>>>
>>>>>>>>
>>>>>>>>>>>  I claimed that for _my_ program,
>>>>>>>>>>>copy/make burned the bus up and getting rid of it made me go 25% faster.
>>>>>>>>>>
>>>>>>>>>>And I suspect this was because of a tiny cache that couldn't even hold the
>>>>>>>>>>heavily used stuff.
>>>>>>>>>
>>>>>>>>>This was on a (originally) pentium pro, with (I believe) 256K of L2 cache.
>>>>>>>>
>>>>>>>>L2 is not a good place to keep your heavily used data.
>>>>>>>
>>>>>>>There's no other choice.  L1 is not big enough for anything.
>>>>>>
>>>>>>It's big enough to hold your position and top of stack. It's even big enough to
>>>>>>hold *my* position of 22000 bytes, except for the rarely addressed parts.
>>>>>>
>>>>>
>>>>>It isn't big enough to hold even the stuff I need to generate moves.  I have
>>>>>multiple arrays of 64 X 256 X 8bytes, that I use repeatedly.  One of those
>>>>>is enough to zap L1, although I don't need the entire table at one shot.  But
>>>>>I do need parts of four of those, and that is just for starters...
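
[Editor's sketch] The size claim above can be checked with simple arithmetic. The snippet compares one 64 x 256 x 8-byte table against the 8K and 16K L1 data cache sizes quoted elsewhere in this thread. The function names are hypothetical, and the sketch says nothing about how much of each table is actually touched per node.

```python
def table_bytes(squares=64, states=256, entry_bytes=8):
    """Size in bytes of one 64 x 256 table of 8-byte entries."""
    return squares * states * entry_bytes


def l1_multiples(l1_bytes):
    """How many times larger one full table is than an L1 data cache."""
    return table_bytes() / l1_bytes


if __name__ == "__main__":
    print("one table:", table_bytes(), "bytes")
    for l1 in (8 * 1024, 16 * 1024):
        print(f"{l1 // 1024}K L1 data cache: table is {l1_multiples(l1):.0f}x larger")
```

One full table is 131072 bytes (128 KB), so it cannot fit in either L1 size. Whether the frequently accessed parts fit is a separate question that this arithmetic does not settle.
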
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>The less heavily used data will live briefly in the LRU lines but is typically
>>>>>>not dirty. Though it is certainly possible to get unlucky and flush hot data,
>>>>>>depending on memory lay-out and program flow.
>>>>>>
>>>>>>>  IE the pentium
>>>>>>>pro had 16K of L1, 8K data, 8K instruction.  Newer pentiums are not much
>>>>>>>better although the 8K instruction has been replaced by the new trace cache
>>>>>>>that holds more than 8KB.  And the data cache is up to 16K.  However, I have
>>>>>>>run personally on xeons with 512K L2, 1024K L2 and 2048K L2 and I didn't see
>>>>>>>any significant difference in performance for my program...  Bigger is slightly
>>>>>>>better in each case, but it was never "big enough".
>>>>>>
>>>>>>I guess most of your tables are pretty sparse in terms of access frequency. So
>>>>>>you might get away with 2048 lines of L2. In fact I'm pretty sure you get away
>>>>>>with it since a few RAM accesses per node would kill any 1+ MN/s badly.
>>>>>>
>>>>>>But regarding L1 size: Intel's policy simply sucks. :-)
>>>>>
>>>>>I wouldn't say it "sucks".  You _can_ get a 2048K L2 cache xeon.  If you can
>>>>>afford it.  :)
>>>>
>>>>http://www.intel.com/ebusiness/products/server/processor/xeon_mp/index.htm
>>>>
>>>>But that 2MB L2 cache Xeon, of course ignoring that it is priced about what my
>>>>car is worth, it is clocked to only 2.8Ghz.
>>>>
>>>>So that will be blown away by any $60 K7 processor.
>>>
>>>What 60 buck chip is going to rip that 2.8ghz xeon???
>>>
>>>
>>>>
>>>>Not to mention opteron.
>>>>
>>>>Talking about opteron, when are you going to buy that 64 bits cpu?
>>>>
>>>>For 10 years you have been crying about getting 64 bits. Now there is a 64 bits
>>>>cpu and you don't have such a system yet?
>>>
>>>Eh?  I've had an alpha for 5 years.
>>>
>>>I'm in the process of ordering a 4-node dual opteron cluster to play with.
>>>I'll have some soon, so that I can get real data rather than hyperbole.
>>
>>Cool. one partition with a new beta kernel (if they can do that) or MPI driven?
>>
>>Best regards,
>>Vincent
>>
>
>Wake up and _read_.
>
>I said a "cluster" of four dual opterons.  There are no "partitions" or
>anything else there.  A cluster = 4 machines connected with something that is
>fast, probably myrinet since cLAN is too expensive.
>
>>>>
>>>>IBM E325 looks cool. Quad opteron 2.0Ghz.
>>>>
>>>>>Bigger L1 would be nice, and it will probably happen soon.
>>>>>
>>>>>Of course X86 is crippled for many more reasons than that.  8 registers for
>>>>>starters.  :)
>>>>
>>>>16 registers at x86-64.
>>>
>>>So?  I said "X86".
>>>
>>>>
>>>>Rumours say intel will take till 2005 before you can buy their x86-64 cpu. Are
>>>>you going to wait till then or already buy an opteron?
>>>
>>>I will have some dual opterons before long.
>>
>>>>
>>>>>
>>>>>>
>>>>>>... Johan