Author: Robert Hyatt
Date: 10:24:39 09/03/03
On September 03, 2003 at 09:29:52, Vincent Diepeveen wrote:

>On September 02, 2003 at 22:56:43, Robert Hyatt wrote:
>
>So no crafty running very well at 8 of those opterons ever then because latency
>will be > 10 us.

Just hide and watch. Even myrinet has a latency of a microsecond or _two_
at most, not > 10.

>
>>On September 02, 2003 at 17:24:58, Vincent Diepeveen wrote:
>>
>>>On September 01, 2003 at 23:50:09, Robert Hyatt wrote:
>>>
>>>>On August 29, 2003 at 12:46:04, Vincent Diepeveen wrote:
>>>>
>>>>>On August 29, 2003 at 08:53:50, Robert Hyatt wrote:
>>>>>
>>>>>>On August 29, 2003 at 02:34:42, Johan de Koning wrote:
>>>>>>
>>>>>>>On August 28, 2003 at 11:45:35, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>On August 28, 2003 at 01:50:46, Johan de Koning wrote:
>>>>>>>>
>>>>>>>>>On August 27, 2003 at 12:25:51, Robert Hyatt wrote:
>>>>>>>>>
>>>>>>>>>>On August 26, 2003 at 21:12:45, Johan de Koning wrote:
>>>>>>>>>
>>>>>>>>>[snip]
>>>>>>>>>
>>>>>>>>>It seems we're finally back to where you and me started off.
>>>>>>>>>
>>>>>>>>>>>You keep saying that copy/make causes problems with cache to memory traffic.
>>>>>>>>>>>Here I was just saying it doesn't, if cache is plenty.
>>>>>>>>>>
>>>>>>>>>>Here is the problem:
>>>>>>>>>>
>>>>>>>>>>When you write to a line of cache, you _guarantee_ that entire line of cache
>>>>>>>>>>is going to be written back to memory. There are absolutely no exceptions to
>>>>>>>>>>that. So copying from one cache line to another means that "another line" is
>>>>>>>>>>going to generate memory traffic.
>>>>>>>>>
>>>>>>>>>Here is the solution: write-through caches were abandoned a long time ago.
>>>>>>>>
>>>>>>>>I'm not talking about write-through.
>>>>>>>
>>>>>>>I'm glad you aren't. :-)
>>>>>>>
>>>>>>>> I am talking about write-back. Once
>>>>>>>>you modify a line of cache, that line of cache _is_ going to be written back
>>>>>>>>to memory. When is hard to predict, but before it is replaced by another cache
>>>>>>>>line, it _will_ be written back. So you write one byte to cache on a PIV, you
>>>>>>>>are going to dump 128 bytes back to memory at some point. With only 4096 lines
>>>>>>>>of cache, it won't be long before that happens... And there is no way to
>>>>>>>>prevent it.
>>>>>>>
>>>>>>>Sure, every dirty cache line will be written back at *some* point. But you're
>>>>>>>allowed to use or update it a million times before it is flushed only once.
>>>>>>>Number of cache lines has nothing to do with it. On a lean and empty system
>>>>>>>some lines might even survive until after program termination.
>>>>>>
>>>>>>Number of cache lines has everything to do with it. If you can keep 4K
>>>>>>chunks of a program in memory, and the program is _way_ beyond 4K chunks
>>>>>>in size of the "working set", then cache is going to thrash pretty badly.
>>>>>>I've already reported that I've tested on 512K, 1024K and 2048K processors,
>>>>>>and that I have seen an improvement every time L2 gets bigger.
>>>>>>
>>>>>>As I said initially, my comments were _directly_ related to Crafty. Not to
>>>>>>other mythical programs nor mythical processor architectures. But for Crafty,
>>>>>>copy/make was slower on an architecture that is _very_ close to the PIV of
>>>>>>today, albeit with 1/2 the L2 cache, and a much shorter pipeline.
>>>>>>
>>>>>>>
>>>>>>>>>And for good reason, think of the frequency at which data is written (eg just
>>>>>>>>>stack frame). Once CPU speed / RAM speed hits 10 or so, write-through cache will
>>>>>>>>>cause almost any program to run RAM bound.
>>>>>>>>
>>>>>>>>Sure, but that wasn't what I was talking about. Once a line is "dirty" it is
>>>>>>>>going back to memory when it is time to replace it. With just 4K lines of
>>>>>>>>cache, they get recycled very quickly.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> I claimed that for _my_ program,
>>>>>>>>>>>>copy/make burned the bus up and getting rid of it made me go 25% faster.
>>>>>>>>>>>
>>>>>>>>>>>And I suspect this was because of a tiny cache that couldn't even hold the
>>>>>>>>>>>heavily used stuff.
>>>>>>>>>>
>>>>>>>>>>This was on a (originally) pentium pro, with (I believe) 256K of L2 cache.
>>>>>>>>>
>>>>>>>>>L2 is not a good place to keep your heavily used data.
>>>>>>>>
>>>>>>>>There's no other choice. L1 is not big enough for anything.
>>>>>>>
>>>>>>>It's big enough to hold your position and top of stack. It's even big enough to
>>>>>>>hold *my* position of 22000 bytes, except for the rarely addressed parts.
>>>>>>>
>>>>>>
>>>>>>It isn't big enough to hold even the stuff I need to generate moves. I have
>>>>>>multiple arrays of 64 X 256 X 8 bytes, that I use repeatedly. One of those
>>>>>>is enough to zap L1, although I don't need the entire table at one shot. But
>>>>>>I do need parts of four of those, and that is just for starters...
>>>>>>
>>>>>>>The less heavily used data will live briefly in the LRU lines but is typically
>>>>>>>not dirty. Though it is certainly possible to get unlucky and flush hot data,
>>>>>>>depending on memory lay-out and program flow.
>>>>>>>
>>>>>>>> IE the pentium
>>>>>>>>pro had 16K of L1, 8K data, 8K instruction. Newer pentiums are not much
>>>>>>>>better although the 8K instruction has been replaced by the new trace cache
>>>>>>>>that holds more than 8KB. And the data cache is up to 16K. However, I have
>>>>>>>>run personally on xeons with 512K L2, 1024K L2 and 2048K L2 and I didn't see
>>>>>>>>any significant difference in performance for my program... Bigger is slightly
>>>>>>>>better in each case, but it was never "big enough".
>>>>>>>
>>>>>>>I guess most of your tables are pretty sparse in terms of access frequency. So
>>>>>>>you might get away with 2048 lines of L2. In fact I'm pretty sure you get away
>>>>>>>with it since a few RAM accesses per node would kill any 1+ MN/s badly.
>>>>>>>
>>>>>>>But regarding L1 size: Intel's policy simply sucks. :-)
>>>>>>
>>>>>>I wouldn't say it "sucks". You _can_ get a 2048K L2 cache xeon. If you can
>>>>>>afford it. :)
>>>>>
>>>>>http://www.intel.com/ebusiness/products/server/processor/xeon_mp/index.htm
>>>>>
>>>>>But that 2MB L2 cache Xeon, of course ignoring that it is priced about what my
>>>>>car is worth, it is clocked to only 2.8GHz.
>>>>>
>>>>>So that will be blown away by any $60 K7 processor.
>>>>
>>>>What 60 buck chip is going to rip that 2.8GHz xeon???
>>>>
>>>>>
>>>>>Not to mention opteron.
>>>>>
>>>>>Talking about opteron, when are you going to buy that 64 bits cpu?
>>>>>
>>>>>For 10 years you have been crying about getting 64 bits. Now there is a 64 bits
>>>>>cpu and you don't have such a system yet?
>>>>
>>>>Eh? I've had an alpha for 5 years.
>>>>
>>>>I'm in the process of ordering a 4-node dual opteron cluster to play with.
>>>>I'll have some soon, so that I can get real data rather than hyperbole.
>>>
>>>Cool. one partition with a new beta kernel (if they can do that) or MPI driven?
>>>
>>>Best regards,
>>>Vincent
>>>
>>
>>Wake up and _read_.
>>
>>I said a "cluster" of four dual opterons. There are no "partitions" or
>>anything else there. A cluster = 4 machines connected with something that is
>>fast, probably myrinet since cLAN is too expensive.
>>
>>>>>
>>>>>IBM E325 looks cool. Quad opteron 2.0GHz.
>>>>>
>>>>>>Bigger L1 would be nice, and it will probably happen soon.
>>>>>>
>>>>>>Of course X86 is crippled for many more reasons than that. 8 registers for
>>>>>>starters. :)
>>>>>
>>>>>16 registers at x86-64.
>>>>
>>>>So? I said "X86".
>>>>
>>>>>
>>>>>Rumours say intel will take till 2005 before you can buy their x86-64 cpu. Are
>>>>>you going to wait till then or already buy an opteron?
>>>>
>>>>I will have some dual opterons before long.
>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>... Johan
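For readers who want to check the sizes being argued about, here is a rough back-of-the-envelope sketch. It uses only figures quoted in the thread (128-byte PIV cache lines, 4096 lines of L2, a 16K L1 data cache, tables of 64 X 256 X 8 bytes, a 22000-byte position). The names are made up for illustration, and the line count assumes contiguous, line-aligned data, which real programs may not have.

```python
# Back-of-the-envelope cache arithmetic; all figures are quoted in the thread.
# Names are hypothetical and exist only for this illustration.
LINE_BYTES = 128             # bytes written back per dirty line on a PIV
L2_LINES = 4096              # "only 4096 lines of cache"
L1_DATA_BYTES = 16 * 1024    # "the data cache is up to 16K"


def table_bytes(rows, cols, elem_bytes):
    """Size in bytes of a dense rows x cols table of fixed-size elements."""
    return rows * cols * elem_bytes


def lines_spanned(nbytes, line_bytes=LINE_BYTES):
    """Cache lines covered by nbytes of data, ASSUMING it is contiguous
    and starts on a line boundary (ceiling division)."""
    return -(-nbytes // line_bytes)


l2_bytes = L2_LINES * LINE_BYTES        # total L2 capacity implied by the thread
move_table = table_bytes(64, 256, 8)    # one of the 64 X 256 X 8 byte arrays
position_lines = lines_spanned(22000)   # lines covered by a 22000-byte position

print(l2_bytes // 1024, "K of L2")
print(move_table // 1024, "K per table,", move_table // L1_DATA_BYTES, "x the L1 data cache")
print(position_lines, "lines for a 22000-byte position")
```

This only counts capacity. It says nothing about how often a dirty line is actually evicted, which is the point the two sides disagree on.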
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.