Author: Robert Hyatt
Date: 20:48:39 09/29/02
On September 29, 2002 at 11:26:28, Vincent Diepeveen wrote:

>On September 27, 2002 at 23:50:28, Robert Hyatt wrote:
>
>www.top500.org to see this supercomputer and many others.
>
>obviously a supercomputer is defined by having huge bandwidth
>especially for i/o. 1 terabyte a second is enough?

Not particularly.  If you want to talk "clusters" then fine, for those
applications that work well on clusters.  I am talking about the classic
"supercomputer", which is pretty well-defined.  And there we are talking
about memory bandwidth for a single cpu, because a single cpu has to
execute code that has a tremendous demand for memory bandwidth.

A Cray, for example.  Say the Cray-3 quadrant at NCAR: some 50 gigabytes
per second for a single cpu...  Having 1000 processors each doing a
trifling amount of I/O or memory transfers in parallel is _not_ a
supercomputer.

>
>anyway this is distracting again.
>
>your definition of a supercomputer is very limited to your own
>programming capabilities!!

I doubt my programming abilities are nearly as limited as yours.  But that
has nothing to do with the definition.  It isn't "my definition".  It is
the definition used around the world...

>
>>On September 27, 2002 at 12:35:39, Vincent Diepeveen wrote:
>>
>>>On September 26, 2002 at 11:46:34, Robert Hyatt wrote:
>>>
>>>>On September 26, 2002 at 08:17:11, Vincent Diepeveen wrote:
>>>>
>>>>>On September 25, 2002 at 16:10:51, Robert Hyatt wrote:
>>>>>
>>>>>>On September 25, 2002 at 14:45:01, Vincent Diepeveen wrote:
>>>>>>
>>>>>>>On September 25, 2002 at 14:17:55, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>On September 25, 2002 at 11:49:25, Russell Reagan wrote:
>>>>>>>>
>>>>>>>>>On September 25, 2002 at 08:10:06, Vincent Diepeveen wrote:
>>>>>>>>>
>>>>>>>>>>i cannot use select() at all as i limit myself to < 128
>>>>>>>>>>processor partitions then.
>>>>>>>>>
>>>>>>>>>Does WaitForSingleObject or WaitForMultipleObjects allow you to
>>>>>>>>>use more than 128 processors?
>>>>>>>>>
>>>>>>>>>>also i have no idea how to get it to work and whether it can do
>>>>>>>>>>it 400 times a second instantly.
>>>>>>>>>
>>>>>>>>>See the problems Microsoft causes?  They always have to be
>>>>>>>>>different (in a bad, evil kind of way).
>>>>>>>>
>>>>>>>>The very concept of synchronizing > 128 processes with such a
>>>>>>>>system call defies any sort of logic I can think of.  Doing it
>>>>>>>>hundreds of times a second only guarantees horrible performance.
>>>>>>>>Doing this every now and then would be ok.  But not "400 times a
>>>>>>>>second".  There are other ways to eliminate that kind of stuff...
>>>>>>>
>>>>>>>the first 10 ply you get out of the hashtable with just 1 processor
>>>>>>>obviously.
>>>>>>>
>>>>>>>Only after that can the other 511 join in.  they manage themselves
>>>>>>>after this.  I don't want to let them poll while they are not
>>>>>>>searching.  i have no option there.
>>>>>>
>>>>>>You simply don't understand memory architecture yet.  "Polling"
>>>>>>works perfectly.  You wait on a pointer to become non-zero, for
>>>>>>example.  While you are "polling" it takes _no_ bandwidth because
>>>>>>the value lands in your local cache.  All the cache controllers have
>>>>>>to do is inform each other when something changes.  I.e. this is how
>>>>>>the Intel boxes work, and they do that quite well.
>>>>>>
>>>>>>NUMA shouldn't change that basic idea...
>>>>>
>>>>>There is a good reason why it even goes wrong at NUMA.
>>>>>
>>>>>The basic problem is that on the NUMA machines, too, each node is 2
>>>>>processors and not 1; in the latter case you would be correct.
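
To make the "polling" I described above concrete, here is a minimal sketch
(made-up names, not actual Crafty code).  While the flag is unchanged, the
spinning reads are satisfied from the processor's own cache and generate no
bus or router traffic; only the single write by the other cpu causes an
invalidate and one cache line re-fill:

   /* worker spins locally; the master sets the flag once */
   volatile int work_available = 0;

   void wait_for_work(void) {
     while (!work_available)
       ;                        /* reads hit the local cache, no bus use */
   }

   void post_work(void) {
     work_available = 1;        /* one write, one invalidate, one re-fill */
   }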
>>>>
>>>>This varies.  I have run on an alpha with 4 cpus per node.  I have run
>>>>on a xeon with 1 cpu per "node" (non-NUMA).  Intel is building
>>>>multiple NUMA platforms, although it seems they are settling on 4 cpus
>>>>per node as the basic building block, something like N-Cube did years
>>>>ago.
>>>>
>>>>>
>>>>>We can picture this dual cpu as follows:
>>>>>
>>>>>   CPU1 -----
>>>>>             |--- HUB --| memory banks for CPU1 & CPU2
>>>>>   CPU2 -----|    |
>>>>>                  |
>>>>>             SN0 router
>>>>>
>>>>>So if 1 cpu is hammering on a single cache line in local memory,
>>>>>it is using the hub to access the memory banks.  This is true for
>>>>>the R14000 design, which has 2GB local memory at the memory banks
>>>>>and 8MB L2 cache.
>>>>
>>>>That should _not_ be working like that.  If a cpu is hammering on a
>>>>single cache line, it should not even be talking to the hub.  It
>>>>should be talking to its _local_ cache.  That's the point of what are
>>>>often called "shadow"
>>>
>>>As you know one must prevent race conditions.  Especially hardware
>>>race conditions.
>>>
>>>Supercomputers can therefore not do this PC thing Bob.
>>
>>Vincent, the machine you are using is _not_ a supercomputer.  It is a
>>collection of off-the-shelf MIPS microprocessors, built into a
>>NUMA-based cluster.
>>
>>Please use the _right_ term.  A Cray is a supercomputer.  Or a Hitachi,
>>or a Fujitsu.  Not a MIPS-based machine.
>>
>>As far as this "pc" thing, _your_ company might not do it.  _Others_ do.
>>It is a well-known approach to cache coherency, and _nobody_ tolerates
>>race conditions...
>>
>>>
>>>In order to prevent race conditions in hardware, everything first talks
>>>to the hub, and the hub then talks to the memory.
>>>
>>>In that way the SN0 routers are not faced with race conditions from
>>>processors poking directly into memory.
>>>
>>>So every memory reference goes through the hub.  i say 'hub' because
>>>that's what SGI calls it.
>>>
>>>>operations.  You make a "shadow" of something by putting it in cache,
>>>>to get off the system bus.  That is how my spinlocks work on my quad
>>>>xeon, in fact:
>>>>
>>>># define Lock(p) while(exchange(&p,1)) while(p)
>>>>
>>>>The "exchange" is horribly slow on the intel box, because it locks the
>>>>bus down tight while reading/writing to accomplish the exchange.  But
>>>>it does it in an atomic manner, which is required for true locks.
>>>>
>>>>However, if the exchange says "value is already non-zero" I then do a
>>>>simple while(p) which spins on the value of p in the L1 cache.  No bus
>>>>activity at all while I am spinning.  When _another_ processor writes
>>>>to p, my cache controller finds out and invalidates that cache line.
>>>>That causes it to do a cache line fill to get the new value.  If that
>>>>value is zero, the while(p) terminates and we go back to try a real
>>>>exchange(p,1) atomically.
>>>
>>>At this big supercomputer the hub is in between, connected to the SN0
>>>router, which needs cache lines too in an atomic way.
>>>
>>>Locking in NUMA happens in the shared memory (see the advice in the
>>>manuals), preferably as atomic operations; that takes care that it
>>>goes ok.
>>>
>>>In this case we are not busy locking anyhow.  We are simply trying to
>>>get a cache line over and over again, and the processor can issue so
>>>many requests a second that it provably, in software, keeps the SN0 or
>>>the other cpu very busy.
>>>
>>>Obviously, your whole theory might have some truth for the pc.  I'm not
>>>sure here.  I'm not sure whether it is a smart idea to do it on the pc
>>>either, cache coherency or not.  The chipset of the dual K7 can also
>>>deliver only 600MB/s (i do not see how this will soon change for big
>>>systems in the future either; the Hammer seems to have a good
>>>alternative for up to 8 processors, though i would be happy already if
>>>they manage to get it to work dual within a year or 2).  In short, the
>>>other cpu can't do a thing with the chipset when it's just busy poking
>>>into the other cpu.  Using L2 cache coherency and/or the chipset, i
>>>don't care: it's keeping it busy *somehow*.
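
For reference, the Lock() macro I quoted above expands to roughly the
following.  The exchange() here is sketched with a gcc builtin rather than
the actual inline asm, so treat the details as illustrative, not as the
real Crafty source:

   static volatile int lock_var = 0;

   /* atomic xchg; locks the bus for the duration of the exchange */
   static int exchange(volatile int *p, int v) {
     return __sync_lock_test_and_set(p, v);
   }

   void Lock(void) {
     while (exchange(&lock_var, 1))   /* expensive atomic attempt */
       while (lock_var)               /* cheap spin in local cache */
         ;
   }

   void Unlock(void) {
     __sync_lock_release(&lock_var);  /* store 0 to release the lock */
   }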
>>
>>No it isn't, but I am not going to waste a lot more time trying to
>>explain how modern multiprocessor machines take care of cache coherency.
>>It is a well-known issue, and it has been solved by everyone.  And it
>>doesn't require that a cpu accessing data in its local cache keep other
>>cpus busy asking "what are you doing now?" type questions...  Find a
>>good architecture book, look for multiprocessor architectures and cache
>>coherency in the index, and dive in.  It isn't that complicated...
>>
>>>
>>>Any cpu can deliver up to 600MB/s easily, this is *no problem*.
>>>
>>>Suppose the next thing.  We have 2 players in a field.  One player is
>>>at maximum speed shipping messages to the other player, which MUST get
>>>answered: I WANT TO KNOW WHAT HAPPENS ABOUT THIS.
>>>
>>>How they solve this in the PC i don't know.  Ignore it?
>>
>>They design _around_ it so that it doesn't happen...
>>
>>>
>>>I know the intels have a bit more primitive way of handling such
>>>requests than the supercomputers/amd chipset.
>>
>>When you run on a supercomputer, tell me more.  I have run on every
>>"supercomputer" ever made. :)
>>
>>>
>>>Alpha, R14000, AMD: they all use a master/slave idea.  Each cpu can set
>>>ownership on a cache line.  A great system.  I know the intel Xeon
>>>chipset didn't take this into account.  I guess the P4 Xeons don't
>>>either?
>>
>>Nope, the intel approach makes much more sense, and, in fact, their
>>approach is used by _most_.  It's a well-known problem with well-known
>>solutions that have been around for _years_...
>>
>>>
>>>Perhaps the lack of this advanced feature, which is great if you take
>>>it into account, is the reason why the intel chipset is fast for you?
>>
>>I don't have a clue what you are talking about...
>>
>>>
>>>>That is called a "shadow lock" and is _the_ solution to keep spinning
>>>>processes off the memory bus so that there is no interference with
>>>>other processors.
>>>>
>>>>>
>>>>>However it can't be working only within the L2 cache, because it is
>>>>>a read straight from local memory.
>>>>
>>>>Then you don't have a real L2 cache there.  Because the purpose of any
>>>>cache is to _avoid_ going to real memory whenever possible, whether it
>>>>is local or not.
>>>>
>>>>>
>>>>>Additionally, in all supercomputer designs each cache line of memory
>>>>>has a number of bits added to it for who owns the memory cache line.
>>>>>
>>>>>You are correctly referring to the fact that if i do a write to the
>>>>>memory now, cpu1 is only referencing the local memory.
>>>>>
>>>>>So your assumption would be correct if each node were a single cpu.
>>>>>
>>>>>However it is a dual and sometimes even a quad.  This means in short
>>>>>that many cpu's, at least CPU2, are hammering at the same time into
>>>>>the memory, and the bandwidth from this is only a fraction of what
>>>>>the 8MB L2 cache delivers.  Every reference to main memory simply
>>>>>goes through the hub, which delivers 600MB/s effectively.
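
One coherency detail worth a concrete example: with invalidation-based
protocols, two cpus writing to _different_ variables that happen to share
one cache line will ping-pong that line between their caches.  The usual
fix is to pad per-cpu data out to the line size.  The 128 bytes below is an
assumption (a common secondary cache line size on these MIPS parts), and
the names are made up:

   #define CACHE_LINE 128                 /* assumed L2 line size */

   struct per_cpu_flag {
     volatile int stop;                   /* one flag per processor */
     char pad[CACHE_LINE - sizeof(int)];  /* one cache line per flag */
   };

   struct per_cpu_flag flags[512];        /* one slot per cpu, 512p run */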
>>>>
>>>>However, if a node is a quad, then each processor had _better_ have a
>>>>real processor cache or the machine is _never_ going to perform, since
>>>>the router speed can't even feed one cpu fully, much less 4...  Better
>>>>check the processor specs on that box, as I can't imagine MIPS doing
>>>>something that poorly, based on past things they have produced.
>>>>
>>>>>
>>>>>We can both imagine that this is a major problem at supercomputers.
>>>>>
>>>>>Therefore the approach to let them idle while not doing a job is
>>>>>the right one.
>>>>
>>>>I simply don't think they operate as you describe, otherwise they
>>>>would perform like dogs, period.  Each CPU _must_ have a local
>>>>processor cache, or else a group of four cpus must have a 4-port local
>>>>cache they share.  They have to have _something_ to avoid memory I/O,
>>>>even if it is local memory.  Memory is _always_ slow.  Local or not.
>>>>
>>>>>
>>>>>You are obviously correct if the system design were each node is a
>>>>>single cpu.  That is not the case however.
>>>>>
>>>>>Another effect is that the SN0 router is also attached to the hub.
>>>>>i will not describe the effects here as i have no theory to base
>>>>>myself upon, but from my experiments i saw that if one of the CPU's
>>>>>is poking into its own local memory, the SN0 router is somehow busy
>>>>>too with things, which simply slows down the entire search as it
>>>>>eats away bandwidth.
>>>>>
>>>>>My knowledge of what intelligent algorithms the SN0 router uses to
>>>>>predict things is very limited.  Near to zero actually.  But somehow
>>>>>i feel it is not a good idea to keep the SN0 router also busy
>>>>>indirectly with reads to the local memory while CPU2 is searching
>>>>>its own way.
>>>>>
>>>>>So the absolutely best solution to the problem is pretty trivial,
>>>>>and that's simply letting processes that do not search idle.
>>>>
>>>>What is wrong with the above is the definition of "idle".  If _your_
>>>>process is not running, what is the O/S doing?  It is executing code
>>>>all the time itself, the famous "idle loop"...
>>>>
>>>>>
>>>>>The real basic problem one is confronted with in general is that
>>>>>initially an n-processor partition idles a lot more than a dual or
>>>>>quad cpu box does.
>>>>>
>>>>>So we both hardly feel the fact that at the PC we are poking into a
>>>>>single cache line.  At the supers this is a different thing however.
>>>>>
>>>>>They do not like me wasting bandwidth at all :)
>>>>>
>>>>>>
>>>>>>>
>>>>>>>Just imagine how much bandwidth that takes away.
>>>>>>>
>>>>>>>The machine has 1 TB a second bandwidth.  This 512p partition has
>>>>>>>about half of that (the machine has 1024p in total).  If all these
>>>>>>>procs are spinning around, that will confuse the SN0 routers
>>>>>>>completely.
>>>>>>
>>>>>>Not with cache.
>>>>>>
>>>>>>>
>>>>>>>From that 0.5 TB bandwidth, about 0.25 TB a second is reserved for
>>>>>>>harddisk i/o.  I can't use it.  I use the other 600MB/s which the
>>>>>>>hub delivers for 2 processors for memory bandwidth.
>>>>>>>
>>>>>>>We can all imagine what happens if the hubs are only busy
>>>>>>>delivering reads to the RAM.
>>>>>>>
>>>>>>>Right now each processor spins on its own shared memory variables;
>>>>>>>that's simply *not* a good design idea for NUMA.  It is not
>>>>>>>actually working for any machine above 8 procs.
>>>>>>
>>>>>>It would work fine on a Cray with 32...
>>>>>>
>>>>>>>
>>>>>>>Also with shared memory buses you can't work with that design at
>>>>>>>all.
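
And for completeness, the "let them idle" approach Vincent is arguing for
looks roughly like this with POSIX threads.  The worker really sleeps in
the kernel instead of spinning, at the cost of a system call on every
wakeup; the names are made up:

   #include <pthread.h>

   pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
   pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
   int work_ready = 0;

   void worker_wait(void) {
     pthread_mutex_lock(&m);
     while (!work_ready)
       pthread_cond_wait(&c, &m);  /* cpu is handed back to the O/S */
     work_ready = 0;
     pthread_mutex_unlock(&m);
   }

   void master_wake(void) {
     pthread_mutex_lock(&m);
     work_ready = 1;
     pthread_cond_signal(&c);      /* wake exactly one sleeping worker */
     pthread_mutex_unlock(&m);
   }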