Author: Vincent Diepeveen
Date: 08:26:28 09/29/02
On September 27, 2002 at 23:50:28, Robert Hyatt wrote:

See www.top500.org for this supercomputer and many others. Obviously a supercomputer is defined by having huge bandwidth, especially for I/O. Is 1 terabyte a second enough? Anyway, this is distracting again. Your definition of a supercomputer is very much limited to your own programming capabilities!!

>On September 27, 2002 at 12:35:39, Vincent Diepeveen wrote:
>
>>On September 26, 2002 at 11:46:34, Robert Hyatt wrote:
>>
>>>On September 26, 2002 at 08:17:11, Vincent Diepeveen wrote:
>>>
>>>>On September 25, 2002 at 16:10:51, Robert Hyatt wrote:
>>>>
>>>>>On September 25, 2002 at 14:45:01, Vincent Diepeveen wrote:
>>>>>
>>>>>>On September 25, 2002 at 14:17:55, Robert Hyatt wrote:
>>>>>>
>>>>>>>On September 25, 2002 at 11:49:25, Russell Reagan wrote:
>>>>>>>
>>>>>>>>On September 25, 2002 at 08:10:06, Vincent Diepeveen wrote:
>>>>>>>>
>>>>>>>>>I cannot use select() at all, as I would limit myself to < 128 processor partitions then.
>>>>>>>>
>>>>>>>>Does WaitForSingleObject or WaitForMultipleObjects allow you to use more than 128 processors?
>>>>>>>>
>>>>>>>>>Also, I have no idea how to get it to work, or whether it can do it 400 times a second instantly.
>>>>>>>>
>>>>>>>>See the problems Microsoft causes? They always have to be different (in a bad, evil kind of way).
>>>>>>>
>>>>>>>
>>>>>>>The very concept of synchronizing > 128 processes with such a system call defies any sort of logic I can think of. Doing it hundreds of times a second only guarantees horrible performance. Doing this every now and then would be ok. But not "400 times a second". There are other ways to eliminate that kind of stuff...
>>>>>>
>>>>>>The first 10 plies you get out of the hashtable with just 1 processor, obviously.
>>>>>>
>>>>>>Only after that can the other 511 join in. They manage themselves after this. I don't want to let them poll while they are not searching. I have no option there.
>>>>>
>>>>>You simply don't understand memory architecture yet. "Polling" works perfectly. You wait on a pointer to become non-zero, for example. While you are "polling" it takes _no_ bandwidth, because the value lands in your local cache. All the cache controllers have to do is inform each other when something changes. I.e., this is how the Intel boxes work, and they do it quite well.
>>>>>
>>>>>NUMA shouldn't change that basic idea...
>>>>
>>>>There is a good reason why it goes wrong even on NUMA.
>>>>
>>>>The basic problem is that on the NUMA machines each node is 2 processors and not 1; in the latter case you would be correct.
>>>
>>>This varies. I have run on an alpha with 4 cpus per node. I have run on a xeon with 1 cpu per "node" (non-numa). Intel is building multiple NUMA platforms, although it seems they are settling in to 4 cpus per node as the basic building block, something like N-cube did years ago.
>>>
>>>>
>>>>We can picture this dual-cpu node as follows:
>>>>
>>>> CPU1 -----
>>>>           |--- HUB --| Memory banks for CPU1 & CPU2
>>>> CPU2 ----- |
>>>>            |
>>>>        SN0 router
>>>>
>>>>So if 1 cpu is hammering on a single cache line in local memory, it is using the hub to access the memory banks. This is true for the R14000 design, which has 2GB of local memory at the memory banks and 8MB of L2 cache.
>>>
>>>That should _not_ be working like that. If a cpu is hammering on a single cache line, it should not even be talking to the hub. It should be talking to its _local_ cache.
>>>That's the point of what are often called "shadow"
>>
>>As you know, one must prevent race conditions. Especially hardware race conditions.
>>
>>Supercomputers can therefore not do this PC thing, Bob.
>
>Vincent, the machine you are using is _not_ a super-computer. It is a collection of off-the-shelf MIPS microprocessors, built into a NUMA-based cluster.
>
>Please use the _right_ term. A cray is a super-computer. Or a hitachi, or a fujitsu. Not a MIPS-based machine.
>
>As far as this "pc" thing goes, _your_ company might not do it. _Others_ do. It is a well-known approach to cache coherency, and _nobody_ tolerates race conditions...
>
>>
>>In order to prevent race conditions in hardware, everything first talks to the hub, and the hub then talks to the memory.
>>
>>In that way the SN0 routers do not face race conditions versus processors poking directly into memory.
>>
>>So every memory reference goes through the hub. I say 'hub' because that's what SGI calls it.
>>
>>>operations. You make a "shadow" of something by putting it in cache, to get off the system bus. That is how my spinlocks work on my quad xeon, in fact:
>>>
>>># define Lock(p) while(exchange(&p,1)) while(p)
>>>
>>>The "exchange" is horribly slow on the intel box, because it locks the bus down tight while reading/writing to accomplish the exchange. But it does it in an atomic manner, which is required for true locks.
>>>
>>>However, if the exchange says "value is already non-zero" I then do a simple while(p) which spins on the value of p in the L1 cache. No bus activity at all while I am spinning. When _another_ processor writes to p, my cache controller finds out and invalidates that cache line. That causes it to do a cache line fill to get the new value. If that value is zero, the while(p) terminates and we go back to try a real exchange(p,1) atomically.
>>
>>On this big supercomputer the hub sits in between, connected to the SN0 router, which needs cache lines in an atomic way too.
>>
>>Locking on NUMA happens in shared memory (see the advice in the manuals), preferably with atomic operations; if those take care of it, then it goes ok.
>>
>>In this case we are not busy locking anyhow. We are simply trying to get a cache line over and over again, and the processor can issue so many requests a second that, provably in software, it keeps the SN0 or the other cpu very busy.
>>
>>Obviously, your whole theory might have some truth for the pc. I'm not sure here. I'm not sure whether it is a smart idea to do it on the pc either, cache coherency or not. The chipset of the dual K7 can also deliver only 600MB/s (I do not see how this will soon change for big systems in the future either; the Hammer seems to have a good alternative for up to 8 processors, though I would already be happy if they manage to get it working dual within a year or 2). In short, the other cpu can't do a thing with the chipset while it's just busy poking into the other cpu. Using L2 cache coherency and/or the chipset, I don't care: it's keeping it busy *somehow*.
>
>
>No it isn't, but I am not going to waste a lot more time trying to explain how modern multiprocessor machines take care of cache coherency. It is a well-known issue, and it has been solved by everyone. And it doesn't require that a cpu accessing data in its local cache keep other cpus busy asking "what are you doing now?" type questions... Find a good architecture book, look for multiprocessor architectures and cache coherency in the index, and dive in. It isn't that complicated...
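For reference, the "shadow lock" described above is the classic test-and-test-and-set pattern. Below is a minimal sketch of that idea, assuming gcc-style __sync builtins rather than the actual exchange() asm in Crafty; all names here are illustrative, not anyone's real engine code:

  /* Sketch of a test-and-test-and-set ("shadow") spin lock using
   * gcc __sync builtins; the same idea as the Lock(p) macro above. */
  #include <pthread.h>
  #include <stdio.h>

  static volatile int lock_word = 0;     /* 0 = free, 1 = held         */
  static long counter = 0;               /* data protected by the lock */

  static void spin_lock(volatile int *p)
  {
    /* Attempt the expensive atomic exchange only when a plain read of
     * the cached copy says the lock looks free; while it is held we
     * spin on the local cache line and generate no bus traffic until
     * the owner writes 0 and the line is invalidated.                 */
    while (__sync_lock_test_and_set(p, 1))
      while (*p)
        ;
  }

  static void spin_unlock(volatile int *p)
  {
    __sync_lock_release(p);              /* store 0, release semantics */
  }

  static void *worker(void *arg)
  {
    for (int i = 0; i < 1000000; i++) {
      spin_lock(&lock_word);
      counter++;
      spin_unlock(&lock_word);
    }
    return NULL;
  }

  int main(void)
  {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expect 2000000)\n", counter);
    return 0;
  }

Compile with something like gcc -O2 -pthread. The point is only that the inner while (*p) spins on the copy in the local cache, and the atomic exchange is attempted only when that copy suggests the lock may be free.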
>
>
>>
>>Any cpu can deliver up to 600MB/s easily; this is *no problem*.
>>
>>Suppose the next thing. We have 2 players in a field. One player is shipping messages at maximum speed to the other player, which MUST get answered: I WANT TO KNOW WHAT HAPPENS ABOUT THIS.
>>
>>How they solve this in the PC I don't know. Ignore it?
>
>They design _around_ it so that it doesn't happen...
>
>
>>
>>I know the intels have a somewhat more primitive way of handling such requests than the supercomputers/amd chipset.
>
>When you run on a supercomputer, tell me more. I have run on every "supercomputer" ever made. :)
>
>>
>>Alpha, R14000, AMD: they all use a master/slave idea. Each cpu can set ownership on a cache line. A great system. I know the intel Xeon chipset didn't take this into account. I guess the P4 Xeons don't either?
>
>
>Nope, the intel approach makes much more sense, and, in fact, their approach is used by _most_. It's a well-known problem with well-known solutions that have been around for _years_...
>
>
>>
>>Perhaps the lack of this advanced feature, which is great if you take it into account, is the reason why the intel chipset is fast for you?
>
>
>I don't have a clue what you are talking about...
>
>
>>
>>>That is called a "shadow lock" and is _the_ solution to keep spinning processes off the memory bus so that there is no interference with other processors.
>>
>>>
>>>
>>>>
>>>>However, it can't be working only within the L2 cache, because it is a read straight from local memory.
>>>
>>>Then you don't have a real L2 cache there. Because the purpose of any cache is to _avoid_ going to real memory whenever possible, whether it is local or not.
>>>
>>>>
>>>>Additionally, in all supercomputer designs each cache line of memory has a number of bits added to it recording who owns the cache line.
>>>>
>>>>You are correctly referring to the fact that if I do a write to the memory now, cpu1 is only referencing the local memory.
>>>>
>>>>So your assumption would be correct if each node were a single cpu.
>>>>
>>>>However, it is a dual and sometimes even a quad. This means, in short, that several cpus, at least CPU2, are also hammering on the memory at the same time, and the bandwidth for this is only a fraction of what the 8MB L2 cache delivers. Every reference to main memory simply goes through the Hub, which delivers 600MB/s effectively.
>>>
>>>However, if a node is a quad, then each processor had _better_ have a real processor cache or the machine is _never_ going to perform, since the router speed can't even feed one cpu fully, much less 4... Better check the processor specs on that box, as I can't imagine MIPS doing something that poorly, based on past things they have produced.
>>>
>>>>
>>>>We can both imagine that this is a major problem on supercomputers.
>>>>
>>>>Therefore the approach of letting them idle while not doing a job is the right one.
>>>
>>>
>>>I simply don't think they operate as you describe, otherwise they would perform like dogs, period. Each CPU _must_ have a local processor cache, or else a group of four cpus must have a 4-port local cache they share. They have to have _something_ to avoid memory I/O, even if it is local memory. Memory is _always_ slow. Local or not.
>>>
>>>
>>>>
>>>>You are obviously correct if the system design were one cpu per node. That is not the case, however.
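One common middle ground between the two positions above on a NUMA box, offered only as a general technique (neither Crafty nor Diep necessarily does this): give every helper cpu its own flag, padded out to a full cache line and placed in that cpu's local memory (e.g. by first-touch placement), so that an idle searcher spins purely on its own node and neither the hub nor the SN0 router sees traffic until the master writes that one line. A hedged sketch in C, all names and constants illustrative:

  /* Sketch only: one spin flag per helper, padded to a cache-line
   * boundary so two flags never share a line (no false sharing), and
   * ideally first-touched by the helper so it lands in local memory. */
  #define CACHE_LINE 128
  #define MAX_CPUS   512

  struct work_flag {
    volatile int go;                        /* 0 = idle, 1 = work ready      */
    char pad[CACHE_LINE - sizeof(int)];     /* keep each flag on its own line */
  };

  static struct work_flag flags[MAX_CPUS];

  /* Helper side: spin on the local copy; the line stays in this cpu's
   * cache, so no hub/router traffic while idle.                        */
  static void wait_for_work(int cpu)
  {
    while (!flags[cpu].go)
      ;                                     /* spins in local cache          */
    flags[cpu].go = 0;
  }

  /* Master side: one remote write per helper wakes it by invalidating
   * exactly one cache line on one node.                                */
  static void hand_out_work(int cpu)
  {
    flags[cpu].go = 1;
  }

The padding avoids false sharing between flags; the local placement means the only remote transaction per wake-up is the master's single write to that helper's line.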
>>>>
>>>>Another effect is that the SN0 router is also attached to the Hub. I will not describe the effects here, as I have no theory to base myself on, but in my experiments I saw that if one of the CPUs is poking into its own local memory, the SN0 router is somehow busy too with things that slow down the entire search, simply because it eats away bandwidth.
>>>>
>>>>My knowledge of what intelligent algorithms the SN0 router uses to predict things is very limited. Near to zero, actually. But somehow I feel it is not a good idea to also keep the SN0 router busy indirectly with reads to local memory while CPU2 is searching its own way.
>>>>
>>>>So the absolutely best solution to the problem is pretty trivial, and that's simply letting processes that do not search idle.
>>>
>>>
>>>What is wrong with the above is the definition of "idle". If _your_ process is not running, what is the O/S doing? It is executing code all the time itself, the famous "idle loop"...
>>>
>>>
>>>>
>>>>The real basic problem one is confronted with in general is that initially an n-processor partition idles a lot more than a dual or quad cpu box does.
>>>>
>>>>So on the PC we both hardly feel the fact that we are poking into a single cache line. At the supers this is a different thing, however.
>>>>
>>>>They do not like me wasting bandwidth at all :)
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>Just imagine how much bandwidth that takes away.
>>>>>>
>>>>>>The machine has 1 TB a second of bandwidth. This 512p partition has about half of that (the machine has 1024p in total). If all these procs are spinning around, that will confuse the SN0 routers completely.
>>>>>
>>>>>Not with cache.
>>>>>
>>>>>>
>>>>>>From that 0.5 TB of bandwidth, about 0.25 TB a second is reserved for harddisk i/o. I can't use it. I use the other 600MB/s, which the hub delivers for 2 processors, as memory bandwidth.
>>>>>>
>>>>>>We can all imagine what happens if the hubs are only busy delivering reads to the RAM.
>>>>>>
>>>>>>Right now each processor spins on its own shared memory variables; that's simply *not* a good design idea for NUMA. It does not actually work for any machine above 8 procs.
>>>>>
>>>>>It would work fine on a Cray with 32...
>>>>>
>>>>>>
>>>>>>Also, with shared memory buses you can't work with that design at all.
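Since the disagreement above boils down to "spin while idle" versus "let idle searchers sleep", here is a rough sketch of the sleeping alternative using POSIX condition variables. This is the generic pthreads pattern, not the actual Diep or Crafty code; select() or WaitForSingleObject would be the equivalent route on other systems, and all names here are illustrative:

  /* Sketch: idle helper searchers block in pthread_cond_wait() instead
   * of spinning, so they consume neither cpu time nor hub/router
   * bandwidth; the master wakes them once the first plies are done.   */
  #include <pthread.h>
  #include <stdio.h>

  static pthread_mutex_t work_mutex = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  work_ready = PTHREAD_COND_INITIALIZER;
  static int work_available = 0;         /* protected by work_mutex    */

  static void *helper(void *arg)
  {
    long id = (long)arg;
    pthread_mutex_lock(&work_mutex);
    while (!work_available)              /* sleeps, no polling at all  */
      pthread_cond_wait(&work_ready, &work_mutex);
    pthread_mutex_unlock(&work_mutex);
    printf("helper %ld: starting to search\n", id);
    return NULL;
  }

  int main(void)
  {
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
      pthread_create(&t[i], NULL, helper, (void *)i);

    /* ... the master searches the first plies alone here ... */

    pthread_mutex_lock(&work_mutex);
    work_available = 1;
    pthread_cond_broadcast(&work_ready); /* wake every idle helper     */
    pthread_mutex_unlock(&work_mutex);

    for (int i = 0; i < 4; i++)
      pthread_join(t[i], NULL);
    return 0;
  }

The trade-off is exactly the one argued over in this thread: a cond_wait wake-up goes through the kernel and costs on the order of microseconds per helper, whereas a cached spin flag wakes up in roughly the time of one cache-line transfer but keeps the processor nominally busy while it waits.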