Author: Robert Hyatt
Date: 20:50:28 09/27/02
On September 27, 2002 at 12:35:39, Vincent Diepeveen wrote:

>On September 26, 2002 at 11:46:34, Robert Hyatt wrote:
>
>>On September 26, 2002 at 08:17:11, Vincent Diepeveen wrote:
>>
>>>On September 25, 2002 at 16:10:51, Robert Hyatt wrote:
>>>
>>>>On September 25, 2002 at 14:45:01, Vincent Diepeveen wrote:
>>>>
>>>>>On September 25, 2002 at 14:17:55, Robert Hyatt wrote:
>>>>>
>>>>>>On September 25, 2002 at 11:49:25, Russell Reagan wrote:
>>>>>>
>>>>>>>On September 25, 2002 at 08:10:06, Vincent Diepeveen wrote:
>>>>>>>
>>>>>>>>I cannot use select() at all, as I limit myself to < 128 processor partitions then.
>>>>>>>
>>>>>>>Does WaitForSingleObject or WaitForMultipleObjects allow you to use more than 128 processors?
>>>>>>>
>>>>>>>>Also I have no idea how to get it to work, and whether it can do it 400 times a second instantly.
>>>>>>>
>>>>>>>See the problems Microsoft causes? They always have to be different (in a bad, evil kind of way).
>>>>>>
>>>>>>The very concept of synchronizing > 128 processes with such a system call defies any sort of logic I can think of. Doing it hundreds of times a second only guarantees horrible performance. Doing this every now and then would be OK, but not "400 times a second". There are other ways to eliminate that kind of stuff...
>>>>>
>>>>>The first 10 ply you get out of the hashtable with just 1 processor, obviously.
>>>>>
>>>>>Only after that can the other 511 join in. They manage themselves after this. I don't want to let them poll while they are not searching. I have no option there.
>>>>
>>>>You simply don't understand memory architecture yet. "Polling" works perfectly. You wait on a pointer to become non-zero, for example. While you are "polling" it takes _no_ bandwidth, because the value lands in your local cache. All the cache controllers have to do is inform each other when something changes. I.e., this is how the Intel boxes work, and they do that quite well.
>>>>
>>>>NUMA shouldn't change that basic idea...
>>>
>>>There is a good reason why it goes wrong even at NUMA.
>>>
>>>The basic problem is that at the NUMA machines, too, each node is 2 processors and not 1; in the latter case you would be correct.
>>
>>This varies. I have run on an alpha with 4 cpus per node. I have run on a xeon with 1 cpu per "node" (non-NUMA). Intel is building multiple NUMA platforms, although it seems they are settling in to 4 cpus per node as the basic building block, something like N-cube did years ago.
>>
>>>We can see this dual cpu as follows:
>>>
>>>   CPU1 -----
>>>             |--- HUB --| Memory banks for CPU1 & CPU2
>>>   CPU2 -----|
>>>                |
>>>            SN0 router
>>>
>>>So if 1 cpu is hammering on a single cache line in local memory, it is using the hub to access the memory banks. This is true for the R14000 design, which has 2GB local memory at the memory banks and 8MB L2 cache.
>>
>>That should _not_ be working like that. If a cpu is hammering on a single cache line, it should not even be talking to the hub. It should be talking to its _local_ cache. That's the point of what are often called "shadow"
>
>As you know, one must prevent race conditions. Especially hardware race conditions.
>
>Supercomputers can therefore not do this PC thing, Bob.

Vincent, the machine you are using is _not_ a super-computer. It is a collection of off-the-shelf MIPS microprocessors, built into a NUMA-based cluster. Please use the _right_ term. A cray is a super-computer. Or a hitachi, or a fujitsu. Not a MIPS-based machine.

As far as this "pc" thing goes, _your_ company might not do it. _Others_ do. It is a well-known approach to cache coherency, and _nobody_ tolerates race conditions...
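In code, the "wait on a pointer to become non-zero" polling described a few messages up looks roughly like the sketch below. This is only an illustration in modern C11 atomics with made-up names (pending_work, wait_for_work); it is not code from Crafty or Diep.

    #include <stdatomic.h>
    #include <stddef.h>

    struct work;                             /* hypothetical work descriptor */

    /* Shared location an idle processor watches; the master stores a
       non-NULL pointer here when it has work to hand out. */
    static struct work *_Atomic pending_work = NULL;

    struct work *wait_for_work(void)
    {
        struct work *w;
        /* While the pointer stays NULL, every read is served from this
           cpu's own cached copy of the line: no bus or interconnect
           traffic.  When another cpu stores a pointer, the coherency
           protocol invalidates the line and the next read sees it. */
        while ((w = atomic_load_explicit(&pending_work,
                                         memory_order_acquire)) == NULL)
            ;                                /* spin */
        return w;
    }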
>
>In order to prevent race conditions in hardware, everything first talks to the hub, and the hub then talks to the memory.
>
>That way the SN0 routers are not faced with race conditions versus processors poking directly into memory.
>
>So every memory reference goes through the hub. I say 'hub' because that is what SGI calls it.
>
>>operations. You make a "shadow" of something by putting it in cache, to get off the system bus. That is how my spinlocks work on my quad xeon, in fact:
>>
>># define Lock(p) while(exchange(&p,1)) while(p)
>>
>>The "exchange" is horribly slow on the intel box, because it locks the bus down tight while reading/writing to accomplish the exchange. But it does it in an atomic manner, which is required for true locks.
>
>>However, if the exchange says "value is already non-zero" I then do a simple while(p), which spins on the value of P in the L1 cache. No bus activity at all while I am spinning. When _another_ processor writes to p, my cache controller finds out and invalidates that cache line. That causes it to do a cache line fill to get the new value. If that value is zero, the while(P) terminates and we go back to try a real exchange(P,1) atomically.
>
>At this big supercomputer the hub sits in between, connected to the SN0 router, which needs cache lines in an atomic way too.
>
>Locking in NUMA happens in the shared memory (see the advice in the manuals), preferably as atomic operations; those take care that it goes OK.
>
>In this case we are not busy locking anyhow. We are simply trying to get a cache line over and over again, and the processor can deliver so many requests a second that it provably, in software, keeps the SN0 or the other cpu very busy.
>
>Obviously, your whole theory might have some truth for the pc. I'm not sure here. I'm not sure whether it is a smart idea to do it on the pc either, cache coherency or not. The chipset of the dual K7 can also deliver only 600MB/s (I do not see how this will soon change for big systems in the future either; the Hammer seems to have a good alternative for up to 8 processors, though I would be happy already if they manage to get it to work dual within a year or 2). In short, the other cpu can't do a thing with the chipset when it's just busy poking into the other cpu. Using L2 cache coherency and/or the chipset, I don't care: it's keeping it busy *somehow*.

No it isn't, but I am not going to waste a lot more time trying to explain how modern multiprocessor machines take care of cache coherency. It is a well-known issue, and it has been solved by everyone. And it doesn't require that a cpu accessing data in its local cache keep other cpus busy asking "what are you doing now?" type questions... Find a good architecture book, look for multiprocessor architectures and cache coherency in the index, and dive in. It isn't that complicated...
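For reference, the Lock(p) / exchange() scheme quoted above is the classic test-and-test-and-set spinlock. A minimal sketch in C11 atomics follows; it is illustrative only, with exchange() standing for an atomic swap such as the x86 XCHG instruction, and it is not the actual Crafty source:

    #include <stdatomic.h>

    typedef atomic_int spinlock_t;

    static void lock(spinlock_t *p)
    {
        /* The atomic exchange is the expensive part: it needs exclusive
           ownership of the cache line (historically a locked bus cycle). */
        while (atomic_exchange_explicit(p, 1, memory_order_acquire)) {
            /* Lock already held: spin with plain reads, which are
               satisfied from the local cache and cause no bus traffic
               until the holder writes 0 and the line is invalidated. */
            while (atomic_load_explicit(p, memory_order_relaxed))
                ;
        }
    }

    static void unlock(spinlock_t *p)
    {
        atomic_store_explicit(p, 0, memory_order_release);
    }

The inner read-only loop is exactly the "shadow" behaviour described above: while the lock is held, each waiter spins on its cached copy of the line and generates no memory or interconnect traffic.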
>
>Any cpu can deliver up to 600MB/s easily; this is *no problem*.
>
>Suppose the following. We have 2 players in a field. One player is, at maximum speed, shipping messages to the other player which MUST get answered: I WANT TO KNOW WHAT HAPPENS ABOUT THIS.
>
>How do they solve this in the PC? I don't know. Ignore it?

They design _around_ it so that it doesn't happen...

>
>I know the intels have a somewhat more primitive way of handling such requests than the supercomputers/amd chipset.

When you run on a supercomputer, tell me more. I have run on every "supercomputer" ever made. :)

>
>Alpha, R14000, AMD: they all use a master/slave idea. Each cpu can set ownership on a cache line. A great system. I know the intel Xeon chipset didn't take this into account. I guess the P4 Xeons don't either?

Nope, the intel approach makes much more sense, and, in fact, their approach is used by _most_. It's a well-known problem with well-known solutions that have been around for _years_...

>
>Perhaps the lack of this advanced feature, which is great if you take it into account, is the reason why the intel chipset is fast for you?

I don't have a clue what you are talking about...

>
>>That is called a "shadow lock" and is _the_ solution to keep spinning processes off the memory bus, so that there is no interference with other processors.
>
>>
>>>However, it can't be working only within the L2 cache, because it is a read straight from local memory.
>>
>>Then you don't have a real L2 cache there. Because the purpose of any cache is to _avoid_ going to real memory whenever possible, whether it is local or not.
>>
>>>Additionally, in all supercomputer designs each cache line of memory has a number of bits added to it recording who owns the cache line.
>>>
>>>You are correctly referring to the fact that if I do a write to the memory now, cpu1 is only referencing the local memory.
>>>
>>>So your assumption would be correct if each node were a single cpu.
>>>
>>>However, it is a dual and sometimes even a quad. This means, in short, that several cpu's, at least CPU2, are also hammering on the memory at the same time, and the bandwidth for this is only a fraction of what the 8MB L2 cache delivers. Every reference to main memory simply goes through the Hub, which delivers 600MB/s effectively.
>>
>>However, if a node is a quad, then each processor had _better_ have a real processor cache or the machine is _never_ going to perform, since the router speed can't even feed one cpu fully, much less 4... Better check the processor specs on that box, as I can't imagine MIPS doing something that poorly, based on past things they have produced.
>>
>>>We can both imagine that this is a major problem at supercomputers.
>>>
>>>Therefore the approach of letting them idle while not doing a job is the right one.
>>
>>I simply don't think they operate as you describe, otherwise they would perform like dogs, period. Each CPU _must_ have a local processor cache, or else a group of four cpus must have a 4-port local cache they share. They have to have _something_ to avoid memory I/O, even if it is local memory. Memory is _always_ slow. Local or not.
>>
>>>You are obviously correct if the system design were one cpu per node. That is not the case, however.
>>>
>>>Another effect is that the SN0 router is also attached to the Hub. I will not describe the effects here, as I have no theory to base myself upon, but from my experiments I saw that when one of the CPU's is poking into its own local memory, the SN0 router is somehow busy too, which simply slows down the entire search as it eats away bandwidth.
>>>
>>>My knowledge of what intelligent algorithms the SN0 router uses to predict things is very limited. Near to zero, actually. But somehow I feel it is not a good idea to keep the SN0 router busy, even indirectly, with reads to the local memory while CPU2 is searching its own way.
>>>
>>>So the absolutely best solution to the problem is pretty trivial, and that's simply letting processes that do not search idle.
>>
>>What is wrong with the above is the definition of "idle". If _your_ process is not running, what is the O/S doing? It is executing code all the time itself, the famous "idle loop"...
>>
>>>The real basic problem one is confronted with in general is that initially an n-processor partition idles a lot more than a dual or quad cpu box does.
>>>
>>>So we both hardly feel the fact that at the PC we are poking into a single cache line. At the supers this is a different thing, however.
>>>
>>>They do not like me wasting bandwidth at all :)
>>>
>>>>>Just imagine how much bandwidth that takes away.
>>>>>
>>>>>The machine has 1 TB a second bandwidth. This 512p partition has about half of that (the machine has 1024p in total). If all these procs are spinning around, that will confuse the SN0 routers completely.
>>>>
>>>>Not with cache.
>>>>
>>>>>From that 0.5 TB bandwidth, about 0.25 TB a second is reserved for harddisk i/o. I can't use it. I use the other 600MB/s, which the hub delivers for 2 processors, as memory bandwidth.
>>>>>
>>>>>We can all imagine what happens if the hubs are only busy delivering reads to the RAM.
>>>>>
>>>>>Right now each processor spins on its own shared memory variables; that's simply *not* a good design idea for NUMA. It is not working for any machine above 8 procs, actually.
>>>>
>>>>It would work fine on a Cray with 32...
>>>>
>>>>>Also with shared memory buses you can't work with that design at all.
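The "let processes that do not search idle" approach argued for above would, in pthreads terms, look something like the sketch below: helper threads block on a condition variable instead of spinning on shared variables. The names and the job structure are hypothetical, not Diep's actual code:

    #include <pthread.h>
    #include <stddef.h>

    /* Hypothetical work item handed to an idle helper. */
    struct job { struct job *next; /* ...split-point description... */ };

    static pthread_mutex_t job_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  job_cv   = PTHREAD_COND_INITIALIZER;
    static struct job *job_queue    = NULL;   /* protected by job_lock */

    /* A helper with nothing to search blocks here; it uses no bandwidth
       and no cpu time until post_job() wakes it. */
    struct job *get_job(void)
    {
        struct job *j;
        pthread_mutex_lock(&job_lock);
        while (job_queue == NULL)
            pthread_cond_wait(&job_cv, &job_lock);
        j = job_queue;
        job_queue = j->next;
        pthread_mutex_unlock(&job_lock);
        return j;
    }

    void post_job(struct job *j)
    {
        pthread_mutex_lock(&job_lock);
        j->next = job_queue;
        job_queue = j;
        pthread_mutex_unlock(&job_lock);
        pthread_cond_signal(&job_cv);          /* wake one idle helper */
    }

The trade-off the two sides are arguing over is latency versus bandwidth: a blocked thread costs a kernel wake-up when work finally arrives, while a spinning thread reacts almost immediately but, in Diepeveen's view, loads the NUMA interconnect; Hyatt's position is that cache-local spinning costs essentially nothing.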