Author: Robert Hyatt
Date: 08:46:34 09/26/02
On September 26, 2002 at 08:17:11, Vincent Diepeveen wrote:

>On September 25, 2002 at 16:10:51, Robert Hyatt wrote:
>
>>On September 25, 2002 at 14:45:01, Vincent Diepeveen wrote:
>>
>>>On September 25, 2002 at 14:17:55, Robert Hyatt wrote:
>>>
>>>>On September 25, 2002 at 11:49:25, Russell Reagan wrote:
>>>>
>>>>>On September 25, 2002 at 08:10:06, Vincent Diepeveen wrote:
>>>>>
>>>>>>I cannot use select() at all, as I limit myself to < 128 processor
>>>>>>partitions then.
>>>>>
>>>>>Does WaitForSingleObject or WaitForMultipleObjects allow you to use more
>>>>>than 128 processors?
>>>>>
>>>>>>Also I have no idea how to get it to work and whether it can do
>>>>>>it 400 times a second instantly.
>>>>>
>>>>>See the problems Microsoft causes? They always have to be different (in a
>>>>>bad, evil kind of way).
>>>>
>>>>The very concept of synchronizing > 128 processes with such a system call
>>>>defies any sort of logic I can think of. Doing it hundreds of times a
>>>>second only guarantees horrible performance. Doing this every now and
>>>>then would be ok. But not "400 times a second". There are other ways to
>>>>eliminate that kind of stuff...
>>>
>>>The first 10 ply you get out of the hashtable with just 1 processor,
>>>obviously.
>>>
>>>Only after that can the other 511 join in. They manage themselves after
>>>this. I don't want to let them poll while they are not searching. I have
>>>no option there.
>>
>>You simply don't understand memory architecture yet. "Polling" works
>>perfectly. You wait on a pointer to become non-zero, for example. While
>>you are "polling" it takes _no_ bandwidth, because the value lands in your
>>local cache. All the cache controllers have to do is inform each other
>>when something changes. I.e., this is how the Intel boxes work, and they
>>do that quite well.
>>
>>NUMA shouldn't change that basic idea...
>
>There is a good reason why it goes wrong even at NUMA.
>
>The basic problem is that at the NUMA machines, too, each node is
>2 processors and not 1; in the latter case you would be correct.

This varies. I have run on an alpha with 4 cpus per node. I have run on a
xeon with 1 cpu per "node" (non-NUMA). Intel is building multiple NUMA
platforms, although it seems they are settling on 4 cpus per node as the
basic building block, something like N-cube did years ago.

>We can picture this dual-cpu node as follows:
>
>   CPU1 -----
>            |--- HUB --| Memory banks for CPU1 & CPU2
>   CPU2 -----    |
>                 |
>            SN0 router
>
>So if 1 cpu is hammering on a single cache line in local memory,
>it is using the hub to access the memory banks. This is true for
>the R14000 design, which has 2GB local memory at the memory banks
>and 8MB L2 cache.

That should _not_ work like that. If a cpu is hammering on a single cache
line, it should not even be talking to the hub. It should be talking to
its _local_ cache. That's the point of what are often called "shadow"
operations. You make a "shadow" of something by putting it in cache, to
get off the system bus. That is how my spinlocks work on my quad xeon, in
fact:

#  define Lock(p)  while (exchange(&p,1)) while (p)

The "exchange" is horribly slow on the intel box, because it locks the bus
down tight while reading/writing to accomplish the exchange. But it does
it in an atomic manner, which is required for true locks. However, if the
exchange says "value is already non-zero", I then do a simple while(p),
which spins on the value of p in the L1 cache. No bus activity at all
while I am spinning. When _another_ processor writes to p, my cache
controller finds out and invalidates that cache line. That causes it to do
a cache line fill to get the new value. If that value is zero, the
while(p) terminates, and we go back to try a real exchange(&p,1)
atomically. That is called a "shadow lock" and is _the_ solution to keep
spinning processes off the memory bus, so that there is no interference
with other processors.
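For concreteness, here is a minimal sketch of that test-and-test-and-set
("shadow lock") pattern in C. The exchange() macro above is Crafty's own;
this sketch substitutes GCC's __sync_lock_test_and_set builtin as a
stand-in atomic exchange, which is an assumption for illustration, not the
actual Crafty code:

  /* Minimal test-and-test-and-set ("shadow lock") sketch.
     __sync_lock_test_and_set is a stand-in for the exchange()
     macro quoted above -- an assumption, not Crafty's real code. */
  typedef volatile int lock_t;

  static void acquire(lock_t *p) {
    /* The atomic exchange locks the bus, so attempt it only when
       the lock looks free. */
    while (__sync_lock_test_and_set(p, 1)) {
      /* Plain reads spin on the copy in the local cache: no bus
         traffic until another cpu writes *p and the coherence
         protocol invalidates the line. */
      while (*p)
        ;
    }
  }

  static void release(lock_t *p) {
    __sync_lock_release(p);   /* atomically store 0, freeing the lock */
  }

The inner while (*p) loop is what keeps the waiters off the bus; only the
transition from "looks free" back to the atomic exchange generates bus
traffic.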
>However it can't be working only within the L2 cache, because it is
>a read straight from local memory.

Then you don't have a real L2 cache there, because the purpose of any
cache is to _avoid_ going to real memory whenever possible, whether that
memory is local or not.

>Additionally, in all supercomputer designs each cache line of memory
>has a number of bits added to it recording who owns the cache line.
>
>You are correctly referring to the fact that if I do a write to the
>memory now, cpu1 is only referencing the local memory.
>
>So your assumption would be correct if each node were a single cpu.
>
>However it is a dual, and sometimes even a quad. This means, in short,
>that several cpu's, at least CPU2, are hammering on the same memory at
>the same time, and the bandwidth from this is only a fraction of what
>the 8MB L2 cache delivers. Every reference to main memory simply goes
>through the hub, which delivers 600MB/s effectively.

However, if a node is a quad, then each processor had _better_ have a real
processor cache or the machine is _never_ going to perform, since the
router speed can't even feed one cpu fully, much less 4... Better check
the processor specs on that box, as I can't imagine MIPS doing something
that poorly, based on past things they have produced.

>We can both imagine that this is a major problem at supercomputers.
>
>Therefore the approach of letting them idle while not doing a job is
>the right one.

I simply don't think they operate as you describe; otherwise they would
perform like dogs, period. Each CPU _must_ have a local processor cache,
or else a group of four cpus must have a 4-port local cache they share.
They have to have _something_ to avoid memory I/O, even if it is local
memory. Memory is _always_ slow. Local or not.

>You are obviously correct if the system design were one cpu per node.
>That is not the case however.
>
>Another effect is that the SN0 router is also attached to the hub. I
>will not describe the effects here, as I have no theory to base myself
>upon, but from my experiments I saw that if one of the CPU's is poking
>into its own local memory, the SN0 router is somehow busy too, which
>simply slows down the entire search as it eats away bandwidth.
>
>My knowledge of what intelligent algorithms the SN0 router uses to
>predict things is very limited. Near to zero, actually. But somehow I
>feel it is not a good idea to keep the SN0 router busy indirectly with
>reads to the local memory while CPU2 is searching its own way.
>
>So the absolutely best solution to the problem is pretty trivial, and
>that's simply letting processes that do not search idle.

What is wrong with the above is the definition of "idle". If _your_
process is not running, what is the O/S doing? It is executing code all
the time itself, the famous "idle loop"...
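The two positions in this exchange correspond to two standard ways for a
worker to wait: spin on a shared variable (which, as described above, sits
in the local cache until a remote write invalidates it) or block in the
kernel until signalled. Below is a hedged sketch of both, using POSIX
threads; the names work_pointer and queued_work are hypothetical, and
neither function is taken from Crafty or Diep:

  #include <pthread.h>
  #include <stddef.h>

  /* Spinning wait: reads of work_pointer hit the local cache, so the
     loop generates no memory traffic until the master's write
     invalidates the cache line. */
  volatile void *work_pointer = NULL;

  void spin_wait_for_work(void) {
    while (work_pointer == NULL)
      ;   /* burns a cpu, but stays off the bus */
  }

  /* Blocking wait: the thread sleeps in the kernel until signalled,
     consuming no cycles but paying a scheduler round trip (and thus
     latency) on every wakeup. */
  pthread_mutex_t work_mutex = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t  work_ready = PTHREAD_COND_INITIALIZER;
  void *queued_work = NULL;

  void block_wait_for_work(void) {
    pthread_mutex_lock(&work_mutex);
    while (queued_work == NULL)
      pthread_cond_wait(&work_ready, &work_mutex);
    pthread_mutex_unlock(&work_mutex);
  }

At hundreds of handoffs per second, the spinning version trades idle cpu
time for low wakeup latency; the blocking version does the reverse.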
>The real basic problem one is confronted with in general is that
>initially an n-processor partition idles a lot more than a dual or quad
>cpu box does.
>
>So we both hardly feel the fact that at the PC we are poking into a
>single cache line. At the supers this is a different thing, however.
>
>They do not like me wasting bandwidth at all :)
>
>>>Just imagine how much bandwidth that takes away.
>>>
>>>The machine has 1 TB a second bandwidth. This 512p partition has about
>>>half of that (the machine has 1024p in total). If all these procs are
>>>spinning around, that will confuse the SN0 routers completely.
>>
>>Not with cache.
>>
>>>From that 0.5 TB bandwidth, about 0.25TB a second is reserved for
>>>harddisk i/o. I can't use it. I use the other 600MB/s which the hub
>>>delivers for 2 processors as memory bandwidth.
>>>
>>>We all can imagine what happens if the hubs are only busy delivering
>>>reads to the RAM.
>>>
>>>Right now each processor spins on its own shared memory variables;
>>>that's simply *not* a good design idea for NUMA. It does not actually
>>>work on any machine above 8 procs.
>>
>>It would work fine on a Cray with 32...
>>
>>>Also, shared memory buses you can't work with that design at all.