Computer Chess Club Archives

Subject: Re: UNIX question: WaitForSingleObject() under IRIX/Linux

Author: Vincent Diepeveen

Date: 08:26:28 09/29/02


On September 27, 2002 at 23:50:28, Robert Hyatt wrote:

See www.top500.org for this supercomputer and many others.

Obviously a supercomputer is defined by having huge bandwidth,
especially for i/o. Isn't 1 terabyte a second enough?

Anyway, this is distracting again.

Your definition of a supercomputer is very much limited by your own
programming capabilities!!

>On September 27, 2002 at 12:35:39, Vincent Diepeveen wrote:
>
>>On September 26, 2002 at 11:46:34, Robert Hyatt wrote:
>>
>>>On September 26, 2002 at 08:17:11, Vincent Diepeveen wrote:
>>>
>>>>On September 25, 2002 at 16:10:51, Robert Hyatt wrote:
>>>>
>>>>>On September 25, 2002 at 14:45:01, Vincent Diepeveen wrote:
>>>>>
>>>>>>On September 25, 2002 at 14:17:55, Robert Hyatt wrote:
>>>>>>
>>>>>>>On September 25, 2002 at 11:49:25, Russell Reagan wrote:
>>>>>>>
>>>>>>>>On September 25, 2002 at 08:10:06, Vincent Diepeveen wrote:
>>>>>>>>
>>>>>>>>>i cannot use select() at all, as i would then limit myself to < 128
>>>>>>>>>processor partitions.
>>>>>>>>
>>>>>>>>Does WaitForSingleObject or WaitForMultipleObjects allow you to use more than
>>>>>>>>128 processors?
>>>>>>>>
>>>>>>>>>also i have no idea how to get it to work and whether it can do
>>>>>>>>>it 400 times a second instantly.
>>>>>>>>
>>>>>>>>See the problems Microsoft causes? They always have to be different (in a bad
>>>>>>>>evil kind of way).
>>>>>>>
>>>>>>>
>>>>>>>The very concept of synchronizing > 128 processes with such a system call
>>>>>>>defies any sort of logic I can think of.  Doing it hundreds of times a second
>>>>>>>only guarantees horrible performance.  Doing this every now and then would
>>>>>>>be ok.  But not "400 times a second".  There are other ways to eliminate that
>>>>>>>kind of stuff...
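
(Side note: WaitForMultipleObjects() tops out at MAXIMUM_WAIT_OBJECTS, i.e. 64
handles per call, so waiting on hundreds of per-cpu events with it doesn't work
anyway. On the POSIX side, one way to park hundreds of idle searchers and wake
them all with a single call is a condition variable. A minimal sketch, with
made-up names like work_ready; not anyone's actual engine code:

  #include <pthread.h>

  static pthread_mutex_t work_mutex = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  work_cond  = PTHREAD_COND_INITIALIZER;
  static int work_ready = 0;                 /* guarded by work_mutex */

  /* each idle searcher blocks here; it burns no cpu and no bandwidth
     while it waits */
  void wait_for_work(void) {
    pthread_mutex_lock(&work_mutex);
    while (!work_ready)
      pthread_cond_wait(&work_cond, &work_mutex);
    pthread_mutex_unlock(&work_mutex);
  }

  /* the master wakes every waiter with one broadcast, however many
     hundreds of them there are */
  void release_workers(void) {
    pthread_mutex_lock(&work_mutex);
    work_ready = 1;
    pthread_cond_broadcast(&work_cond);
    pthread_mutex_unlock(&work_mutex);
  }

Doing that a few hundred times a second should be cheap, as long as the
waiters really block instead of spinning.)
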
>>>>>>
>>>>>>the first 10 plies you get out of the hashtable with just 1 processor, obviously.
>>>>>>
>>>>>>Only after that can the other 511 join in. They manage themselves after
>>>>>>that. I don't want to let them poll while they are not searching. I have
>>>>>>no option there.
>>>>>
>>>>>You simply don't understand memory architecture yet.  "polling" works
>>>>>perfectly.  You wait on a pointer to become non-zero for example.  While
>>>>>you are "polling" it takes _no_ bandwidth because the value lands in your
>>>>>local cache.  All the cache controllers have to do is inform each other when
>>>>>something changes.  IE this is how the Intel boxes work and they do that quite
>>>>>well.
>>>>>
>>>>>NUMA shouldn't change that basic idea...
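
(What Bob describes, spelled out in rough C; just a sketch of the idea with a
made-up name, not his actual code:

  /* the waiting cpu spins on a word that sits in its own cache;
     the bus is only touched when another cpu actually writes the word
     and the cache line gets invalidated and re-filled */
  void spin_until_nonzero(volatile long *p) {
    while (*p == 0)
      ;          /* reads are satisfied from local cache while spinning */
  }

The reads cost nothing on the interconnect as long as nobody writes to *p.)
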
>>>>
>>>>There is a good reason why it goes wrong even on NUMA.
>>>>
>>>>The basic problem is that on the NUMA machines each node is
>>>>2 processors and not 1; in the latter case you would be correct.
>>>
>>>This varies.  I have run on an alpha with 4 cpus per node.  I have run on
>>>a xeon with 1 cpu per "node" (non-numa).  Intel is building multiple NUMA
>>>platforms, although it seems they are settling in to 4 cpus per node as the
>>>basic building block, something like N-cube did years ago.
>>>
>>>>
>>>>We can see this dual cpu as next:
>>>>
>>>>  CPU1 -----
>>>>           |--- HUB --| Memory banks for CPU1 & CPU2
>>>>  CPU2 -----     |
>>>>                 |
>>>>                SN0 router
>>>>
>>>>So if 1 cpu is hammering on a single cache line in local memory
>>>>it is using the hub to access the memory banks. This is true for
>>>>the R14000 design which has 2GB local memory at the memory banks
>>>>and 8MB L2 cache.
>>>
>>>That should _not_ be working like that.  If a cpu is hammering on a single
>>>cache line, it should not even be talking to the hub.  It should be talking
>>>to its _local_ cache.  That's the point of what are often called "shadow"
>>
>>As you know one must prevent race conditions. Especially hardware
>>race conditions.
>>
>>Supercomputers can therefore not do this PC thing Bob.
>
>Vincent,  the machine you are using is _not_ a super-computer.  It is a
>collection of off-the-shelf MIPS microprocessors, built into a NUMA-based
>cluster.
>
>Please use the _right_ term.  A cray is a super-computer.  Or a hitachi, or
>a fujitsu.  Not a MIPS-based machine.
>
>as far as this "pc" thing, _your_ company might not do it.  _others_ do.
>it is a well-known approach to cache coherency and _nobody_ tolerates race
>conditions...
>
>>
>>In order to prevent race conditions in hardware, everything first talks to
>>the hub, and the hub then talks to the memory.
>>
>>In that way the SN0 routers are not faced with race conditions from processors
>>poking directly into memory.
>>
>>So every memory reference goes through the hub. i say 'hub' because that's
>>what SGI calls it.
>>
>>>operations.  You make a "shadow" of something by putting it in cache, to get
>>>off the system bus.  that is how my spinlocks work on my quad xeon in fact:
>>>
>>>#  define Lock(p)               while(exchange(&p,1)) while(p)
>>>
>>>The "exchange" is horribly slow on the intel box, because it locks the bus
>>>down tight while reading/writing to accomplish the exchange.  But it does it
>>>in an atomic manner which is required for true locks.
>>
>>>however, if the exchange says "value is already non-zero" I then do a simple
>>>while(p) which spins on the value of P in the L1 cache.  No bus activity at
>>>all while I am spinning.  when _another_ processor writes to p, my cache
>>>controller finds out and invalidates that cache line.  that causes it to
>>>do a cache line fill to get the new value.  If that value is zero, the
>>>while(P) terminates and we go back to try a real exchange(P,1) atomically.
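
(Spelled out, that macro is a "test-and-test-and-set" lock.  A rough
equivalent using GCC's __sync builtins, just to illustrate the same idea;
not Crafty's actual code:

  /* try to grab the lock with an atomic exchange; if it is already held,
     spin on the plain read until it looks free, then try the exchange
     again.  the plain read spins in the local cache, off the bus.  */
  void acquire(volatile int *lock) {
    while (__sync_lock_test_and_set(lock, 1))  /* atomic exchange, old value */
      while (*lock)
        ;                                      /* spin in cache, no bus traffic */
  }

  void release(volatile int *lock) {
    __sync_lock_release(lock);                 /* reset to 0 */
  }

Only the exchange itself ever locks the bus; the inner while(*lock) stays in
the local cache until the owner's write invalidates the line.)
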
>>
>>At this big supercomputer the hub sits in between,
>>connected to the SN0 router, which needs the cache
>>lines too, in an atomic way.
>>
>>Locking in NUMA preferably happens in the shared memory (see the advice in
>>the manuals) with atomic operations; they take care that it goes ok.
>>
>>In this case we are not busy locking anyhow. We are simply trying to
>>get a cache line over and over again, and the processor can issue so many
>>requests a second that, provably in software, it keeps the SN0 or the other
>>cpu very busy.
>>
>>Obviously, your whole theory might hold some truth for the pc. I'm not
>>sure here. I'm not sure whether it is a smart idea to do it on the pc
>>either, cache coherency or not. The chipset of the dual K7 can also deliver
>>only 600MB/s (i do not see how this will change soon for big systems
>>in the future either; the Hammer seems to have a good alternative for up
>>to 8 processors, though i would be happy already if they manage
>>to get it to work dual within a year or 2). In short, the other cpu
>>can't do a thing with the chipset when it's just busy poking into the
>>other cpu. Using L2 cache coherency and/or the chipset, i don't care; it's
>>keeping it busy *somehow*.
>
>
>No it isn't, but I am not going to waste a lot more time trying to explain
>how modern multiprocessor machines take care of cache coherency.  It is a
>well-known issue, and it has been solved by everyone.  And it doesn't require
>that a cpu accessing data in its local cache keep other cpus busy asking
>"what are you doing now?" type questions...  Find a good architecture book,
>look for multiprocessor architectures and cache coherency in the index, and
>dive in.  It isn't that complicated...
>
>
>>
>>Any cpu can deliver up to 600MB/s easily; this is *no problem*.
>>
>>Suppose the next thing. We have 2 players in a field. One player is
>>at maximum speed shipping tells to the other player, which MUST get answered:
>>I WANT TO KNOW WHAT HAPPENS ABOUT THIS.
>>
>>How they solve this in the PC i don't know; ignore it?
>
>They design _around_ it so that it doesn't happen...
>
>
>>
>>I know the intels have a somewhat more primitive way of handling such
>>requests than the supercomputers/amd chipset.
>
>When you run on a supercomputer, tell me more.  I have run on every
>"supercomputer" ever made.  :)
>
>>
>>Alpha, R14000, AMD: they all use a master/slave idea. Each cpu can set
>>ownership on a cache line. A great system. I know the intel Xeon chipset
>>didn't take this into account. I guess the P4 Xeons don't either?
>
>
>Nope, the intel approach makes much more sense, and, in fact, their approach
>is used by _most_.  It's a well-known problem with well-known solutions that
>have been around for _years_...
>
>
>
>>
>>Perhaps the lack of this advanced feature, which is great if you take
>>it into account, is the reason why the intel chipset is fast for you?
>
>
>I don't have a clue what you are talking about...
>
>
>
>>
>>>That is called a "shadow lock" and is _the_ solution to keep spinning processes
>>>off the memory bus so that there is no interference with other processors.
>>
>>>
>>>
>>>
>>>>
>>>>However it can't be working only within the L2 cache, because it is
>>>>a read straight from local memory.
>>>
>>>Then you don't have a real L2 cache there.  Because the purpose of any
>>>cache is to _avoid_ going to real memory whenever possible, whether it is
>>>local or not.
>>>
>>>>
>>>>Additionally, in all supercomputer designs each cache line of memory
>>>>has a number of bits added to it indicating who owns the cache line.
>>>>
>>>>You are correctly referring to the fact that if i do a write to the
>>>>memory now, cpu1 is only referencing the local memory.
>>>>
>>>>So your assumption is correct if each node was a single cpu.
>>>>
>>>>However it is a dual and sometimes even a quad. This means, in short,
>>>>that more cpu's, at least CPU2, are also hammering in the memory at the
>>>>same time, and the bandwidth from this is only a fraction of
>>>>what the 8MB L2 cache delivers. Every reference to main memory simply goes
>>>>through the Hub, which delivers 600MB/s effectively.
>>>
>>>However, if a node is a quad, then each processor had _better_ have a real
>>>processor cache or the machine is _never_ going to perform, since the
>>>router speed can't even feed one cpu fully, much less 4...  Better check the
>>>processor specs on that box as I can't imagine MIPS doing something that
>>>poorly, based on past things they have produced.
>>>
>>>>
>>>>We can both imagine that this is a major problem at supercomputers.
>>>>
>>>>Therefore the approach to let them idle while not doing a job is
>>>>the right one.
>>>
>>>
>>>I simply don't think they operate as you describe, otherwise they will
>>>perform like dogs, period.  Each CPU _must_ have a local processor cache,
>>>or else a group of four cpus must have a 4-port local cache they share.
>>>They have to have _something_ to avoid memory I/O even if it is local memory.
>>>Memory is _always_ slow.  Local or not.
>>>
>>>
>>>
>>>
>>>>
>>>>You would obviously be correct if the system design were a single cpu
>>>>per node. That is not the case however.
>>>>
>>>>Another effect is that the SN0 router is also attached to the Hub.
>>>>I will not describe the effects here as i have no theory to base myself
>>>>upon, but from my experiments i saw that if one of the CPU's is poking
>>>>into its own local memory, the SN0 router is somehow busy too with
>>>>things, which simply slows down the entire search as it eats away
>>>>bandwidth.
>>>>
>>>>My knowledge of what intelligent algorithms the SN0 router uses
>>>>to predict things is very limited. Near to zero actually. But somehow
>>>>i feel it is not a good idea to keep the SN0 router also busy indirectly
>>>>with reads to the local memory while CPU2 is searching its own way.
>>>>
>>>>So the absolutely best solution to the problem is pretty trivial, and that's
>>>>simply letting the processes that are not searching idle.
>>>
>>>
>>>What is wrong with the above is the definition of "idle".  If _your_ process
>>>is not running, what is the O/S doing?  It is executing code all the time
>>>itself, the famous "idle loop"...
>>>
>>>
>>>>
>>>>The real basic problem one is confronted with in general is that initially
>>>>an n-processor partition idles a lot more than a dual or quad cpu box
>>>>does.
>>>>
>>>>So we both hardly feel the fact that at the PC we are poking into a single
>>>>cache line. At the supers this is a different thing however.
>>>>
>>>>They do not like me wasting bandwidth at all :)
>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>Just imagine how much bandwidth that takes away.
>>>>>>
>>>>>>The machine has 1 TB a second bandwidth. This 512p partition has
>>>>>>about half of that (machine has 1024p in total). If all these procs are
>>>>>>spinning around that will confuse the SN0 routers completely.
>>>>>
>>>>>Not with cache.
>>>>>
>>>>>
>>>>>>
>>>>>>From that 0.5 TB bandwidth about 0.25TB a second is reserved for harddisk
>>>>>>i/o. I can't use it. I use the other 600MB/s which the hub delivers for
>>>>>>2 processors for memory bandwidth.
>>>>>>
>>>>>>We all can imagine what happens if the hubs are only busy delivering reads
>>>>>>to the RAM.
>>>>>>
>>>>>>Right now each processor spins on its own shared memory variables;
>>>>>>that's simply *not* a good design idea for NUMA. It actually doesn't work
>>>>>>for any machine above 8 procs.
>>>>>
>>>>>It would work fine on a Cray with 32...
>>>>>
>>>>>
>>>>>>
>>>>>>On shared memory buses too, you can't work with that design at all.


