Computer Chess Club Archives



Subject: Re: UNIX question: WaitForSingleObject() under IRIX/Linux

Author: Robert Hyatt

Date: 20:50:28 09/27/02



On September 27, 2002 at 12:35:39, Vincent Diepeveen wrote:

>On September 26, 2002 at 11:46:34, Robert Hyatt wrote:
>
>>On September 26, 2002 at 08:17:11, Vincent Diepeveen wrote:
>>
>>>On September 25, 2002 at 16:10:51, Robert Hyatt wrote:
>>>
>>>>On September 25, 2002 at 14:45:01, Vincent Diepeveen wrote:
>>>>
>>>>>On September 25, 2002 at 14:17:55, Robert Hyatt wrote:
>>>>>
>>>>>>On September 25, 2002 at 11:49:25, Russell Reagan wrote:
>>>>>>
>>>>>>>On September 25, 2002 at 08:10:06, Vincent Diepeveen wrote:
>>>>>>>
>>>>>>>>I cannot use select() at all, as that would limit me to partitions of
>>>>>>>>fewer than 128 processors.
>>>>>>>
>>>>>>>Does WaitForSingleObject or WaitForMultipleObjects allow you to use more than
>>>>>>>128 processors?
>>>>>>>
>>>>>>>>Also, I have no idea how to get it to work, or whether it can do
>>>>>>>>it 400 times a second without delay.
>>>>>>>
>>>>>>>See the problems Microsoft causes? They always have to be different (in a bad
>>>>>>>evil kind of way).
>>>>>>
>>>>>>
>>>>>>The very concept of synchronizing > 128 processes with such a system call
>>>>>>defies any sort of logic I can think of.  Doing it hundreds of times a second
>>>>>>only guarantees horrible performance.  Doing this every now and then would
>>>>>>be ok.  But not "400 times a second".  There are other ways to eliminate that
>>>>>>kind of stuff...
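
For the subject-line question itself, the closest POSIX analogue to
WaitForSingleObject() on IRIX/Linux is a pthread condition variable: a helper
that sleeps on one costs no cpu time and no bandwidth until it is signalled.
A minimal sketch, with invented names (work_lock, work_ready, have_work) that
are not taken from any engine:

  #include <pthread.h>

  static pthread_mutex_t work_lock  = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  work_ready = PTHREAD_COND_INITIALIZER;
  static int have_work = 0;

  /* an idle helper blocks here, in the kernel, without spinning */
  void wait_for_work(void)
  {
    pthread_mutex_lock(&work_lock);
    while (!have_work)                 /* loop guards against spurious wakeups */
      pthread_cond_wait(&work_ready, &work_lock);
    have_work = 0;
    pthread_mutex_unlock(&work_lock);
  }

  /* the master publishes a job and wakes exactly one sleeping helper */
  void post_work(void)
  {
    pthread_mutex_lock(&work_lock);
    have_work = 1;
    pthread_cond_signal(&work_ready);
    pthread_mutex_unlock(&work_lock);
  }
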
>>>>>
>>>>>The first 10 plies you obviously get out of the hash table with just 1 processor.
>>>>>
>>>>>Only after that can the other 511 join in. They manage themselves after
>>>>>this. I don't want to let them poll while they are not searching. I have
>>>>>no other option there.
>>>>
>>>>You simply don't understand memory architecture yet.  "Polling" works
>>>>perfectly.  You wait on a pointer to become non-zero, for example.  While
>>>>you are "polling" it takes _no_ bandwidth, because the value lands in your
>>>>local cache.  All the cache controllers have to do is inform each other when
>>>>something changes.  I.e., this is how the Intel boxes work, and they do that
>>>>quite well.
>>>>
>>>>NUMA shouldn't change that basic idea...
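
As a rough sketch of the polling being described, assuming a hypothetical
shared pointer named work that another processor sets when it has something
to hand off (the names here are made up for illustration):

  /* While this loop spins, every read of 'work' is satisfied from the local
     cache, so it generates no bus or interconnect traffic.  Only when another
     cpu writes 'work' does the coherency protocol invalidate the line and
     force a single refill. */
  static void * volatile work = 0;     /* volatile: re-read it on every pass */

  void wait_for_pointer(void)
  {
    while (work == 0)
      ;                                /* spin entirely in the local cache */
    /* 'work' now points at whatever another processor published */
  }
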
>>>
>>>There is a good reason why it goes wrong even on NUMA.
>>>
>>>The basic problem is that on these NUMA machines each node
>>>is 2 processors and not 1; in the latter case you would be correct.
>>
>>This varies.  I have run on an alpha with 4 cpus per node.  I have run on
>>a xeon with 1 cpu per "node" (non-numa).  Intel is building multiple NUMA
>>platforms, although it seems they are settling in to 4 cpus per node as the
>>basic building block, something like N-cube did years ago.
>>
>>>
>>>We can picture this dual-cpu node as follows:
>>>
>>>  CPU1 -----
>>>           |--- HUB --| Memory banks for CPU1 & CPU2
>>>  CPU2 -----     |
>>>                 |
>>>                SN0 router
>>>
>>>So if one cpu is hammering on a single cache line in local memory,
>>>it is using the hub to access the memory banks. This is true for
>>>the R14000 design, which has 2GB of local memory at the memory banks
>>>and an 8MB L2 cache.
>>
>>That should _not_ work like that.  If a cpu is hammering on a single
>>cache line, it should not even be talking to the hub.  It should be talking
>>to its _local_ cache.  That's the point of what are often called "shadow"
>
>As you know, one must prevent race conditions, especially hardware
>race conditions.
>
>Supercomputers therefore cannot do this PC thing, Bob.

Vincent, the machine you are using is _not_ a super-computer.  It is a
collection of off-the-shelf MIPS microprocessors, built into a NUMA-based
cluster.

Please use the _right_ term.  A Cray is a super-computer.  Or a Hitachi, or
a Fujitsu.  Not a MIPS-based machine.

As far as this "pc" thing goes, _your_ company might not do it.  _Others_ do.
It is a well-known approach to cache coherency, and _nobody_ tolerates race
conditions...

>
>In order to prevent race conditions in hardware, everything first talks to
>the hub, and the hub then talks to the memory.
>
>That way the SN0 routers do not face race conditions from processors
>poking directly into memory.
>
>So every memory reference goes through the hub. I say 'hub' because that is
>what SGI calls it.
>
>>operations.  You make a "shadow" of something by putting it in cache, to get
>>off the system bus.  That is how my spinlocks work on my quad xeon, in fact:
>>
>>#  define Lock(p)               while(exchange(&p,1)) while(p)
>>
>>The "exchange" is horribly slow on the intel box, because it locks the bus
>>down tight while reading/writing to accomplish the exchange.  But it does it
>>in an atomic manner which is required for true locks.
>
>>However, if the exchange says "value is already non-zero" I then do a simple
>>while(p), which spins on the value of p in the L1 cache.  No bus activity at
>>all while I am spinning.  When _another_ processor writes to p, my cache
>>controller finds out and invalidates that cache line.  That causes it to
>>do a cache line fill to get the new value.  If that value is zero, the
>>while(p) terminates and we go back to try a real exchange(p,1) atomically.
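
A self-contained sketch of that test-and-test-and-set idea, assuming gcc-style
inline assembly on x86; this exchange() is one plausible implementation, not
necessarily the exact code behind the macro quoted above:

  typedef volatile int lock_t;

  /* atomic exchange: xchg with a memory operand is implicitly locked on x86 */
  static inline int exchange(lock_t *p, int v)
  {
    __asm__ __volatile__("xchgl %0, %1"
                         : "=r"(v), "+m"(*p)
                         : "0"(v)
                         : "memory");
    return v;
  }

  /* try the expensive atomic exchange; if the lock is already held, spin on
     the plain read, which stays in the local cache until the holder stores 0 */
  #define Lock(p)    while (exchange(&(p), 1)) while (p)
  #define Unlock(p)  ((p) = 0)

  /* usage:  lock_t hash_lock = 0;
             Lock(hash_lock);  ... critical section ...  Unlock(hash_lock);  */
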
>
>On this big supercomputer the hub sits in between,
>connected to the SN0 router, which also needs to move
>cache lines in an atomic way.
>
>Locking on NUMA preferably happens in shared memory (see the advice in the
>manuals); as long as atomic operations take care of it, it goes ok.
>
>In this case we are not busy locking anyhow. We are simply trying to
>get a cache line over and over again, and the processor can issue so many
>requests a second that, provably in software, it keeps the SN0 router or the
>other cpu very busy.
>
>Obviously, your whole theory might hold some truth for the pc. I'm not
>sure here. I'm not sure whether it is a smart idea to do it on the pc
>either, cache coherency or not. The chipset of the dual K7 can also
>deliver only 600MB/s (I do not see how this will change soon for big
>systems either; the Hammer seems to have a good alternative for up
>to 8 processors, though I would already be happy if they manage
>to get it to work as a dual within a year or 2). In short, the other cpu
>can't do a thing with the chipset when it is just busy poking into the
>other cpu. Whether via L2 cache coherency and/or the chipset, I don't care:
>it is keeping it busy *somehow*.


No it isn't, but I am not going to waste a lot more time trying to explain
how modern multiprocessor machines take care of cache coherency.  It is a
well-known issue, and it has been solved by everyone.  And it doesn't require
that a cpu accessing data in its local cache keep other cpus busy asking
"what are you doing now?" type questions...  Find a good architecture book,
look up multiprocessor architectures and cache coherency in the index, and
dive in.  It isn't that complicated...


>
>Any cpu can deliver up to 600MB/s easily; this is *no problem*.
>
>Now suppose the following. We have 2 players on a field. One player is,
>at maximum speed, sending messages to the other player that MUST be answered:
>I WANT TO KNOW WHAT HAPPENS WITH THIS.
>
>How they solve this in the PC I don't know; do they ignore it?

They design _around_ it so that it doesn't happen...


>
>I know the Intels have a somewhat more primitive way of handling such
>requests than the supercomputers/AMD chipset.

When you run on a supercomputer, tell me more.  I have run on every
"supercomputer" ever made.  :)

>
>Alpha, R14000, AMD: they all use a master/slave idea. Each cpu can claim
>ownership of a cache line. A great system. I know the Intel Xeon chipset
>didn't take this into account. I guess the P4 Xeons don't either?


Nope, the intel approach makes much more sense, and, in fact, their approach
is used by _most_.  It's a well-known problem with well-known solutions that
have been around for _years_...



>
>Perhaps the lack of this advanced feature, which is great if you take
>it into account, is the reason why the Intel chipset is fast for you?


I don't have a clue what you are talking about...



>
>>That is called a "shadow lock" and is _the_ solution to keep spinning processes
>>off the memory bus so that there is no interference with other processors.
>
>>
>>
>>
>>>
>>>However, it can't be working only within the L2 cache, because it is
>>>a read straight from local memory.
>>
>>Then you don't have a real L2 cache there.  Because the purpose of any
>>cache is to _avoid_ going to real memory whenever possible, whether it is
>>local or not.
>>
>>>
>>>Additionally, in all supercomputer designs each cache line of memory
>>>has a number of bits added to it recording who owns the cache line.
>>>
>>>You are correctly referring to the fact that if I do a write to
>>>memory now, CPU1 is only referencing the local memory.
>>>
>>>So your assumption would be correct if each node were a single cpu.
>>>
>>>However, it is a dual and sometimes even a quad. This means, in short,
>>>that several cpus, at least CPU2, are also hammering on the memory at
>>>the same time, and the bandwidth from this is only a fraction of
>>>what the 8MB L2 cache delivers. Every reference to main memory simply
>>>goes through the hub, which delivers 600MB/s effectively.
>>
>>However, if a node is a quad, then each processor had _better_ have a real
>>processor cache or the machine is _never_ going to perform, since the
>>router speed can't even feed one cpu fully, much less 4...  Better check the
>>processor specs on that box as I can't imagine MIPS doing something that
>>poorly, based on past things they have produced.
>>
>>>
>>>We can both imagine that this is a major problem on supercomputers.
>>>
>>>Therefore the approach of letting them idle while not doing a job is
>>>the right one.
>>
>>
>>I simply don't think they operate as you describe, otherwise they would
>>perform like dogs, period.  Each CPU _must_ have a local processor cache,
>>or else a group of four cpus must have a 4-port local cache they share.
>>They have to have _something_ to avoid memory I/O even if it is local memory.
>>Memory is _always_ slow.  Local or not.
>>
>>
>>
>>
>>>
>>>You would obviously be correct if the system design were such that each
>>>node is a single cpu. That is not the case, however.
>>>
>>>Another effect is that the SN0 router is also attached to the hub.
>>>I will not describe the effects here, as I have no theory to base myself
>>>on, but from my experiments I saw that if one of the CPUs is poking
>>>into its own local memory, the SN0 router is somehow busy too with
>>>things, which slows down the entire search simply because it eats away
>>>bandwidth.
>>>
>>>My knowledge of what intelligent algorithms the SN0 router uses
>>>to predict things is very limited. Near zero, actually. But somehow
>>>I feel it is not a good idea to also keep the SN0 router busy indirectly
>>>with reads to local memory while CPU2 is searching its own way.
>>>
>>>So the absolute best solution to the problem is pretty trivial, and that is
>>>simply letting processes that are not searching sit idle.
>>
>>
>>What is wrong with the above is the definition of "idle".  If _your_ process
>>is not running, what is the O/S doing?  It is executing code all the time
>>itself, the famous "idle loop"...
>>
>>
>>>
>>>The real basic problem one is confronted with in general is that initially
>>>an n-processor partition idles a lot more than a dual or quad cpu box
>>>does.
>>>
>>>So we both hardly feel the fact that on the PC we are poking into a single
>>>cache line. On the supers this is a different thing, however.
>>>
>>>They do not like me wasting bandwidth at all :)
>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>Just imagine how much bandwidth that takes away.
>>>>>
>>>>>The machine has 1 TB/s of bandwidth. This 512p partition has
>>>>>about half of that (the machine has 1024p in total). If all these procs are
>>>>>spinning around, that will confuse the SN0 routers completely.
>>>>
>>>>Not with cache.
>>>>
>>>>
>>>>>
>>>>>From that 0.5 TB/s of bandwidth, about 0.25 TB/s is reserved for hard-disk
>>>>>i/o. I can't use it. I use the other 600MB/s, which the hub delivers for
>>>>>2 processors, as memory bandwidth.
>>>>>
>>>>>We can all imagine what happens if the hubs are only busy delivering reads
>>>>>to the RAM.
>>>>>
>>>>>Right now each processor spins on its own shared memory variables;
>>>>>that is simply *not* a good design idea for NUMA. It does not actually
>>>>>work for any machine above 8 procs.
>>>>
>>>>It would work fine on a Cray with 32...
>>>>
>>>>
>>>>>
>>>>>Also, with shared memory buses you can't work with that design at all.


