Computer Chess Club Archives


Subject: Re: UNIX question: WaitForSingleObject() under IRIX/Linux

Author: Robert Hyatt

Date: 08:46:34 09/26/02

On September 26, 2002 at 08:17:11, Vincent Diepeveen wrote:

>On September 25, 2002 at 16:10:51, Robert Hyatt wrote:
>
>>On September 25, 2002 at 14:45:01, Vincent Diepeveen wrote:
>>
>>>On September 25, 2002 at 14:17:55, Robert Hyatt wrote:
>>>
>>>>On September 25, 2002 at 11:49:25, Russell Reagan wrote:
>>>>
>>>>>On September 25, 2002 at 08:10:06, Vincent Diepeveen wrote:
>>>>>
>>>>>>i cannot use select() at all as i limit myself to  < 128 processor
>>>>>>partitions then.
>>>>>
>>>>>Does WaitForSingleObject or WaitForMultipleObjects allow you to use more than
>>>>>128 processors?
>>>>>
>>>>>>also i have no idea how to get it to work and whether it can do
>>>>>>it 400 times a second instantly.
>>>>>
>>>>>See the problems Microsoft causes? They always have to be different (in a bad
>>>>>evil kind of way).
>>>>
>>>>
>>>>The very concept of synchronizing > 128 processes with such a system call
>>>>defies any sort of logic I can think of.  Doing it hundreds of times a second
>>>>only guarantees horrible performance.  Doing this every now and then would
>>>>be ok.  But not "400 times a second".  There are other ways to eliminate that
>>>>kind of stuff...
>>>
>>>the first 10 ply you get out of hashtable with just 1 processor obviously.
>>>
>>>Only after that the other 511 can join in. they manage themselves after
>>>this. I don't want to let them poll while they are not searching. i have
>>>no option there.
>>
>>You simply don't understand memory architecture yet.  "polling" works
>>perfectly.  You wait on a pointer to become non-zero for example.  While
>>you are "polling" it takes _no_ bandwidth because the value lands in your
>>local cache.  All the cache controllers have to do is inform each other when
>>something changes.  IE this is how the Intel boxes work and they do that quite
>>well.
>>
>>NUMA shouldn't change that basic idea...
>
>There is a good reason why it even goes wrong at NUMA.
>
>The basic problem is that also at the NUMA machines that each node
>is 2 processors and not 1, in the latter case you would be correct.

This varies.  I have run on an Alpha with 4 CPUs per node.  I have run on
a Xeon with 1 CPU per "node" (non-NUMA).  Intel is building multiple NUMA
platforms, although it seems they are settling on 4 CPUs per node as the
basic building block, something like N-cube did years ago.

>
>We can see this dual cpu as next:
>
>  CPU1 -----
>           |--- HUB --| Memory banks for CPU1 & CPU2
>  CPU2 -----     |
>                 |
>                SN0 router
>
>So if 1 cpu is hammering on a single cache line in local memory
>it is using the hub to access the memory banks. This is true for
>the R14000 design which has 2GB local memory at the memory banks
>and 8MB L2 cache.

That should _not_ be working like that.  If a CPU is hammering on a single
cache line, it should not even be talking to the hub.  It should be talking
to its _local_ cache.  That's the point of what are often called "shadow"
operations.  You make a "shadow" of something by putting it in cache, to get
off the system bus.  That is how my spinlocks work on my quad Xeon, in fact:

#  define Lock(p)               while(exchange(&p,1)) while(p)

The "exchange" is horribly slow on the intel box, because it locks the bus
down tight while reading/writing to accomplish the exchange.  But it does it
in an atomic manner which is required for true locks.

However, if the exchange says "value is already non-zero", I then do a simple
while(p) which spins on the value of p in the L1 cache.  No bus activity at
all while I am spinning.  When _another_ processor writes to p, my cache
controller finds out and invalidates that cache line.  That causes it to
do a cache line fill to get the new value.  If that value is zero, the
while(p) terminates and we go back to try a real exchange(&p,1) atomically.

That is called a "shadow lock" and is _the_ solution for keeping spinning
processes off the memory bus so that they do not interfere with other processors.
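
A minimal sketch of the same test-and-test-and-set ("shadow lock") idea, written
with the GCC __sync atomic builtins instead of Crafty's exchange() macro (the
shadow_lock/shadow_unlock names are purely illustrative):

static void shadow_lock(volatile int *p) {
  while (__sync_lock_test_and_set(p, 1)) {   /* atomic exchange; this locks the bus */
    while (*p)                               /* spin in the local cache only        */
      ;
  }
}

static void shadow_unlock(volatile int *p) {
  __sync_lock_release(p);                    /* store 0 to release the lock         */
}

Only the exchange inside shadow_lock pays the bus-locking penalty, and it is only
attempted when the cached copy of the lock looks free.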




>
>However it can't be working only within the L2 cache, because it is
>a read straight from local memory.

Then you don't have a real L2 cache there, because the purpose of any
cache is to _avoid_ going to real memory whenever possible, whether that
memory is local or not.

>
>Additionally in all supercomputer designs each cache line of memory
>has a number of bits added to it for who owns the memory cache line.
>
>You are correctly referring to the fact that if i do a write to the
>memory now, that cpu1 is only referencing to the local memory.
>
>So your assumption is correct if each node was a single cpu.
>
>However it is a dual and sometimes even a quad. This means in short
>that many cpu's, at least CPU2 is hammering also at the same time
>in the memory, and the bandwidth from this is only a fraction of
>what the 8MB L2 cache delivers. Every reference to main memory goes
>through the Hub simply which delivers 600MB/s effectively.

However, if a node is a quad, then each processor had _better_ have a real
processor cache or the machine is _never_ going to perform, since the
router speed can't even feed one CPU fully, much less four.  Better check the
processor specs on that box, as I can't imagine MIPS doing something that
poorly, based on past things they have produced.

>
>We can both imagine that this is a major problem at supercomputers.
>
>Therefore the approach to let them idle while not doing a job is
>the right one.


I simply don't think they operate as you describe; otherwise they would
perform like dogs, period.  Each CPU _must_ have a local processor cache,
or else a group of four CPUs must share a 4-port local cache.
They have to have _something_ to avoid memory I/O, even if it is local memory.
Memory is _always_ slow, local or not.




>
>You are obviously correct if the system design would be each node
>is a single cpu based. That is not the case however.
>
>Another effect is that the SN0 router is also attached to the Hub,
>i will not describe the effects here as i have no theory to base myself
>upon, but from my experiments i saw that if one of the CPU's is poking
>into its own local memory, that the SN0 router is busy somehow too with
>things which slows down the entire search simply as it eats away
>bandwidth.
>
>My knowledge of what intelligent algorithms the SN0 router uses
>to predict things is very limited. Near to zero actually. But somehow
>i feel it is not a good idea to keep the SN0 router also busy indirectly
>with reads to the local memory while CPU2 is searching its own way.
>
>So the absolutely best solution to the problem is pretty trivial and that's
>letting processes that do not search idle simply.


What is wrong with the above is the definition of "idle".  If _your_ process
is not running, what is the O/S doing?  It is executing code all the time
itself, the famous "idle loop"...
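
For what it's worth, the usual UNIX stand-in for WaitForSingleObject() when a
helper really should sleep rather than spin is a POSIX condition variable; a
rough sketch (the work_ready flag and function names are only illustrative):

#include <pthread.h>

static pthread_mutex_t work_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_cond  = PTHREAD_COND_INITIALIZER;
static int work_ready = 0;

void wait_for_work(void) {            /* helper blocks here, using no CPU      */
  pthread_mutex_lock(&work_mutex);
  while (!work_ready)
    pthread_cond_wait(&work_cond, &work_mutex);
  work_ready = 0;
  pthread_mutex_unlock(&work_mutex);
}

void hand_out_work(void) {            /* splitting processor wakes one helper  */
  pthread_mutex_lock(&work_mutex);
  work_ready = 1;
  pthread_cond_signal(&work_cond);
  pthread_mutex_unlock(&work_mutex);
}

The cost is a trip through the kernel scheduler on every wakeup, which is exactly
why doing it "400 times a second" per helper is worth avoiding if a cache-resident
spin will do.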


>
>The real basic problem one is confronted with in general is that initially
>a n processor partition idles a lot more than a dual or quad cpu box is
>doing.
>
>So we both hardly feel the fact that at the PC we are poking into a single
>cache line. At the supers this is a different thing however.
>
>They do not like me wasting bandwidth at all :)
>
>>
>>
>>
>>>
>>>Just imagine how much bandwidth that takes away.
>>>
>>>The machine has 1 TB a second bandwidth. This 512p partition has
>>>about half of that (machine has 1024p in total). If all these procs are
>>>spinning around that will confuse the SN0 routers completely.
>>
>>Not with cache.
>>
>>
>>>
>>>From that 0.5 TB bandwidth about 0.25TB a second is reserved for harddisk
>>>i/o. I can't use it. I use the other 600MB/s which the hub delivers for
>>>2 processors for memory bandwidth.
>>>
>>>We all can imagine what happens if the hubs are only busy delivering reads
>>>to the RAM.
>>>
>>>Right now each processor spins at its own shared memory variables,
>>>that's simply *not* a good design idea for NUMA. It is not working for
>>>any machine actually above 8 procs.
>>
>>It would work fine on a Cray with 32...
>>
>>
>>>
>>>Also shared memory buses you can't work with that design at all.


