Computer Chess Club Archives



Subject: Re: UNIX question: WaitForSingleObject() under IRIX/Linux

Author: Vincent Diepeveen

Date: 05:17:11 09/26/02



On September 25, 2002 at 16:10:51, Robert Hyatt wrote:

>On September 25, 2002 at 14:45:01, Vincent Diepeveen wrote:
>
>>On September 25, 2002 at 14:17:55, Robert Hyatt wrote:
>>
>>>On September 25, 2002 at 11:49:25, Russell Reagan wrote:
>>>
>>>>On September 25, 2002 at 08:10:06, Vincent Diepeveen wrote:
>>>>
>>>>>i cannot use select() at all as i limit myself to  < 128 processor
>>>>>partitions then.
>>>>
>>>>Does WaitForSingleObject or WaitForMultipleObjects allow you to use more than
>>>>128 processors?
>>>>
>>>>>also i have no idea how to get it to work and whether it can do
>>>>>it 400 times a second instantly.
>>>>
>>>>See the problems Microsoft causes? They always have to be different (in a bad
>>>>evil kind of way).
>>>
>>>
>>>The very concept of synchronizing > 128 processes with such a system call
>>>defies any sort of logic I can think of.  Doing it hundreds of times a second
>>>only guarantees horrible performance.  Doing this every now and then would
>>>be ok.  But not "400 times a second".  There are other ways to eliminate that
>>>kind of stuff...
>>
>>the first 10 ply you get out of hashtable with just 1 processor obviously.
>>
>>Only after that the other 511 can join in. they manage themselves after
>>this. I don't want to let them poll while they are not searching. i have
>>no option there.
>
>You simply don't understand memory architecture yet.  "polling" works
>perfectly.  You wait on a pointer to become non-zero for example.  While
>you are "polling" it takes _no_ bandwidth because the value lands in your
>local cache.  All the cache controllers have to do is inform each other when
>something changes.  IE this is how the Intel boxes work and they do that quite
>well.
>
>NUMA shouldn't change that basic idea...
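The "wait on a pointer to become non-zero" idea described above can be sketched roughly as follows. This is a minimal illustration with hypothetical names, assuming C11 atomics and POSIX threads (compile with `-pthread`), not the actual code under discussion:

```c
/* Sketch of spin-waiting on a flag: while nothing changes, each
 * re-read normally hits the waiter's own cache line; coherence
 * traffic occurs only when another cpu writes the line. */
#include <pthread.h>
#include <stdatomic.h>

static atomic_int work_ready = 0;   /* the "pointer" being polled */
static int result = 0;

static void *worker(void *arg)
{
    (void)arg;
    /* Busy-wait: no syscall, no sleep, just repeated reads. */
    while (atomic_load_explicit(&work_ready, memory_order_acquire) == 0)
        ;
    result = 42;                    /* pretend to do the search work */
    return NULL;
}

/* Master releases the worker with a single write, then collects it. */
int spin_demo(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    /* Publishing work invalidates the worker's cached copy once. */
    atomic_store_explicit(&work_ready, 1, memory_order_release);
    pthread_join(t, NULL);
    return result;
}
```

Whether the repeated reads really stay cache-local is exactly the point being argued below.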

There is a good reason why it goes wrong even on NUMA.

The basic problem is that on these NUMA machines each node has 2 processors, not 1; in the latter case you would be correct.

We can picture such a dual-cpu node as follows:

  CPU1 -----
           |--- HUB --| Memory banks for CPU1 & CPU2
  CPU2 -----     |
                 |
                SN0 router

So if one cpu is hammering on a single cache line in local memory, it is using the hub to access the memory banks. This is true for the R14000 design, which has 2GB of local memory in the memory banks and an 8MB L2 cache.

However, the polling cannot stay entirely within the L2 cache, because it is a read straight from local memory.

Additionally, in all supercomputer designs each cache line of memory has a number of bits added to it that record who owns that cache line.

You are correctly pointing out that if I then do a write to that memory, cpu1 is only referencing local memory.

So your assumption would be correct if each node were a single cpu.

However, a node is a dual and sometimes even a quad. In short this means that more than one cpu, at least CPU2, is hammering at the same memory at the same time, and the bandwidth available for this is only a fraction of what the 8MB L2 cache delivers. Every reference to main memory simply goes through the hub, which delivers 600MB/s effectively.

We can both imagine that this is a major problem on supercomputers.

Therefore the approach of letting them idle while they have no job is the right one.

You would obviously be correct if the design were a single cpu per node. That is not the case, however.

Another effect is that the SN0 router is also attached to the hub. I will not describe the effects here, as I have no theory to base myself on, but in my experiments I saw that when one of the cpus is poking into its own local memory, the SN0 router is somehow kept busy too, which simply slows down the entire search as it eats away bandwidth.

My knowledge of what intelligent algorithms the SN0 router uses to predict things is very limited. Near zero, actually. But somehow I feel it is not a good idea to keep the SN0 router indirectly busy with reads to local memory while CPU2 is searching its own way.

So the best solution to the problem is pretty trivial: simply let the processes that are not searching idle.
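The "let them idle" alternative can be sketched with a POSIX condition variable: a blocked worker sleeps in the kernel and generates no memory traffic at all until it is explicitly woken. This is a minimal illustration with hypothetical names, assuming POSIX threads (compile with `-pthread`), not the actual engine code:

```c
/* Sketch of idling workers: instead of spinning on a shared line,
 * a worker blocks in pthread_cond_wait() until the master hands it
 * a job, so it consumes no hub bandwidth while waiting. */
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
static int have_job  = 0;
static int jobs_done = 0;

static void *idle_worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!have_job)                    /* loop guards spurious wakeups */
        pthread_cond_wait(&wake, &lock); /* sleeps; no polling */
    have_job = 0;
    jobs_done++;                         /* pretend to run the search job */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Hands one job to the sleeping worker and waits for completion. */
int idle_demo(void)
{
    pthread_t t;
    pthread_create(&t, NULL, idle_worker, NULL);
    pthread_mutex_lock(&lock);
    have_job = 1;                        /* publish the job... */
    pthread_cond_signal(&wake);          /* ...and wake one worker */
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return jobs_done;
}
```

The trade-off is wakeup latency: a signal goes through the kernel scheduler, whereas a spinning worker reacts within a cache-miss time, which is why the choice depends on how often workers are woken.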

The real basic problem one is confronted with in general is that an n-processor partition initially idles a lot more than a dual or quad cpu box does.

So on a PC we both hardly feel the fact that we are poking into a single cache line. On the supers, however, this is a different matter.

They do not like me wasting bandwidth at all :)

>
>
>
>>
>>Just imagine how much bandwidth that takes away.
>>
>>The machine has 1 TB a second bandwidth. This 512p partition has
>>about half of that (machine has 1024p in total). If all these procs are
>>spinning around that will confuse the SN0 routers completely.
>
>Not with cache.
>
>
>>
>>From that 0.5 TB bandwidth about 0.25TB a second is reserved for harddisk
>>i/o. I can't use it. I use the other 600MB/s which the hub delivers for
>>2 processors for memory bandwidth.
>>
>>We all can imagine what happens if the hubs are only busy delivering reads
>>to the RAM.
>>
>>Right now each processor spins at its own shared memory variables,
>>that's simply *not* a good design idea for NUMA. It is not working for
>>any machine actually above 8 procs.
>
>It would work fine on a Cray with 32...
>
>
>>
>>Also shared memory buses you can't work with that design at all.


