Computer Chess Club Archives


Subject: Linux problem at cc-NUMA machines

Author: Vincent Diepeveen

Date: 05:51:44 09/05/03


On September 04, 2003 at 11:08:28, Robert Hyatt wrote:

>On September 03, 2003 at 20:31:55, Vincent Diepeveen wrote:
>
>>On September 03, 2003 at 20:20:05, Vincent Diepeveen wrote:
>>
>>>On September 03, 2003 at 16:28:18, Robert Hyatt wrote:
>>>
>>>>On September 03, 2003 at 15:27:15, Vincent Diepeveen wrote:
>>>>
>>>>>On September 03, 2003 at 13:15:48, Robert Hyatt wrote:
>>>>>
>>>>>>On September 03, 2003 at 12:23:08, Vincent Diepeveen wrote:
>>>>>>
>>>>>>>On September 03, 2003 at 10:54:37, Sune Fischer wrote:
>>>>>>>
>>>>>>>>On September 03, 2003 at 10:48:31, Vincent Diepeveen wrote:
>>>>>>>>
>>>>>>>>>>I only see the need for communication when there is *something* to communicate.
>>>>>>>>>
>>>>>>>>>You answer your own question already. There is continuously something to
>>>>>>>>>communicate.
>>>>>>>>
>>>>>>>>Such as?
>>>>>>>>
>>>>>>>>Whatever it is maybe it can be redesigned by using a smarter message system.
>>>>>>>>
>>>>>>>>The parent thread doesn't need to know *what* the child thread is doing, it only
>>>>>>>>needs to know what the child thread finds, if anything at all, right?
>>>>>>>>
>>>>>>>>-S.
>>>>>>>
>>>>>>>The only one who you are confusing is yourself.
>>>>>>>
>>>>>>>DIEP runs fine at any latency, but the speedup simply gets a lot less when the
>>>>>>>latency goes up.
>>>>>>>
>>>>>>>There are many practical problems.
>>>>>>>
>>>>>>>You speak about shipping messages.
>>>>>>>
>>>>>>>When are you going to receive them? Check each millisecond?
>>>>>>>
>>>>>>>Or let the OS decide?
>>>>>>>
>>>>>>>The OS timer fires at 100Hz, so for processes that are sleeping because the
>>>>>>>OS put them to sleep (when locking, after 600 failed attempts to get the
>>>>>>>lock) you have a latency of 10 ms before the process is awake.
>>>>>>>
>>>>>>>You are aware of such problems?
>>>>>>>
>>>>>>
>>>>>>No, because there is no such problem.  If you are running something else on
>>>>>>the same CPU, then you will see that 10ms latency.  If that CPU is idle, then
>>>>>>the instant the process is unblocked it will begin execution.
>>>>>
>>>>>Wrong, the 10ms latency is there to put something in the RUN queue of the
>>>>>kernel. Though there is no technical obstacle to removing that 10ms, the OS
>>>>>programmers are not allowed to do that, because it would violate
>>>>>agreements with important software manufacturers who have written software
>>>>>that assumes a 10ms latency here; this crucial software will crash and cause
>>>>>severe problems if the latency is no longer there.
>>>>>
>>>>>The OS helpdesk.
>>>>
>>>>It absolutely does _not_ work like that.  What happens is this:
>>>>
>>>>Processes are blocked.  As an interrupt comes in, a process gets moved from
>>>>blocked to ready.  The temptation is to move that process from ready to run
>>>>if it is higher in priority than the process already in run.  But that causes
>>>>excessive context switching.  So, the process gets moved to ready and there
>>>>it sits until the next 10ms timer interrupt fires, and _then_ the scheduler
>>>>is called to move the currently running process back to ready, and the newly
>>>>ready process (of a higher priority) into running.
>>>>
>>>>That is _all_ there is to it.
>>>>
>>>>If the CPU is idle, and the interrupt comes in, the process is scheduled
>>>>_right now_, it goes from blocked to ready to run _instantly_.  No 10ms
>>>>delay.
>>>>
>>>>There is no doubt about how that works.  And your explanation is simply
>>>>garbage.  Ask some of the linux kernel guys.  Ingo Molnar is a good one to
>>>>ask although Alan Cox will also answer.
>>>
>>>The double origin3800 has 1024 CPUs. One partition (P7) has 512 processors,
>>>of which 500 can be used simultaneously to run a single cc-NUMA program.
>>
>>Now one more thing to mention here. Please don't start saying how good Linux is
>>compared to other OSes.
>>
>>For cc-NUMA it's simply a joke. Its latency and scheduling performance is very,
>>very poor compared to IRIX.
>>
>>Of all OSes on any parallel machine I would pick IRIX blindfolded for
>>performance.
>>
>>Linux has a long way to go and will never reach that. GCC never did either, and
>>has no hope of reaching it.
>
>Linux _will_ get there.  The compiler has _nothing_ to do with NUMA issues, so
>GCC is a moot point.

Linux will never get there because it is not hardware specific.

To give an example: the ALTIX3000 is hardware that Linux will never learn to
work with. Please look at the hardware diagram in the presentation by Dr Peter
Michielse (SGI Netherlands):
  http://www.sara.nl/news/recent/20030703/seminar010703_Michielse_SGI.pdf

It simply doesn't know which router connects to which SHUB, or that going
through a router is more expensive than going through a SHUB interconnect.

So when you start a new job of, say, 12 CPUs on a partly loaded machine, it
simply doesn't know how to schedule them efficiently on the machine.
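
To make the placement problem concrete, here is a rough sketch in Python. It is not how IRIX or Linux actually schedule anything. The toy topology, the "group" notion (SHUBs that reach each other without crossing a router), the cost numbers and all function names are assumptions invented for illustration. It only compares a topology-blind placement with a greedy one that knows which CPUs are close together.

```python
from itertools import combinations

# Hypothetical relative costs; real numbers depend on the machine.
COST_SAME_SHUB = 0   # both CPUs sit behind the same SHUB
COST_SHUB_LINK = 1   # direct SHUB-to-SHUB interconnect, no router involved
COST_ROUTER = 4      # path crosses at least one router


def build_topology(n_groups, shubs_per_group=2, cpus_per_shub=2):
    """Return {cpu_id: (group, shub)} for a toy machine.

    A "group" is an invented name for SHUBs that reach each other
    without going through a router.
    """
    topo = {}
    cpu = 0
    for group in range(n_groups):
        for shub in range(shubs_per_group):
            for _ in range(cpus_per_shub):
                topo[cpu] = (group, shub)
                cpu += 1
    return topo


def distance(topo, a, b):
    """Relative communication cost between two CPUs."""
    group_a, shub_a = topo[a]
    group_b, shub_b = topo[b]
    if group_a != group_b:
        return COST_ROUTER
    if shub_a != shub_b:
        return COST_SHUB_LINK
    return COST_SAME_SHUB


def total_cost(topo, cpus):
    """Sum of pairwise costs, a crude stand-in for job latency."""
    return sum(distance(topo, a, b) for a, b in combinations(cpus, 2))


def naive_place(free_cpus, k):
    """Topology-blind: take the k lowest-numbered free CPUs."""
    return sorted(free_cpus)[:k]


def topology_aware_place(topo, free_cpus, k):
    """Greedy: fill the groups with the most free CPUs first."""
    by_group = {}
    for cpu in free_cpus:
        by_group.setdefault(topo[cpu][0], []).append(cpu)
    candidates = []
    for group in sorted(by_group, key=lambda g: (-len(by_group[g]), g)):
        # Keep CPUs that share a SHUB next to each other.
        candidates.extend(sorted(by_group[group], key=lambda c: (topo[c][1], c)))
    if len(candidates) < k:
        raise ValueError("not enough free CPUs")
    return candidates[:k]


topo = build_topology(n_groups=6)             # 24 CPUs in total
busy = {1, 2, 3, 5, 6, 7, 9, 10, 11}          # a partly loaded machine
free = set(topo) - busy
naive = naive_place(free, 12)
aware = topology_aware_place(topo, free, 12)
print("naive:", naive, "cost", total_cost(topo, naive))
print("aware:", aware, "cost", total_cost(topo, aware))
```

In this toy setup the topology-aware placement packs the 12-CPU job into three fully free groups and ends up with a lower total cost than the blind one, which scatters the job over six groups. The greedy rule is not optimal in general; it only shows that the scheduler needs the topology as input before it can do any better than blind placement.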

If you do not understand this *basic* NUMA problem then you will *never* get a
clue about scheduling *ever*.

It's like a compiler that doesn't take into account that a processor uses fall
through as its basic branch prediction mechanism, that doesn't know how to
avoid partial register stalls, and that doesn't know that branches carry
a large penalty.

That is basic processor knowledge, just as the layout of a machine is basic
scheduling knowledge.

Now don't blame SGI for this. The machine is GREAT. It is simply an improved
origin3800 system and IRIX schedules very well on it.

It is trivial that a SHUB is more efficient, cheaper and faster than extra
routers.

The Linux kernel and GCC are well known for not being as hardware specific as
rival OSes and products.

To this day that's why GCC is slower than other compilers, but it still does a
good job compared to how poorly Linux is doing on such 64-processor cc-NUMA
machines.

Over the last months I have witnessed that with the first NUMA kernels for Linux.
We're talking about such a dumb way of scheduling here that better scheduling
could have improved latency by a factor of 2.

Best regards,
Vincent



Last modified: Thu, 15 Apr 21 08:11:13 -0700
