Author: Jeremiah Penery
Date: 15:51:31 09/03/03
On September 03, 2003 at 10:45:40, Vincent Diepeveen wrote:

>On September 03, 2003 at 00:07:37, Jeremiah Penery wrote:
>
>>On September 02, 2003 at 22:54:49, Robert Hyatt wrote:
>>
>>>Maybe. But I use threads. And on NUMA threads are _bad_. One example,
>>>do you _really_ want to _share_ all the attack bitmap stuff? That means it
>>>is in one processor's local memory, but will be slow for all others. What
>>>about the instructions? Same thing.
>>
>>After some thinking, it seems to me that the *average* memory access speed will
>>be the same no matter where the data is placed, for anything intended to be
>
>If n cpu's can access local memory at 280 ns (R14000)
>and accessing remote memory is 6-7 us, then what is faster?

If all the data is in CPU1's memory, the other processors all access it as
non-local. If the data is spread equally between CPUs, the total number of
local and non-local accesses will be the same.

As you say, the way to solve it is by copying the data to all CPUs. That
creates its own problems.

>>shared between all processors (in a small NUMA configuration). The reason for
>>this is because what is local to one processor will be non-local to all others.
>
>Unless each processor has its own local copy.
>
>Just for the record. Shipping a megabyte from 1 cpu to another cpu
>and then locally accessing it, is way faster than
>remotely accessing it @ random.

How do you keep the global data updated for each processor if you're copying
the entire thing each time to every CPU? You'll have to be "shipping megabytes"
back and forth all the time, which will be a lot slower than remotely accessing
memory once in a while. It's somewhat analogous to cache coherency. On a small
machine it might not be a big deal, but when you get a bunch of processors,
"shipping megabytes" around will soon come to *dominate* system bandwidth usage
and will slow everything else way down.
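The averaging claim above can be checked with a little arithmetic. This is only a sketch: it assumes every CPU issues the same number of uniformly distributed accesses to the shared data, the 280 ns local and ~6.5 us remote latencies are the figures quoted in the thread, and `avg_concentrated`/`avg_spread` are names made up here for the two placements:

```python
# Latency figures quoted in the thread: R14000 local ~280 ns, remote ~6-7 us.
LOCAL_NS = 280.0
REMOTE_NS = 6500.0

def avg_concentrated(n_cpus):
    # All shared data on one node: that node's accesses are local,
    # every other node's accesses are remote.
    return (1 * LOCAL_NS + (n_cpus - 1) * REMOTE_NS) / n_cpus

def avg_spread(n_cpus):
    # Data striped evenly across nodes: each node finds 1/n of the
    # data local and the remaining (n-1)/n remote.
    return (1 / n_cpus) * LOCAL_NS + ((n_cpus - 1) / n_cpus) * REMOTE_NS

for n in (2, 4, 8):
    print(f"{n} CPUs: concentrated {avg_concentrated(n):.0f} ns, "
          f"spread {avg_spread(n):.0f} ns")
```

With equal access rates the two placements come out identical, which is the point being made: spreading the data helps bandwidth (no single node's memory banks get hammered), not average latency.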
>>It doesn't matter if everything is local to the same processor or spread around,
>>because the same percentage of total accesses will be non-local in any case
>>(unless there is a disparity between the number of accesses each CPU is trying
>>to accomplish).
>>
>>The only problem is that one processor's memory banks might get hammered, but
>>that _is_ the same with a (similarly small) SMP configuration - all accesses go
>>serially through one memory controller.
>
>memory banks you can access using the HUB in parallel. Reads cost nothing,
>just the calling processor will be a bit waiting (like 6-7 us worst case, or
>like 10-30 us for bob's new 4 node thing).

If you have a crossbar, as in a big SMP machine, reads can go in parallel. But
in a small SMP machine with a bus and a regular memory controller, accesses are
made serially (though the controller can issue multiple read requests before
getting the first result back).

>>As machine size increases, of course, NUMA can run into more problems. But then
>>SMP has its own problems as well (cost and complexity of the memory sub-system,
>>mostly).
>
>Not if you directly connect each node to each node, which is what Cray does.
>
>That keeps latency very fast, but it's $$$$.

That is exactly what I said. It's expensive and complicated.

>So cc-NUMA is always slower for remote latency than Cray trivially.
>
>That you're still denying this is madness.

When did I *ever* deny that? That you're still trying to convince yourself that
you know everything is madness.
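The copy-versus-remote-access trade-off argued back and forth in this thread can also be put in rough numbers. This is only a sketch: the latencies are the ones quoted in the thread, but the 500 MB/s interconnect bandwidth is an illustrative assumption, and the model ignores contention and coherency traffic entirely:

```python
# Latencies quoted in the thread; bandwidth is an assumed illustrative figure.
LOCAL_NS = 280.0          # local access latency (R14000)
REMOTE_NS = 6500.0        # remote access latency (~6-7 us)
COPY_BYTES = 1 << 20      # "shipping a megabyte"
BANDWIDTH_B_PER_NS = 0.5  # assumed 500 MB/s interconnect = 0.5 bytes/ns

# One-time cost of shipping the whole megabyte to another node.
copy_cost_ns = COPY_BYTES / BANDWIDTH_B_PER_NS

# What each subsequent access saves by being local instead of remote.
saving_per_access_ns = REMOTE_NS - LOCAL_NS

# Number of local re-reads needed, between updates that force a re-copy,
# before copying beats just accessing the data remotely.
break_even = copy_cost_ns / saving_per_access_ns

print(f"copy cost ~ {copy_cost_ns / 1e6:.1f} ms; "
      f"break-even ~ {break_even:.0f} accesses between updates")
```

Under these assumed numbers, copying wins only if the data is re-read locally a few hundred times before it changes; data that is updated frequently forces constant re-shipping, which is the bandwidth concern raised in the reply.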
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.