Author: Robert Hyatt
Date: 10:06:34 09/03/03
On September 03, 2003 at 00:07:37, Jeremiah Penery wrote:

>On September 02, 2003 at 22:54:49, Robert Hyatt wrote:
>
>>Maybe. But I use threads. And on NUMA threads are _bad_. One example:
>>do you _really_ want to _share_ all the attack bitmap stuff? That means it
>>is in one processor's local memory, but will be slow for all others. What
>>about the instructions? Same thing.
>
>After some thinking, it seems to me that the *average* memory access speed will
>be the same no matter where the data is placed, for anything intended to be
>shared between all processors (in a small NUMA configuration).

The point for the "Crafty algorithm" is that I rarely share things among _all_
processors, except for the transposition/refutation table and the pawn hash
table. Split blocks are shared, but the idea is not so easy to explain. To try:
when a single processor is searching and notices that there are idle
processors, it takes its own split block and copies the data to N new split
blocks, one per processor. For all normal searching, each processor uses only
its own split block, except at the position where the split occurred. There the
parent split block is accessed by all threads to get the next move to search.
That is not a very frequent access, so the penalties there are acceptable. But
for the _rest_ of the work each processor does, I use a local split block so
that each processor runs at maximum speed.

That was the main change. Without that "fix" it ran very poorly: there was so
much non-local memory traffic that performance was simply bad. With the fix,
things worked much better.

>The reason for
>this is because what is local to one processor will be non-local to all others.
>It doesn't matter if everything is local to the same processor or spread around,
>because the same percentage of total accesses will be non-local in any case
>(unless there is a disparity between the number of accesses each CPU is trying
>to accomplish).

That is _the_ point.
A single processor spends 99.99999% of its time accessing its local "tree"
structure. It spends the other .00001% of the time accessing the shared tree
structure to get the next move at the split point. If you make _all_ the memory
accesses non-local except for the one lucky processor that has it all local,
then things go bad, quickly. You have a hot spot, and the "bandwidth" Vincent
likes to ramble about won't help one bit. If every processor beats on one
processor's local memory, effective bandwidth is _very_ low.

>The only problem is that one processor's memory banks might get hammered, but
>that _is_ the same with a (similarly small) SMP configuration - all accesses go
>serially through one memory controller.

Yes, but that memory controller has about 4x the bandwidth of a single-CPU
memory controller, thanks to 4-way memory interleaving, so the loss is not
significant. 8-way boxes have a significant problem (say the 8-way Dell Xeon
server) because they don't try to recover by using 8-way interleaving.

>As machine size increases, of course, NUMA can run into more problems. But then
>SMP has its own problems as well (cost and complexity of memory sub-system,
>mostly).

This is _all_ about price and scalability. NUMA scales well, both in terms of
price per processor and bandwidth per processor, but it has some significant
"issues" that make programming more complex and less efficient. Pure SMP boxes
don't have the performance issues, but they don't scale to large numbers of
processors, either physically or with respect to price.
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.