Computer Chess Club Archives



Subject: Re: Question for Eugene

Author: Robert Hyatt

Date: 14:52:07 08/18/05



On August 18, 2005 at 14:43:46, Eugene Nalimov wrote:

>On August 18, 2005 at 06:15:15, Robert Hyatt wrote:
>
>>On August 16, 2005 at 19:00:48, Eugene Nalimov wrote:
>>
>>>On August 15, 2005 at 22:19:36, Robert Hyatt wrote:
>>>
>>>>In NUMA linux, when I malloc() or shmget() or whatever any kind of memory, it
>>>>isn't actually allocated on a specific node until the page is faulted in on a
>>>>reference.  This lets me shmget() the TREE data for each process before I fork()
>>>>the processes, then each process initializes its own TREE blocks, which faults
>>>>them into the physical memory on the node where that particular process is
>>>>running.
>>>>
>>>>Does windows behave the same way, or is the mallocInterleaved() approach
>>>>currently used in Crafty the best approach?  I'm going to have to do a little
>>>>tweaking to make the current program approach behave on windows, and if windows
>>>>allocates physical memory like linux, it makes the approach work on both, if
>>>>not, oh well...
>>>
>>>Look at the code I wrote. There are 2 functions:
>>>
>>>void *WinMalloc(size_t cbBytes, int iThread)
>>>void *WinMallocInterleaved(size_t cbBytes, int cThreads)
>>>
>>>Basically what is done in the first one is:
>>>* remember current CPU affinity mask
>>>* force current thread to be executed on CPU#iThread
>>>* allocate memory
>>>* fill it with zeroes, so it will be committed
>>>* restore CPU affinity mask
>>>
>>>The second function is very similar:
>>>* remember current CPU affinity mask
>>>* loop for CPU 0..N
>>>  * force current thread to be executed on that CPU
>>>  * allocate some memory
>>>  * fill it with zeroes, so it will be committed
>>>* restore CPU affinity mask
>>>
>>>Thanks,
>>>Eugene
>>
>>
>>I understood that part.  What wasn't clear was this:
>>
>>Suppose I malloc() everything up front, but do not touch it.  Then as threads
>>are spawned, they zero their own "split blocks" which on linux causes those
>>pages to be "faulted in" to the resident set, and the physical RAM is allocated
>>on the local node where they are first accessed.  It sort of looks like Windows
>>does the same thing based on your "allocate and touch" approach.
>>
>>Linux gives me a couple of approaches.  One as above is the simplest.  I can
>>also specify that memory be allocated on a specific node, but I am not sure that
>>is totally compatible with the shmget()/shmat() approach I am using to avoid
>>POSIX threads.
>>
>>What we have certainly works, but if windows behaves like linux, so that I can
>>malloc up front, and then touch as the threads get initialized, overall the code
>>will be a bit simpler since then both will be doing the same thing...
>>
>>Hence my question... :)
>
>I would not bet that malloc() does not touch the memory it allocates, or that it
>always returns not-yet-committed memory, or that the memory is cache-line aligned.
>If you noticed, for NUMA I am not using malloc() but Windows API calls that first
>reserve and then commit memory.
>
>Your change will probably work, but it will require extra testing...
>
>Thanks,
>Eugene
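
For comparison, the "remember affinity, pin, allocate, zero, restore" steps listed for WinMalloc() can be sketched on the Linux side. This is only an analogue under stated assumptions, not the Windows code in Crafty: it uses sched_setaffinity() and an anonymous mmap() instead of the Windows reserve/commit calls, it assumes the kernel's default local-node allocation policy, and the names alloc_on_cpu() and first_allowed_cpu() are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: first CPU the calling thread may run on, or -1. */
int first_allowed_cpu(void) {
    cpu_set_t mask;
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            return cpu;
    return -1;
}

/* Hypothetical Linux analogue of the steps listed for WinMalloc().
 * Returns zeroed, page-aligned memory first touched on `cpu`, or NULL. */
void *alloc_on_cpu(size_t nbytes, int cpu) {
    cpu_set_t saved, pinned;
    assert(nbytes > 0 && cpu >= 0 && cpu < CPU_SETSIZE);

    /* Remember the current CPU affinity mask. */
    if (sched_getaffinity(0, sizeof(saved), &saved) != 0)
        return NULL;

    /* Force the calling thread to run on the requested CPU. */
    CPU_ZERO(&pinned);
    CPU_SET(cpu, &pinned);
    if (sched_setaffinity(0, sizeof(pinned), &pinned) != 0)
        return NULL;

    /* Allocate; an anonymous mapping has no physical pages behind it yet. */
    void *p = mmap(NULL, nbytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Fill with zeroes so every page is faulted in while we are pinned. */
    if (p != MAP_FAILED)
        memset(p, 0, nbytes);

    /* Restore the original affinity mask. */
    sched_setaffinity(0, sizeof(saved), &saved);
    return p == MAP_FAILED ? NULL : p;
}
```

Something like alloc_on_cpu(size, first_allowed_cpu()) exercises it on any machine. An interleaved variant would repeat the pin-and-zero step once per CPU over successive chunks.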


The memory I am allocating via shmget() must be cache-aligned, because it
always starts on a page boundary and is allocated in multiples of the
hardware page size only...
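
A quick way to check the alignment claim is a minimal sketch like the one below. It assumes the page size is a multiple of the cache-line size (true on the usual hardware, e.g. 4096 vs. 64 bytes), so page alignment implies cache-line alignment. The function name shm_is_page_aligned() is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <unistd.h>

/* Hypothetical check: returns 1 if a fresh shmget()/shmat() segment starts
 * on a page boundary, 0 if it does not, -1 if the segment cannot be set up. */
int shm_is_page_aligned(size_t nbytes) {
    assert(nbytes > 0);
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    int id = shmget(IPC_PRIVATE, nbytes, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    void *p = shmat(id, NULL, 0);   /* let the kernel pick the address */
    shmctl(id, IPC_RMID, NULL);     /* segment goes away after last detach */
    if (p == (void *)-1)
        return -1;

    int aligned = ((uintptr_t)p % (uintptr_t)page) == 0;
    shmdt(p);
    return aligned;
}
```

Calling shm_is_page_aligned(100) asks for far less than a page; the kernel still rounds the segment up to whole pages.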

Linux has a similar function.  It is possible to say "this must be put on node
x" and then any memory pages you touch to "fault in" beyond that point get their
physical pages from node x's memory.  But I can't directly use the built-in
intrinsic for that, as I need shmget() so that the memory is shared across the
processes, as opposed to malloc() memory, which would become "private" since it
is not shared by definition...
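
The shmget()-before-fork() arrangement can be sketched as follows. This is an illustration under assumptions, not Crafty's code: NPROC, BLOCK_BYTES and first_touch_demo() are made-up names, nothing here pins a process to a node, and each child fills its block with a nonzero marker (rather than zeroes) only so the parent can verify that the writes really landed in shared memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

enum { NPROC = 4, BLOCK_BYTES = 4096 };   /* hypothetical sizes */

/* Returns how many per-process blocks the parent sees correctly
 * initialized by the children, or -1 if the segment cannot be set up. */
int first_touch_demo(void) {
    int id = shmget(IPC_PRIVATE, (size_t)NPROC * BLOCK_BYTES, IPC_CREAT | 0600);
    if (id < 0)
        return -1;
    unsigned char *base = shmat(id, NULL, 0);
    shmctl(id, IPC_RMID, NULL);           /* removed after the last detach */
    if (base == (void *)-1)
        return -1;

    /* The parent has not touched any page yet.  Each child is the first
     * to write its own block, so that is where the pages get faulted in. */
    for (int i = 0; i < NPROC; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            memset(base + (size_t)i * BLOCK_BYTES, i + 1, BLOCK_BYTES);
            _exit(0);
        }
    }
    while (wait(NULL) > 0) {
        /* reap every child before inspecting the blocks */
    }

    int ok = 0;
    for (int i = 0; i < NPROC; i++) {
        const unsigned char *blk = base + (size_t)i * BLOCK_BYTES;
        if (blk[0] == i + 1 && blk[BLOCK_BYTES - 1] == i + 1)
            ok++;
    }
    shmdt(base);
    return ok;
}
```

Had the blocks come from malloc() instead, each child would have written to its own private copy and the parent would see none of the markers.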

The only headache I have found is that it is hard to verify where something is
loaded into physical RAM.  I did some unkosher things to see which physical RAM
pages were being used for the split blocks, and it all was done correctly.
With threads, there is a problem with the first page of a thread's stack being
allocated on the node that creates the thread...  Using fork(), even this is not
an issue, due to the unix copy-on-write VM approach (most everyone uses
copy-on-write in unix, I suspect windows does as well...).
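
On Linux kernels that provide the move_pages() system call, page placement can also be queried from user space. The sketch below is an assumption about one way to do that, not the method used above: it issues the raw system call with a NULL node list, which asks the kernel to report where a page currently lives instead of moving it. The name node_of_page() is made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: NUMA node holding the page that contains `addr`.
 * Returns -1 if the system call is unavailable or fails; may return a
 * negative errno value (e.g. -ENOENT) if the page is not faulted in. */
int node_of_page(void *addr) {
    assert(addr != NULL);
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    /* Round down to the start of the page. */
    uintptr_t start = (uintptr_t)addr & ~((uintptr_t)page - 1);
    void *pages[1] = { (void *)start };
    int status[1] = { -1 };

    /* pid 0 = this process; nodes == NULL = query only, move nothing. */
    long rc = syscall(SYS_move_pages, 0L, 1UL, pages, NULL, status, 0L);
    if (rc != 0)
        return -1;
    return status[0];
}
```

Calling node_of_page() on the first byte of each split block, after the owning process has touched it, would show whether the blocks ended up on the intended nodes.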




Last modified: Thu, 15 Apr 21 08:11:13 -0700
