Author: Robert Hyatt
Date: 18:25:42 08/15/05
I thought I would answer this in public, rather than multiple times via email: "What did you have to do to get Crafty to work correctly on the new Opteron?"

The first thing was getting it to run, and AMD installed Red Hat Linux (Enterprise Edition) on the box for me. My first quick "scaling tests" showed serious problems. Later it turned out that it was less about scaling and more about quirks in posix threads. Something is wrong with the posix threads version of the normal C library, in that it provides sporadic and flaky timing information. This made the scaling look bad, when in fact much of the problem was somewhere in the library.

A couple of years back, I had already implemented a fork() version of the parallel search for some alpha testing I was doing, since they didn't have posix threads on their NUMA system. The basic idea is that I changed the malloc() calls to use shmget() and shmat() before forking processes. This creates shared-memory blocks that remain shared after a fork(). The main change was that there were a significant number of global variables that are shared by default in posix threads. I had to collect all of those into a struct, and map the struct onto another shmget()/shmat() shared memory block. So the final result is that everything that was shared still is shared.

On Linux it is also important to note that when you do a fork(), your virtual address space is not copied on the spot. Linux uses copy-on-write, which means that fork'ed processes share instructions, and all data that is initialized before the fork() and not modified after the fork(). Memory usage therefore looks the same as with posix threads, except that now each process has a completely non-shared "stack" to use.

Since I had already developed this stuff previously, I decided to try it on the Opteron to see if it would have any effect. It seemed to fix the problem.
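For anyone who wants to see the shape of the shmget()/shmat()-before-fork() idea, here is a minimal sketch. It is not Crafty's code: the struct shared_data, the helper shared_alloc(), the function run_fork_demo() and the MAX_PROCS limit are all hypothetical names made up for illustration. It assumes a Unix system with System V shared memory available.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_PROCS 8  /* hypothetical limit, for illustration only */

/* Hypothetical stand-in for the former globals, collected into one struct. */
typedef struct {
    volatile long nodes[MAX_PROCS];  /* one slot per search process */
} shared_data;

/* Ordinary global: after fork() each process has its own copy-on-write copy. */
long private_counter = 0;

/* malloc() replacement: returns memory that stays shared across fork(). */
void *shared_alloc(size_t bytes)
{
    int id = shmget(IPC_PRIVATE, bytes, IPC_CREAT | 0600);
    if (id < 0)
        return NULL;
    void *p = shmat(id, NULL, 0);
    /* Mark the segment for removal now; it is freed once every process detaches. */
    shmctl(id, IPC_RMID, NULL);
    return p == (void *) -1 ? NULL : p;
}

/* Fork nprocs children, let each write its own slot, and return the
   total as seen by the parent. Returns -1 on any error. */
long run_fork_demo(int nprocs)
{
    if (nprocs < 1 || nprocs > MAX_PROCS)
        return -1;
    shared_data *sh = shared_alloc(sizeof *sh);  /* new segments start zeroed */
    if (sh == NULL)
        return -1;

    int started = 0;
    for (int i = 0; i < nprocs; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            sh->nodes[i] = 1000L * (i + 1);  /* visible to the parent */
            private_counter += 1;            /* changes only the child's copy */
            _exit(0);
        }
        started++;
    }
    while (wait(NULL) > 0) {
        /* reap every child before reading the results */
    }

    long total = 0;
    for (int i = 0; i < nprocs; i++)
        total += sh->nodes[i];
    shmdt(sh);
    return started == nprocs ? total : -1;
}
```

Calling run_fork_demo(4) from the parent should give back the sum of what the four children wrote into the shared block, while private_counter in the parent stays at 0, since each child only modified its own copy-on-write page.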
The fork() version fixed it because it no longer depended on the library routines that were confused by the thread stuff.

AMD then installed SUSE Linux for me, since Red Hat had serious problems with NUMA (the kernel would crash, NUMA library calls would fail with numa-not-available, etc). Once this was done, I had time to take care of the final few NUMA details in the fork() version of the code, so that the data used frequently by a single CPU was located on the node that CPU was on, which minimizes memory access time.

The final step was to tune the internal search parameters to try to avoid overloading the hot spots in a dual-core chip. There are two CPUs, two L1 caches, two L2 caches, but one memory controller and one HyperTransport. In a normal Opteron, with just one CPU, the memory controller and HyperTransport see about 1/2 the traffic they see in a dual-core, so I needed to minimize memory bandwidth (which means getting memory on the local node whenever possible) and minimize HyperTransport traffic, such as all the "noise" the 8 caches produce talking to each other. In Crafty, splitting small trees causes extra cache coherency traffic, but I have tunable parameters that can limit the splits and avoid splitting trees that are too small. Once those parameters were tuned, we were consistently seeing 15-16M nodes per second. In the TCB game today in round 4, Crafty announced a mate in 15 about 5-6 moves before the game ended, and was searching about 20M nodes per second at that point. After all of that was done, we were set...

I suspect I am going to dump posix threads completely. The current approach will work on all Unix systems, and it gets away from some strange incompatibilities. For example, on Sun, the default is N threads but one physical process (which will only use one physical CPU) unless a specific threads library call is made to say "one physical process for each thread."
Also, the last time I tested on Solaris, I was seeing a strange issue with getting the CPU time for the set of threads. So dumping threads is not an issue with respect to performance. The main benefit of threads is the ability to create and destroy them very quickly. I don't destroy threads once they are created, so if it takes a few more ms to initially create each one, I don't care.
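As a footnote on the tunable split limits mentioned earlier, the idea of refusing to split trees that are too small can be sketched as below. This is only an illustration under my own assumptions: the struct split_params, its fields min_split_depth and max_helpers, and the two functions are hypothetical and are not taken from Crafty's source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tunables; names and meanings are illustrative only. */
typedef struct {
    int min_split_depth;  /* do not split with fewer plies remaining than this */
    int max_helpers;      /* cap on processes working at one split point */
} split_params;

/* Split only when there is an idle CPU and the remaining subtree is deep
   enough to be worth the extra cache-coherency traffic. */
bool should_split(const split_params *p, int remaining_depth, int idle_cpus)
{
    return idle_cpus > 0 && remaining_depth >= p->min_split_depth;
}

/* Number of idle CPUs to attach at a split point, limited by the cap. */
int helpers_to_use(const split_params *p, int idle_cpus)
{
    if (idle_cpus < 0)
        idle_cpus = 0;
    return idle_cpus < p->max_helpers ? idle_cpus : p->max_helpers;
}
```

Raising min_split_depth in a scheme like this trades some parallelism near the leaves for less traffic between the caches, which is the kind of trade-off the tuning above was about.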
Last modified: Thu, 15 Apr 21 08:11:13 -0700