Computer Chess Club Archives



Subject: crafty

Author: Robert Hyatt

Date: 18:25:42 08/15/05


Thought I would answer this in public, rather than multiple times via email.

"What did you have to do to get Crafty to work correctly on the new opteron?"

The first thing was getting it to run, and AMD installed red hat linux
(enterprise edition) on the box for me.  My first quick "scaling tests" showed
serious problems.  Later it turned out that this was less about scaling and more
about quirks in posix threads.  Something is wrong with the posix threads
version of the normal C library, in that it provides sporadic, flaky timing
information.  That made the scaling look bad, when in fact much of the problem
was somewhere in the library itself.

A couple of years back, I had already implemented a fork() version of the
parallel search for some alpha testing I was doing, since they didn't have posix
threads on their NUMA system.  The basic idea is that I changed the malloc()
calls to use shmget() and shmat() before forking processes.  This creates
shared-memory blocks that will remain shared after a fork().  The main change
involved the significant number of global variables that are shared by default
under posix threads: I had to collect all of those into a struct, and map that
struct onto another shmget()/shmat() shared-memory block.  So the final result
is that everything that was shared before still is shared.

On linux it is also important to note that when you do a fork(), your virtual
address space is not copied on the spot.  Linux uses copy-on-write, which means
that fork'ed processes share instructions, and all data that is initialized
before the fork() and not modified after the fork().  Memory usage therefore
looks the same as with posix threads, except that now each unique process has a
completely non-shared "stack" to use.

Since I had already developed this stuff previously, I decided to try it on the
opteron to see if it would have any effect.  It fixed the problem, because it no
longer depended on the library routines that were confused by the threading
code.  AMD then installed SUSE linux for me, since redhat had serious problems
with NUMA (the kernel would crash, NUMA library calls would fail with
numa-not-available, etc.).  Once that was done, I had time to take care of the
final few NUMA details in the fork() version of the code, placing the data used
frequently by a single CPU on that CPU's node so that memory access time would
be minimized.

The final step was to tune the internal search parameters to try to avoid
overloading the hot spots in a dual-core chip.  There are two cpus, two L1
caches, and two L2 caches, but one memory controller and one hypertransport.  In
a normal opteron, with just one cpu, the memory controller and hypertransport
see about half the traffic they see in a dual-core, so I needed to minimize
memory bandwidth (which means getting memory on the local node whenever
possible) and minimize hypertransport traffic, such as all the "noise" the 8
caches produce talking to each other.  In Crafty, splitting small trees causes
extra cache-coherency traffic, but I have tunable parameters that limit splits
and avoid splitting trees that are too small.  Once those parameters were tuned,
we were consistently seeing 15-16M nodes per second.  In the TCB game today in
round 4, Crafty announced a mate in 15 about 5-6 moves before the game ended,
and was searching about 20M nodes per second at that point.

After all of that was done, we were set...

I suspect I am going to dump posix threads completely.  The current approach
will work on all unix systems, and it gets away from some strange
incompatibilities.  For example, on Sun the default is N threads but one
lightweight process (which will only use one physical CPU) unless a specific
threads-library call is made to say "one lightweight process for each thread."
Also, the last time I tested on Solaris, I was seeing a strange issue with
getting the CPU time for the set of threads.  So dumping threads is not an issue
with respect to performance.  The main benefit of threads is the ability to
create and destroy them very quickly; I don't destroy threads once they are
created, so if it takes a few more ms to initially create each one, I don't
care.




