Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: How to make your movegen 4x slower in 1 easy step (more)

Author: Robert Hyatt
Date: 07:21:02 08/31/01
On August 31, 2001 at 10:18:13, Robert Hyatt wrote:

>On August 31, 2001 at 09:03:28, Vincent Diepeveen wrote:
>
>>On August 31, 2001 at 00:20:28, Robert Hyatt wrote:
>>
>>>On August 30, 2001 at 20:45:07, Vincent Diepeveen wrote:
>>>
>>>>On August 30, 2001 at 13:56:53, Scott Gasch wrote:
>>>>
>>>>Some years ago i was faced with the same problems as you
>>>>face now with.
>>>>
>>>>Without doubt the best solution for windows is what Andrew Dados
>>>>suggests, the global thread variables.
>>>>
>>>>One small problem is that your program only works for windows then,
>>>>and monsoon no longer works then under linux.
>>>>
>>>>You can slow down your program, bob is always a fan of that, as overhead
>>>>doesn't need to be 400% of course. Bob estimates it at i think 10% or so?
>>>>
>>>>But by far the simplest solution to get rid of all these problems is
>>>>to get multiprocessing.
>>>>
>>>>Whatever way you search parallel, multiprocessing is faster if you
>>>>want to avoid all the tough global thread variable definitions!
>>>>
>>>>Also at the superb dual AMD SMP chipset there is no longer any disadvantage
>>>>in multiprocessing.
>>>>
>>>>For BSD it even has more advantages, as bsd can only do multiprocessing,
>>>>i heart multithreading might deliver problems under bsd at a multiprocessor
>>>>machine.
>>>>
>>>>My tip go for a 0% overhead, and 0% problem thing and go multiprocessing.
>>>>
>>>>whether you multithread or singlethread, that hashtable you need to share
>>>>anyway, so who cares?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>The problem is granularity.  With threads, I can store a global variable
>>>and have all threads see it instantly.  For message passing, as you are
>>
>>You can do that only if it is a volatile variable. In multiprocessing
>>you also can use shared global variables without problems. In fact
>>that's what i'm doing.
>
>They had better be volatile no matter which way you do the parallel search,
>or it is broken.
>
>
>>
>>>doing, the overhead is significant.  Thousands of instructions per message,
>>
>>I am nowhere using message passing in my engine.
>>
>>>which means you can't split the tree where you might one one processor to
>>>search a single node.  Threads don't have that overhead.  They have exactly
>>>none.
>>
>>I have no idea how you think i implemented this, but i'm simply sharing
>>the tree datastructure.
>>
>>Of course i need to approach that using a pointer, but that's exactly
>>how you do it.
>
>Then your point would be?  You claim your approach is "better" yet you have
>to pass a pointer around, which is exactly what I am doing.  And this is
>somehow "better"??
>
>
>
>>
>>The evaluation (slowest part of my program) i can use global arrays
>>without using a slow pointer which also needs to get passed to every
>>single function i use, which would also increase program size considerably
>>in all respects.
>
>I don't see how that makes it "better".  It does make it "bigger" since all
>your code and non-shared data is duplicated in memory N times, once for each
>process you start.



Actually this is horrible, if you are using EGTBs.  You lose the threaded
EGTB code that shares the LRU cache, and you duplicate the EGTB decompression
indices (20+ megabytes for 5 piece files) plus the EGTB cache plus all the EGTB
I/O that _can_ be shared among the threads if you use threads.  For EGTB stuff,
at least, using separate processes is awful.



>
>
>>
>>>
>>>
>>>
>>>
>>>>>I'm moving around data structures in my engine to consolodate things that are
>>>>>going to be needed on a per-cpu basis if/when I go parallel.
>>>>>
>>>>>One such structure is my move stack.  It's a big array of moves with a start and
>>>>>end index per ply.  So for example it might look like this:
>>>>>
>>>>>start[0] = 0  ...  end[0] = 32  [array entries 0..32 hold moves at ply 0]
>>>>>start[1] = 33 ...  end[1] = 60  [array entries 33..60 hold moves at ply 1]
>>>>>...
>>>>>
>>>>>Well if more than one thread is searching at once I will need more than one move
>>>>>stack and more than one ply counter.  So I kept the same move stack struct and
>>>>>made g_MoveStack an array:
>>>>>
>>>>>MOVE_STACK g_MoveStack[NUM_CPU];
>>>>>
>>>>>The code to access the move stack goes from this:
>>>>>
>>>>>g_MoveStack.iStart[g_iPly] = 0;
>>>>>
>>>>>to this:
>>>>>
>>>>>g_MoveStack[iCpuId].iStart[g_iPly[iCpuId]] = 0;
>>>>>
>>>>>Talk about a huge impact -- move move generator benchmark literally is 4x
>>>>>slower!  These dereferences are damn expensive.  There has to be a better way,
>>>>>can one of you assembly gurus give me a clue?
>>>>>
>>>>>Here is a solution I am thinking about -- have a struct per-thread that houses
>>>>>the ply and a pointer to the start of the right move stack entry.  Then do
>>>>>something like this:
>>>>>
>>>>>THREAD_INFO *pThreadInfo = &(g_ThreadInfo[iCurrentThreadId]);
>>>>>(pThreadInfo->pMoveStack)->iStart[sThreadInfo->iCpuId] = 0;
>>>>>
>>>>>I bet this is just as slow though... Any advice?
>>>>>
>>>>>Thanks,
>>>>>Scott
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.