Computer Chess Club Archives


Search

Terms

Messages

Subject: Why multiprocessing is hell faster than multithreading

Author: Vincent Diepeveen

Date: 03:59:01 09/01/01

Go up one level in this thread


On August 31, 2001 at 10:18:13, Robert Hyatt wrote:

>On August 31, 2001 at 09:03:28, Vincent Diepeveen wrote:
>
>>On August 31, 2001 at 00:20:28, Robert Hyatt wrote:
>>
>>>On August 30, 2001 at 20:45:07, Vincent Diepeveen wrote:
>>>
>>>>On August 30, 2001 at 13:56:53, Scott Gasch wrote:
[snip]
>>You can do that only if it is a volatile variable. In multiprocessing
>>you also can use shared global variables without problems. In fact
>>that's what i'm doing.
>
>They had better be volatile no matter which way you do the parallel search,
>or it is broken.

Exactly.

>
>>>doing, the overhead is significant.  Thousands of instructions per message,
>>
>>I am nowhere using message passing in my engine.
>>
>>>which means you can't split the tree where you might one one processor to
>>>search a single node.  Threads don't have that overhead.  They have exactly
>>>none.
>>
>>I have no idea how you think i implemented this, but i'm simply sharing
>>the tree datastructure.
>>
>>Of course i need to approach that using a pointer, but that's exactly
>>how you do it.
>
>Then your point would be?  You claim your approach is "better" yet you have
>to pass a pointer around, which is exactly what I am doing.  And this is
>somehow "better"??

Because everything which is nonsearch is not needing an extra pointer.

In short if i evaluate a pattern

int EvalRooksInEndgameBecauseOfA(void) {
  int s=0;
  if( board[sq_a2] == rook && board[sq_a3] == pawn )
    s += blabla;
  return s;
}

In crafty this would be
int EvalRooksInEndgameBecauseOfA(boarddatastructureforthread *b) {
  int s=0;
  if( b->board[sq_a2] == rook && b->board[sq_a3] == pawn )
    s += blabla;
  return s;
}

I do not need the SLOW indirection 'b->' pointer!

Where you can share data i can do that too using shared memory (EGTBs too).
So where we in search both use an extra pointer, in evaluation and
everything that is using in fact data which is for each thread different,
there i can use global data and you need an extra pointer.

Dissassemble and get horrified what your pointers need for number of
extra (unnecessary) clocks!

>
>
>>
>>The evaluation (slowest part of my program) i can use global arrays
>>without using a slow pointer which also needs to get passed to every
>>single function i use, which would also increase program size considerably
>>in all respects.
>
>I don't see how that makes it "better".  It does make it "bigger" since all
>your code and non-shared data is duplicated in memory N times, once for each
>process you start.

That doesn't matter at all Bob if you keep into mind how L2 cache works,
it only delivers to the cpu what it asks for, it has nothing to do
with what the other L2 cache is storing for a different search process!

Best regards,
Vincent

>
>>
>>>
>>>
>>>
>>>
>>>>>I'm moving around data structures in my engine to consolodate things that are
>>>>>going to be needed on a per-cpu basis if/when I go parallel.
>>>>>
>>>>>One such structure is my move stack.  It's a big array of moves with a start and
>>>>>end index per ply.  So for example it might look like this:
>>>>>
>>>>>start[0] = 0  ...  end[0] = 32  [array entries 0..32 hold moves at ply 0]
>>>>>start[1] = 33 ...  end[1] = 60  [array entries 33..60 hold moves at ply 1]
>>>>>...
>>>>>
>>>>>Well if more than one thread is searching at once I will need more than one move
>>>>>stack and more than one ply counter.  So I kept the same move stack struct and
>>>>>made g_MoveStack an array:
>>>>>
>>>>>MOVE_STACK g_MoveStack[NUM_CPU];
>>>>>
>>>>>The code to access the move stack goes from this:
>>>>>
>>>>>g_MoveStack.iStart[g_iPly] = 0;
>>>>>
>>>>>to this:
>>>>>
>>>>>g_MoveStack[iCpuId].iStart[g_iPly[iCpuId]] = 0;
>>>>>
>>>>>Talk about a huge impact -- move move generator benchmark literally is 4x
>>>>>slower!  These dereferences are damn expensive.  There has to be a better way,
>>>>>can one of you assembly gurus give me a clue?
>>>>>
>>>>>Here is a solution I am thinking about -- have a struct per-thread that houses
>>>>>the ply and a pointer to the start of the right move stack entry.  Then do
>>>>>something like this:
>>>>>
>>>>>THREAD_INFO *pThreadInfo = &(g_ThreadInfo[iCurrentThreadId]);
>>>>>(pThreadInfo->pMoveStack)->iStart[sThreadInfo->iCpuId] = 0;
>>>>>
>>>>>I bet this is just as slow though... Any advice?
>>>>>
>>>>>Thanks,
>>>>>Scott



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.