Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: How to make your movegen 4x slower in 1 easy step

Author: Miguel A. Ballicora

Date: 12:23:46 08/30/01

Go up one level in this thread


On August 30, 2001 at 13:56:53, Scott Gasch wrote:

>I'm moving around data structures in my engine to consolodate things that are
>going to be needed on a per-cpu basis if/when I go parallel.
>
>One such structure is my move stack.  It's a big array of moves with a start and
>end index per ply.  So for example it might look like this:
>
>start[0] = 0  ...  end[0] = 32  [array entries 0..32 hold moves at ply 0]
>start[1] = 33 ...  end[1] = 60  [array entries 33..60 hold moves at ply 1]
>...
>
>Well if more than one thread is searching at once I will need more than one move
>stack and more than one ply counter.  So I kept the same move stack struct and
>made g_MoveStack an array:
>
>MOVE_STACK g_MoveStack[NUM_CPU];
>
>The code to access the move stack goes from this:
>
>g_MoveStack.iStart[g_iPly] = 0;
>
>to this:
>
>g_MoveStack[iCpuId].iStart[g_iPly[iCpuId]] = 0;
>
>Talk about a huge impact -- move move generator benchmark literally is 4x
>slower!  These dereferences are damn expensive.  There has to be a better way,
>can one of you assembly gurus give me a clue?
>
>Here is a solution I am thinking about -- have a struct per-thread that houses
>the ply and a pointer to the start of the right move stack entry.  Then do
>something like this:
>
>THREAD_INFO *pThreadInfo = &(g_ThreadInfo[iCurrentThreadId]);
>(pThreadInfo->pMoveStack)->iStart[sThreadInfo->iCpuId] = 0;
>
>I bet this is just as slow though... Any advice?
>
>Thanks,
>Scott

My program is neither paralel nor fast, so I do not know if this if of any help
but you might even suggest something to me :-)
just in case this is what I do:

I do not have a move stack as you describe. I "allocate" dynamically memory
each time I generate moves and free that memory when I am done. Since malloc
and free are supposed to be slow and prone to bugs (using them incorrectly as I
probably would :-) I made my own "allocator".
However, the allocator is internally organized as a stack (So basically we are
doing the same). So, I think it does not help you but a different structure of
allocation can be organized to suit these needs. For instance, I . When I did
this I tried to hide as much details I could so I could use pointers with no
"readability
penalty" for me.

Another thing that I believe I do different is that I keep the ply counter being
passed as an parameter of search() rather than a global variable. No rationale
about this, I just tried to have no global variables in the program
or keep them to a minimum. This migh help you a tiny bit because you do not have
to fetch what is the plycounter for the thread you are in.

So you do
g_MoveStack[iCpuId].iStart[g_iPly] = 0;

rather than
g_MoveStack[iCpuId].iStart[g_iPly[iCpuId]] = 0;

Anyway, does not 4x sound too much? I mean, sounds like something else
is going on. What size is your stack, are not you thrashing cache problems?
What if you change the size of your stack? does it change?

Regards,
Miguel




This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.