Computer Chess Club Archives



Subject: Re: about SEE of Crafty

Author: Robert Hyatt

Date: 06:52:39 01/07/04



On January 07, 2004 at 08:40:02, Vasik Rajlich wrote:

>On January 06, 2004 at 11:49:32, Robert Hyatt wrote:
>
>>On January 06, 2004 at 11:41:59, martin fierz wrote:
>>
>>>On January 06, 2004 at 11:35:42, Gerd Isenberg wrote:
>>>
>>>>On January 06, 2004 at 10:56:00, martin fierz wrote:
>>>>
>>>>>On January 06, 2004 at 09:40:33, Robert Hyatt wrote:
>>>>>
>>>>>>Maybe there is some confusion here.  There is _no_ cost for "allocating"
>>>>>>local variables in a function.  And I _do_ mean _NO_.  These values are
>>>>>>simply "on the stack" and when the function is entered, the stack pointer has
>>>>>>some constant subtracted from it to leave a "hole" that is used to hold local
>>>>>>function variables.  No malloc() or anything like that is done by modern
>>>>>>compilers, hence zero cost.
>>>>>
>>>>>hi bob,
>>>>>
>>>>>is this really true? i used to have a HUGE structure describing a move in the
>>>>>first version of my chess program (don't ask, please...). in my negamax function
>>>>>i had this
>>>>>
>>>>>int negamax(int alpha, int beta, int depth)
>>>>>   {
>>>>>   MOVE movelist[MAXMOVES];
>>>>>   ...other variables...
>>>>>
>>>>>   ...typical negamax code...
>>>>>
>>>>>   }
>>>>>
>>>>>i saw clear speed differences depending on the value i used for MAXMOVES - and
>>>>>in my tests, i made sure that using 100 would never overflow in the positions i
>>>>>was searching, then i set it to e.g. 200 and the program would get a few %
>>>>>slower, although these additional 100 array elements were never used.
>>>>>
>>>>>how can this be explained? i'm not memsetting the array to 0 or anything.
>>>>
>>>>Hi Martin,
>>>>
>>>>that was my point in my reply to Bob - instruction length.
>>>>
>>>>Did you inspect the assembly,
>>>
>>>hi gerd,
>>>
>>>for heaven's sake no! to me, __asm is about equivalent to chinese, i'm not an
>>>expert like you :-)
>>>
>>>>especially the bp-relative offsets of the other
>>>>locals' loads/stores? In C, by default all locals of a function live on the
>>>>stack and, together with the actual parameters, are addressed relative to the
>>>>base pointer register. As long as those offsets are in the signed byte range
>>>>(-128..+127), with X86-32 only one byte (sign extended) is necessary;
>>>>otherwise the complete 32-bit offset is needed.
>>>
>>>...and this sounds a bit like chinese too! forgive me for my stupidity in this
>>>area! are you saying that if the storage of my locals exceeds X bytes (X=256?),
>>>then my function will get slower? and that the execution speed of my function
>>>would depend on the storage amount in a step-like way: locals need less than X
>>>bytes => fast; else => slow?
>>>because that is not what i saw. i didn't measure exactly, but the larger i made
>>>MAXMOVES, the slower the program got.
>>
>>Once you go beyond 128 bytes, you get into the problem Gerd mentioned.  The
>>problem is that we need to use register + displacement + index.  The
>>displacement can be 1 byte or 4, which makes the instruction longer, and hence
>>slower.  I had not considered that and it _does_ make a difference.  But once
>>you blow past 128 bytes, it should not get worse, and the difference is not
>>going to be huge (i.e. here "huge" means about 1%) anyway...
>>
>>But as I mentioned previously, you are pushing the stack size out, and that
>>will have perhaps unexpected effects on cache.  A global will always end up
>>in the same place in memory and take up one chunk of cache.  Locals will
>>stack up if you are talking recursion, and take up multiple chunks of cache,
>>displacing something else.
>>
>
>Is there some good way to estimate the cost due to cache misses of adding data
>or code? It seems to me that a very well-designed engine could more or less
>"stay in the cache", or keep some important kernel in the cache, while adding
>some large structure to this could in the worst case start a sort of
>cache-cycling, where there are almost no hits.



There are a few basic rules:

(1) smaller is better;

(2) temporal locality.  Variables that are referenced close together in time
should also be close together in memory, so that they land in the same cache
line (a small layout sketch follows this list).

(3) cache-memory aliasing is an issue that most O/S memory managers seem to
ignore, where multiple physical pages of a process end up direct-mapping to
the same cache lines, causing thrashing...  There is little you can do about
this, however, from a programming point of view.  Every time you run the
program you get a different aliasing, unless the O/S uses an approach known as
"page coloring" to choose physical pages so that they map onto the cache as
evenly as possible.

(4) scattered (random) addresses hurt, because each TLB miss adds a penalty of
1-2 extra memory-latency delays to the access time.

(5) sequential memory referencing is faster, due to the pre-fetching effect of
long cache lines.

There are many others, but in many cases it really helps to know the actual
target machine.  The above are pretty general rules.
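
To make rules (1) and (2) a bit more concrete, here is a minimal sketch.  The
struct and field names are invented for illustration (they are not taken from
Crafty or any other engine); the idea is simply to pack the fields the inner
search loop touches at every node into one 64-byte cache line, and to keep
rarely used data in a separate "cold" structure:

   /* hot data: referenced at every node; roughly 28 bytes with 32-bit ints,
      so it fits comfortably inside a single 64-byte cache line            */
   struct hot_node {
       int      alpha, beta;
       int      depth, ply;
       unsigned hash_move;
       unsigned killers[2];
   };

   /* cold data: referenced only occasionally, kept off the hot cache line */
   struct cold_node {
       char     pv_line[64];
       unsigned stats[16];
   };

Whether a split like this actually pays off depends on the access pattern and,
as noted above, on the actual target machine.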

>
>Probably it's better not to worry too much about this, but maybe there are some
>rules of thumb regarding the tradeoffs between adding some data structure or
>specialized routine and the # of instructions saved.

As a general rule, the number of instructions executed is less important than
the number of cache-line misses.  You can execute several hundred instructions
in the time it takes to fetch a memory word that is not currently in cache.
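
As a rough illustration (the numbers below are assumptions for the sake of the
arithmetic, not figures from this thread): with about 100 ns of main-memory
latency, a 3 GHz clock, and up to roughly 3 instructions retired per cycle,

   100 ns  x  3 cycles/ns  x  3 instructions/cycle  =  ~900 instructions

could in principle have been executed in the time a single cache miss takes to
resolve - which is the order of magnitude behind "several hundred
instructions".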

>
>Vas
>
>>
>>>
>>>cheers
>>>  martin
>>>
>>>>
>>>>Gerd
>>>>
>>>>>
>>>>>like uri, i thought the program was allocating memory every time it entered the
>>>>>function, and that that was taking time. i ended up making my MOVE variable much
>>>>>smaller - although it's still much larger than crafty's :-)
>>>>>
>>>>>i thought perhaps i should allocate the movelist beforehand as a global, like
>>>>>this:
>>>>>
>>>>>MOVE *movelist[MAX_SEARCH_DEPTH];
>>>>>for(i=0;i<MAX_SEARCH_DEPTH;i++)
>>>>>   movelist[i] = malloc(sizeof(MOVE)*MAXMOVES);
>>>>>
>>>>>and then in the negamax function use the appropriate movelist for that search
>>>>>depth. that turned out to be significantly slower than the local variable
>>>>>approach. again, i don't understand why that would be... - anyone?
>>>>>
>>>>>cheers
>>>>>  martin


