Computer Chess Club Archives



Subject: Re: Thanks - absolutely convincing!

Author: Eugene Nalimov

Date: 12:04:17 03/31/04



On March 31, 2004 at 13:43:03, Gerd Isenberg wrote:

>On March 31, 2004 at 12:57:09, Eugene Nalimov wrote:
>
>>On March 31, 2004 at 04:15:21, Gerd Isenberg wrote:
>>
>>>Looks fine ;-)
>>>
>>>Curious about call __chkstk in isDeBruijnN, but not in the recursive function
>>>genDeBruijn. Yes, isDeBruijnN has a local 4KByte array on the frame, and has to
>>>clear it too, so under runtime considerations call __chkstk doesn't matter much.
>>>Instead of call memset I would prefer an inlined intrinsic of that, e.g. with a
>>>8-byte aligned bool array and REP STOSQ with rcx=4096/8. I guess there are some
>>>additional compiler flags...
>>>
>>>If you have some additional time, it would be nice to see the assembly of a
>>>kogge-stone filler with a bit more register pressure:
>>>
>>>Thanks again,
>>>Gerd
>>
>>_chkstk() call is necessary if function allocates more than 4k on stack.
>
>I see - a page issue?

The reason is the way Windows commits stack space (allocates physical
memory). On program startup it reserves address space for your stack (1Mb by
default), but commits much less (again, all of this is default behavior -- you
can change it for your program). Next to the committed pages there is a guard
page. If your program tries to access it there will be an exception; Windows
will commit that page and mark the next one (with lower address) as the new
guard page.

So your programs will have a large stack by default, *and* the system will
allocate only the necessary amount of physical memory. If there are 20 processes
running on your box, such a strategy will save almost 20Mb of RAM.

As a result of this design, a program should not allocate more than 4k (on x86
and AMD64) on the stack without touching the intermediate pages first. If you
try to access a not-yet-committed stack location that is too far from the
current stack top, you'll get an access violation.

That is exactly what _chkstk() is doing -- it just "touches" intermediate pages
if your function wants to allocate more than one page on stack.
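
The guard-page mechanism described above can be illustrated with a toy model. This is only a sketch, not the Windows implementation or the real _chkstk() code: StackModel, touch() and probe_alloc() are hypothetical names, and pages are simply numbered from the top of the stack downward. It shows why jumping straight past the guard page fails while touching the pages one by one succeeds.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of a guard-page stack (hypothetical, for illustration only).
// Page 0 is the top of the stack; larger indices mean lower addresses.
struct StackModel {
    std::size_t reserved_pages;  // reserved address space, in pages
    std::size_t guard_page;      // pages with a smaller index are committed

    // Returns true if the access succeeds, false for an "access violation".
    bool touch(std::size_t page) {
        if (page < guard_page) return true;      // already committed
        if (page == guard_page && page + 1 < reserved_pages) {
            ++guard_page;                        // commit it, move the guard lower
            return true;
        }
        return false;                            // past the guard page
    }
};

// Conceptually what a stack probe does: touch every page between the
// current stack top and the new one, in order, so the guard page is
// always hit before anything beyond it.
bool probe_alloc(StackModel& s, std::size_t top_page, std::size_t pages_needed) {
    assert(pages_needed > 0);
    for (std::size_t p = top_page + 1; p <= top_page + pages_needed; ++p) {
        if (!s.touch(p)) return false;
    }
    return true;
}
```

With two committed pages and the guard at page 2, touching page 5 directly fails in this model, whereas probe_alloc(s, 1, 4) walks pages 2 through 5 and leaves the guard at page 6.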

The performance impact of _chkstk() calls is very small, because the vast
majority of functions have less than 4k of local variables. And if a function
allocates more than 4k, the several instructions inside _chkstk() would not be
noticeable. [Actually we considered inlining _chkstk() when we are allocating
only several pages, but decided against it, because there would be no observable
performance gain on "normal" applications].

>>
>>memset() call is faster than REP STOSQ. Trust me. BTW, the old version of the
>>compiler would generate REP STOSQ.
>
>Yes, interesting. Curious about what is inside memset ;-)

Nothing really interesting :-) The function just looks at the alignment and
size of the block that you are filling, and uses different algorithms for large
aligned blocks, large unaligned blocks, medium-sized blocks, small blocks, etc.
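
To make the idea of dispatching on size and alignment concrete, here is a minimal sketch. It is not the actual CRT memset() source: sketch_fill is a hypothetical name, the 16-byte threshold is an arbitrary assumption, and a real implementation would have more cases and wider stores.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical fill routine: picks a strategy from block size and alignment.
void* sketch_fill(void* dst, int value, std::size_t n) {
    unsigned char* p = static_cast<unsigned char*>(dst);
    const unsigned char byte = static_cast<unsigned char>(value);

    // Small blocks: a plain byte loop, no setup cost.
    if (n < 16) {
        for (std::size_t i = 0; i < n; ++i) p[i] = byte;
        return dst;
    }

    // Unaligned start: fill single bytes until p is 8-byte aligned.
    // n >= 16 here, so the at most 7 head bytes cannot exhaust n.
    while (reinterpret_cast<std::uintptr_t>(p) % 8 != 0) {
        *p++ = byte;
        --n;
    }

    // Aligned body: replicate the byte into a 64-bit word and store 8 at a time.
    const std::uint64_t word = 0x0101010101010101ULL * byte;
    for (; n >= 8; n -= 8, p += 8) {
        std::memcpy(p, &word, sizeof word);  // compiles to a single 8-byte store
    }

    // Tail: whatever is left after the last full word.
    for (; n > 0; --n) *p++ = byte;
    return dst;
}
```

The result should match memset() for any offset and length; the only point is that the path taken depends on the block's size and alignment.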

>>
>>And here is your assembly:
>
>Wow - absolutely convincing!
>
>Nice that all is inlined inside main, but the single functions are incarnated or
>listed separately.
>
>One minor point i don't understand inside the general purpose incarnation:
>
>updownAttacks<GPR>, COMDAT
>...
>; Line 222
>        ...
>	mov	QWORD PTR [rax-72], rbp
>        ...
>
>; Line 224
>	movaps	xmm0, XMMWORD PTR [rax-72]
>	movdqa	XMMWORD PTR [rax-72], xmm0
>
>Some undocumented trick?

No, just compiler stupidity :-) You are copying from "gu" to "gd":

	T gd(gu);

The compiler was intelligent enough to allocate both variables in the same stack
location, but not intelligent enough to get rid of the move (probably because
formally the types are different -- I did not look at the details yet). We
cannot fix the issue prior to beta, but probably will fix it for the final
release.

And there are some other places for which we can generate better code. You
probably did not notice them, but I see inefficiencies...

Thanks,
Eugene

>To load xmm0 as packed single float and to store it back to the same address but
>as packed int. Doesn't that imply some type penalty cycles, despite the fact
>that the load/store seems not necessary at all and may even introduce a 128-bit
>load stall due to the previous 64-bit store?
>
>Thanks,
>Gerd
>
><code snipped>




Last modified: Thu, 15 Apr 21 08:11:13 -0700
