Computer Chess Club Archives


Subject: Re: Thanks - absolutely convincing!

Author: Gerd Isenberg

Date: 12:33:35 03/31/04


>>>_chkstk() call is necessary if function allocates more than 4k on stack.
>>
>>I see - a page issue?
>
>The reason is the way Windows commits stack space (allocates physical
>memory). On program startup it reserves address space for your stack (1 MB by
>default), but commits much less (again, all of this is default behavior -- you
>can change it for your program). Next to the committed pages there is a guard
>page. If your program tries to access it there will be an interrupt, Windows
>will commit that page and mark the next one (with lower address) as the new
>guard page.
>
>So your programs will have a large stack by default, *and* the system will
>allocate only the necessary amount of physical memory. If there are 20
>processes running on your box, such a strategy will save almost 20 MB of RAM.
>
>As a result of this design, a program should not allocate more than 4k (on x86
>and AMD64) on the stack without touching intermediate pages first. If you try
>to access a not yet committed stack location that is too far from the current
>stack top you'll get an access violation.
>
>That is exactly what _chkstk() is doing -- it just "touches" intermediate pages
>if your function wants to allocate more than one page on stack.
>
>The performance impact of _chkstk() calls is very small, because the vast
>majority of functions have less than 4k of local variables. And if a function
>allocates more than 4k, several instructions inside _chkstk() would not be
>noticeable.
>[Actually we considered inlining _chkstk() when we are allocating only several
>pages, but decided against it, because there would be no observable performance
>gain on "normal" applications].
>

I see - makes sense.
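
For illustration, a minimal sketch of the probing idea described above, assuming 4 KB pages. `probe_offsets` is a hypothetical helper that only models which offsets below the stack top get touched; it is not the real _chkstk() and does not touch the stack itself:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model, not the real _chkstk(): lists the offsets (in bytes
// below the current stack top) that a probe would touch before the stack
// pointer is moved down by frame_bytes. page_bytes = 4096 is an assumption
// matching the x86/AMD64 page size mentioned above.
std::vector<std::size_t> probe_offsets(std::size_t frame_bytes,
                                       std::size_t page_bytes = 4096) {
    std::vector<std::size_t> offsets;
    if (frame_bytes <= page_bytes) {
        return offsets;  // at most one page: no intermediate pages to touch
    }
    // Touch one location per page, from the top down, so that no access
    // ever skips past the guard page.
    for (std::size_t off = page_bytes; off < frame_bytes; off += page_bytes) {
        offsets.push_back(off);
    }
    offsets.push_back(frame_bytes);  // finally the lowest address of the frame
    return offsets;
}
```

With these assumptions a 10000-byte frame is probed at offsets 4096, 8192 and 10000, while a 1000-byte frame needs no probe at all.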

>>>
>>>memset() call is faster than REP STOSQ. Trust me. BTW, the old version of the
>>>compiler would generate REP STOSQ.
>>
>>Yes, interesting. Curious about what is inside memset ;-)
>
>Nothing really interesting :-) Just a function that looks at the alignment and
>size of the block that you are filling, and uses different algorithms for large
>aligned blocks, large unaligned blocks, medium-sized blocks, small blocks, etc.
>

Ok, I wondered why an aligned and unconditional REP STOSQ isn't faster,
especially with a small count (e.g. < 32 qwords), where the call/ret overhead
becomes relatively more expensive. I remember the AMD64 optimization manual
about that issue...
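
The kind of dispatch described for memset might be sketched roughly like this. It is only a guess at the structure: the names `FillPath` and `choose_fill_path`, the 16-byte alignment check and the 32/512-byte thresholds are all assumptions for illustration, not what the actual library does:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical classification of a fill request; the real memset() may use
// different categories and cut-offs.
enum class FillPath { Small, Medium, LargeAligned, LargeUnaligned };

// Picks a fill strategy from the block's size and alignment.
// The thresholds (32 and 512 bytes) and the 16-byte alignment test are
// assumed values, chosen only to make the sketch concrete.
FillPath choose_fill_path(const void* dst, std::size_t bytes) {
    const bool aligned = reinterpret_cast<std::uintptr_t>(dst) % 16 == 0;
    if (bytes < 32) {
        return FillPath::Small;   // e.g. a few direct stores
    }
    if (bytes < 512) {
        return FillPath::Medium;  // e.g. an unrolled store loop
    }
    return aligned ? FillPath::LargeAligned : FillPath::LargeUnaligned;
}
```

The point is only that the decision depends on both size and alignment, so a single unconditional REP STOSQ cannot be the best choice for every case.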


>>>
>>>And here is your assembly:
>>
>>Wow - absolutely convincing!
>>
>>Nice that everything is inlined inside main, but the individual functions are
>>still instantiated or listed separately.
>>
>>One minor point I don't understand inside the general-purpose instantiation:
>>
>>updownAttacks<GPR>, COMDAT
>>...
>>; Line 222
>>        ...
>>	mov	QWORD PTR [rax-72], rbp
>>        ...
>>
>>; Line 224
>>	movaps	xmm0, XMMWORD PTR [rax-72]
>>	movdqa	XMMWORD PTR [rax-72], xmm0
>>
>>Some undocumented trick?
>
>No, just compiler stupidity :-) You are copying from "gu" to "gd":
>
>	T gd(gu);
>
>The compiler was intelligent enough to allocate both variables in the same
>stack location, but not intelligent enough to get rid of the move (probably
>because formally the types are different -- I did not look at the details
>yet). We cannot fix the issue prior to beta, but probably will fix it for the
>final release.

The final main inlining is rather free from such obstacles ;-)
And xmm- and gp-instructions are interlaced from two inlined functions.
That's really great!

>
>And there are some other places for which we can generate better code. You
>probably did not notice them, but I see inefficiencies...
>

Maybe better instruction scheduling by using a few more registers?
It should be possible with these two inlined Kogge-Stone functions to process
four directions in parallel (two (three) xmm and two gpr). Even inside one
direction, generator and propagator calculation may be interlaced.

OTOH using xmm8-xmm15 implies an additional prefix-byte, but for queens...
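
For readers who have not seen the generator/propagator scheme, here is a minimal sketch of a Kogge-Stone fill for a single direction on 64-bit general purpose registers. It assumes a bitboard with a1 = bit 0 and one rank = 8 bits; `north_occluded` and `north_attacks` are hypothetical names, not the updownAttacks code from the listing:

```cpp
#include <cstdint>

// Kogge-Stone occluded fill towards higher ranks (sketch).
// gen: generator set (the sliding pieces), pro: propagator set (empty squares).
// Assumes a1 = bit 0, so moving one rank up is a left shift by 8.
std::uint64_t north_occluded(std::uint64_t gen, std::uint64_t pro) {
    gen |= pro & (gen << 8);   // extend the fill by 1 rank
    pro &= pro << 8;           // propagator now spans 2 empty ranks
    gen |= pro & (gen << 16);  // extend by 2 more ranks
    pro &= pro << 16;          // propagator now spans 4 empty ranks
    gen |= pro & (gen << 32);  // extend by 4 more ranks
    return gen;
}

// Attack set: the fill shifted one more step, so the first blocker is included.
std::uint64_t north_attacks(std::uint64_t sliders, std::uint64_t empty) {
    return north_occluded(sliders, empty) << 8;
}
```

The generator updates and the propagator updates are largely independent of each other, which is what leaves room for interlacing them, or for running several directions side by side in different registers.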

Cheers,
Gerd

>Thanks,
>Eugene
>




Last modified: Thu, 15 Apr 21 08:11:13 -0700
