Computer Chess Club Archives


Subject: Re: Thanks - absolutely convincing!

Author: Eugene Nalimov

Date: 14:20:22 03/31/04


On March 31, 2004 at 15:33:35, Gerd Isenberg wrote:

>>>>_chkstk() call is necessary if function allocates more than 4k on stack.
>>>
>>>I see - a page issue?
>>
>>The reason is the way Windows commits stack space (allocates physical
>>memory). On program startup it reserves address space for your stack (1 MB by
>>default), but commits much less (again, all of this is default behavior -- you
>>can change it for your program). Next to the committed pages there is a guard
>>page. If your program tries to access it there will be an interrupt; Windows
>>will commit that page and mark the next one (at a lower address) as the new
>>guard page.
>>
>>So your programs will have a large stack by default, *and* the system will
>>allocate only the necessary amount of physical memory. If there are 20
>>processes running on your box, such a strategy will save almost 20 MB of RAM.
>>
>>As a result of this design, a program should not allocate more than 4k (on x86
>>and AMD64) on the stack without touching the intermediate pages first. If you
>>try to access a not-yet-committed stack location that is too far from the
>>current stack top, you'll get an access violation.
>>
>>That is exactly what _chkstk() does -- it just "touches" the intermediate
>>pages if your function wants to allocate more than one page on the stack.
>>
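
[Editor's sketch] The page-touching described above can be illustrated with a small simulation. This is not the actual _chkstk() implementation; probe_offsets is a hypothetical helper, and the 4096-byte page size is simply the x86/AMD64 figure quoted in the post. It only lists which offsets below the current stack top a page-by-page probe would touch, assuming one probe per page plus one at the far end of the frame.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of any real runtime): returns the offsets, in
// bytes below the current stack top, that a page-by-page probe would touch
// before a frame of `frame_bytes` is used.
std::vector<std::size_t> probe_offsets(std::size_t frame_bytes,
                                       std::size_t page_bytes = 4096) {
    assert(page_bytes > 0);
    std::vector<std::size_t> touched;
    // Walk downward one page at a time, so no access ever lands more than one
    // page below the previously touched location.
    for (std::size_t off = page_bytes; off < frame_bytes; off += page_bytes) {
        touched.push_back(off);
    }
    // Finally touch the far end of the frame itself.
    if (frame_bytes > 0) {
        touched.push_back(frame_bytes);
    }
    return touched;
}
```

For a 10000-byte frame this yields three probes (4096, 8192, 10000); for a frame smaller than one page it yields a single probe, which matches the point that small frames need no special handling.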
>>The performance impact of _chkstk() calls is very small, because the vast
>>majority of functions have less than 4k of local variables. And if a function
>>allocates more than 4k, the several instructions inside _chkstk() will not be
>>noticeable. [Actually we considered inlining _chkstk() when we are allocating
>>only several pages, but decided against it, because there would be no
>>observable performance gain on "normal" applications].
>>
>
>I see - makes sense.
>
>>>>
>>>>memset() call is faster than REP STOSQ. Trust me. BTW, the old version of the
>>>>compiler would generate REP STOSQ.
>>>
>>>Yes, interesting. Curious about what is inside memset ;-)
>>
>>Nothing really interesting :-) Just a function that looks at the alignment and
>>size of the block that you are filling, and uses different algorithms for large
>>aligned blocks, large unaligned blocks, medium-sized blocks, small blocks, etc.
>>
>
>OK, I wondered why an aligned, unconditional REP STOSQ isn't faster,
>especially with a small count (e.g. < 32 qwords), where the call/ret overhead
>becomes relatively more expensive. I remember the AMD64 optimization manual
>about that issue...

REP uses a lot of cycles just to start. It's better to use a plain sequence of
MOVs for small counts.
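
[Editor's sketch] A minimal illustration of that trade-off, assuming a hypothetical cutoff of 8 qwords (kSmallQwords is an invented name and value, not a number from the compiler or the AMD64 manual). Small counts use a short loop of plain stores that a compiler can unroll into MOVs; anything larger is handed to memset(), which picks its own strategy by size and alignment. An optimizer is free to rewrite either branch, so this shows the idea rather than guaranteed code generation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical cutoff; a real value would have to come from measurement.
constexpr std::size_t kSmallQwords = 8;

// Zero `count` 64-bit words starting at `dst`.
void zero_qwords(std::uint64_t* dst, std::size_t count) {
    assert(dst != nullptr || count == 0);
    if (count <= kSmallQwords) {
        // Short loop of plain stores: no string-instruction startup cost.
        for (std::size_t i = 0; i < count; ++i) {
            dst[i] = 0;
        }
    } else {
        // Larger blocks: let the library routine choose the algorithm.
        std::memset(dst, 0, count * sizeof(std::uint64_t));
    }
}
```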

>>>>
>>>>And here is your assembly:
>>>
>>>Wow - absolutely convincing!
>>>
>>>Nice that everything is inlined inside main, though the individual functions
>>>are also instantiated and listed separately.
>>>
>>>One minor point I don't understand inside the general-purpose instantiation:
>>>
>>>updownAttacks<GPR>, COMDAT
>>>...
>>>; Line 222
>>>        ...
>>>	mov	QWORD PTR [rax-72], rbp
>>>        ...
>>>
>>>; Line 224
>>>	movaps	xmm0, XMMWORD PTR [rax-72]
>>>	movdqa	XMMWORD PTR [rax-72], xmm0
>>>
>>>Some undocumented trick?
>>
>>No, just compiler stupidity :-) You are copying from "gu" to "gd":
>>
>>	T gd(gu);
>>
>>The compiler was intelligent enough to allocate both variables in the same
>>stack location, but not intelligent enough to get rid of the move (probably
>>because formally the types are different -- I did not look at the details
>>yet). We cannot fix the issue prior to beta, but will probably fix it for the
>>final release.
>
>The final main inlining is rather free from such obstacles ;-)
>And xmm- and gp-instructions are interlaced from two inlined functions.
>That's really great!
>
>>
>>And there are some other places for which we can generate better code. You
>>probably did not notice them, but I see inefficiencies...
>>
>
>Maybe better instruction scheduling by using a few more registers?
>It should be possible with these two inlined Kogge-Stone functions to process
>four directions in parallel (two (three) xmm and two gpr). Even inside one
>direction, generator and propagator calculation may be interlaced.
>
>OTOH using xmm8-xmm15 implies an additional prefix-byte, but for queens...

Don't forget about out-of-order execution and register renaming. We do some
scheduling when it costs nothing, but rather than use extra architectural
registers and longer instructions it's (probably) better to rely on the hardware
to do the trick.
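
[Editor's sketch] For readers unfamiliar with the Kogge-Stone fills discussed in the quoted text, here is the textbook form for one direction. It is not the updownAttacks code from this thread; the names northOccludedFill and northAttacks are hypothetical, and the bit layout (a1 = bit 0, one rank north = shift left by 8) is an assumption. The generator (gen) and propagator (pro) updates within one step do not depend on each other, and separate directions form separate dependency chains, which is the kind of work out-of-order hardware can overlap without hand scheduling.

```cpp
#include <cassert>
#include <cstdint>

using Bitboard = std::uint64_t;

// Kogge-Stone occluded fill toward the north.
// gen: squares holding the sliding pieces; pro: empty squares.
Bitboard northOccludedFill(Bitboard gen, Bitboard pro) {
    gen |= pro & (gen << 8);   // extend by 1 rank through empty squares
    pro &= pro << 8;           // propagator now spans 2 empty ranks
    gen |= pro & (gen << 16);  // extend by up to 2 more ranks
    pro &= pro << 16;          // propagator now spans 4 empty ranks
    gen |= pro & (gen << 32);  // extend by up to 4 more ranks
    return gen;
}

// Attacked squares north of the sliders, including the first blocker.
Bitboard northAttacks(Bitboard sliders, Bitboard empty) {
    assert((sliders & empty) == 0);  // a slider's own square is not empty
    return northOccludedFill(sliders, empty) << 8;
}
```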

Thanks,
Eugene

>Cheers,
>Gerd
>
>>Thanks,
>>Eugene
>>



Last modified: Thu, 15 Apr 21 08:11:13 -0700
