Author: Eugene Nalimov
Date: 12:04:17 03/31/04
On March 31, 2004 at 13:43:03, Gerd Isenberg wrote:

>On March 31, 2004 at 12:57:09, Eugene Nalimov wrote:
>
>>On March 31, 2004 at 04:15:21, Gerd Isenberg wrote:
>>
>>>Looks fine ;-)
>>>
>>>Curious about call __chkstk in isDeBruijnN, but not in the recursive function
>>>genDeBruijn. Yes, isDeBruijnN has a local 4KByte array on the frame, and has to
>>>clear it too, so under runtime considerations call __chkstk doesn't matter much.
>>>Instead of call memset i would prefere an inlined intrinsic of that e.g. with a
>>>8-byte aligned bool array and REP STOSQ with rcx=4096/8. I guess there are some
>>>additional compiler flags...
>>>
>>>If you have some additional time, it would be nice to see the assembly of a
>>>kogge-stone filler with a bit more register pressure:
>>>
>>>Thanks again,
>>>Gerd
>>
>>_chkstk() call is necessary if function allocates more than 4k on stack.
>
>I see - a page issue?

The reason is the way Windows commits stack space (allocates physical memory). On program startup it reserves address space for your stack (1MB by default), but commits much less (again, all of this is default behavior -- you can change it for your program). Next to the committed pages there is a guard page. If your program tries to access it there will be an interrupt, Windows will commit that page and mark the next one (with lower address) as the new guard page. So your programs will have a large stack by default, *and* the system will allocate only the necessary amount of physical memory. If there are 20 processes running on your box, such a strategy will save almost 20MB of RAM.

As a result of this design a program should not allocate more than 4k (on x86 and AMD64) on the stack without touching the intermediate pages first. If you try to access a not yet committed stack location that is too far from the current stack top you'll get an access violation. That is exactly what _chkstk() is doing -- it just "touches" the intermediate pages if your function wants to allocate more than one page on the stack.
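The guard-page behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyStack`, `touch`, `allocateWithProbe` and `allocateWithoutProbe` are made-up names, the stack is simulated with plain counters, and nothing here is the actual implementation of _chkstk() or of the Windows memory manager. It only shows why touching one location per page succeeds where a single far access fails.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Page size on x86 and AMD64.
constexpr std::size_t kPageSize = 4096;

// Toy model of a downward-growing stack (hypothetical, for illustration only).
// Page 0 is the page at the stack base; higher indices mean lower addresses.
class ToyStack {
public:
    ToyStack(std::size_t reservedPages, std::size_t committedPages)
        : reserved_(reservedPages), guard_(committedPages) {
        assert(committedPages <= reservedPages);
    }

    // Access the byte located 'depth' bytes below the stack base.
    void touch(std::size_t depth) {
        const std::size_t page = depth / kPageSize;
        if (page < guard_) {
            return;                 // page is already committed
        }
        if (page == guard_ && page < reserved_) {
            ++guard_;               // guard page hit: commit it, guard moves down
            return;
        }
        throw std::runtime_error("access violation");
    }

    std::size_t committedPages() const { return guard_; }

private:
    std::size_t reserved_;  // pages of reserved address space
    std::size_t guard_;     // index of the guard page (= committed page count)
};

// Allocate 'bytes' below 'top' and use only the far end of the block.
void allocateWithoutProbe(ToyStack& s, std::size_t top, std::size_t bytes) {
    s.touch(top + bytes - 1);
}

// Same allocation, but touch one location in every intermediate page first,
// which is the idea behind a stack probe.
void allocateWithProbe(ToyStack& s, std::size_t top, std::size_t bytes) {
    const std::size_t newTop = top + bytes;
    for (std::size_t d = top; d < newTop; d += kPageSize) {
        s.touch(d);
    }
    s.touch(newTop - 1);
}
```

With 256 reserved pages (1MB) and 2 committed pages, `allocateWithoutProbe(s, 100, 5 * kPageSize)` throws in this model because the access lands beyond the guard page, while `allocateWithProbe` with the same arguments moves the guard page down one step at a time and succeeds.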

The performance impact of _chkstk() calls is very small, because the vast majority of functions have less than 4k of local variables. And if a function allocates more than 4k, several instructions inside _chkstk() would not be noticeable. [Actually we considered inlining _chkstk() when we are allocating only several pages, but decided against it, because there would be no observable performance gain on "normal" applications].

>>
>>memset() call is faster than REP STOSQ. Trust me. BTW, the old version of the
>>compiler would generate REP STOSQ.
>
>Yes, interesting. Curious about what is inside memset ;-)

Nothing really interesting :-) The function just looks at the alignment and size of the block that you are filling, and uses different algorithms for large aligned blocks, large unaligned blocks, medium-sized blocks, small blocks, etc.

>>
>>And here is your assembly:
>
>Wow - absolutely convincing!
>
>Nice that all is inlined inside main, but the single functions are incarnated or
>listed separately.
>
>One minor point i don't understand inside the general purpose incarnation:
>
>updownAttacks<GPR>, COMDAT
>...
>; Line 222
> ...
> mov QWORD PTR [rax-72], rbp
> ...
>
>; Line 224
> movaps xmm0, XMMWORD PTR [rax-72]
> movdqa XMMWORD PTR [rax-72], xmm0
>
>Some undocumented trick?

No, just compiler stupidity :-) You are copying from "gu" to "gd":

T gd(gu);

The compiler was intelligent enough to allocate both variables in the same stack location, but not intelligent enough to get rid of the move (probably because formally the types are different -- I did not look at the details yet). We cannot fix the issue prior to beta, but probably will fix it for the final release.

And there are some other places for which we can generate better code. You probably did not notice them, but I see inefficiencies...

Thanks,
Eugene

>To load xmm0 as packed single float and to store it back to the same address but
>as packed int. Doesn't that imply some type penalty cycles, despite the fact
>that the load/store seems not necessary at all and may even introduce a 128-bit
>load stall due to the previous 64-bit store?
>
>Thanks,
>Gerd
>
><code snipped>
Last modified: Thu, 15 Apr 21 08:11:13 -0700