Author: Robert Hyatt
Date: 06:43:25 12/04/98
Go up one level in this thread
On December 04, 1998 at 02:45:21, Frank Schneider wrote: >On December 03, 1998 at 00:10:54, Robert Hyatt wrote: > >>On December 02, 1998 at 22:27:27, Bruce Moreland wrote: >> >>> >>>On December 02, 1998 at 14:30:59, Tom Kerrigan wrote: >>> >>>>You're right, if other people are running faster with chars, then the statement >>>>I made was wrong. >>>> >>>>I tested this with my own program, and moving from chars to ints for a few >>>>critical arrays sped up the program by more than 5%. Maybe I'm the exception to >>>>the rule. >>> >>>Could be, that's data too. Another argument for trying it. >>> >>>I think it is possible that either could be better in specific situations. >>> >>>The chars are smaller which may have some influence on cache behavior, and the >>>ints are bite-sized as far as the processor is concerned, so it doesn't have to >>>mess around with clearing the top-most bits. >>> >>>bruce >> >> >>There's two other explanations too. >> >>(1) changing the size of the board caused something *else* to become better >>aligned on a cache boundary. IE the famous case where you can delete an unused >>variable declaration and your program slows down or speeds up. >> >Just one example: >I once removed a FIXED16 from a struct (about >100byes large) and the >program sped up by 5%. > >>(2) the famous direct-mapped L2 cache problem where even the same program can >>vary in NPS significantly due to the real memory pages it occupies not mapping >>into the L2 cache very efficiently sometimes... >> >>Both are a pain. > >I also made another observation, but I'm not sure if it is nonsense. >The speed of the program seems to depend on the ordering of the list >of objectfiles the linker gets. Could that be a cache-prefetch related >effect? > >Frank Indirectly. L2 cache is direct-mapped. To optimally fill L2, you need the executable in consecutive pages of physical ram, or at least in consecutive addresses as far as the low-order log2(L2_size) bits of the addressed memory are concerned. IE if have a new xeon with 1024K of L2 cache, the low-order 20 bits of the physical addresses are used to map 32-byte lines into the L2 cache. For max performance, you want *the* 1024K of your program that is used most often to be located in memory addresses that are "consecutive" in the lower 20 bits of addresses. If you carefully order the objects (I do, check the Make file for Crafty) you can get the "engine" to fit into this scheme, and let the infrequently used stuff go elsewhere. Then you can get the entire engine to fit into L2 cache. That said, it doesn't work quite like that, because when you build an executable, you are building something with a virtual address space that is optimized, but you then have to let the operating system load those pages into physical memory, and nothing makes it load consecutive pages into consecutive physical addresses. So you *still* won't get optimal placement. Some operating systems use a "page coloring" scheme to optimize this for you, but most (including any windows variant) don't. Which is why you can run the same program multiple times (after rebooting and running different combinations of other software) and get different NPS numbers.
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.