Author: Gerd Isenberg
Date: 02:05:09 09/28/02
On September 27, 2002 at 23:29:58, Anthony Cozzie wrote:

>Recently, I profiled my chess engine, and one function in particular stood out.
>The transposition probe function takes about 7% of the CPU time, or about 350
>cycles/call. All it does is access the transposition table, but the random
>nature of the accesses means that it usually misses in the cache AND the TLB,
>thus requiring 2 memory accesses at 100+ cycles each.
>
>In my engine, the search function generates the next move, makes the next move,
>checks if it is legal, checks if the opponent is in check, and recurses, so
>there are two calls to is_check() between when the transposition key is
>available and when the key is used. I tried inserting a prefetch instruction [I
>run an Athlon] with absolutely no effect. I even tried following the prefetch
>with a long loop to make SURE it would have enough time to access the memory,
>with no results. Lastly I tried a MOV instruction, also with no result. Am I
>just doing something wrong here?
>
>Has anyone else tried something similar with better results?

Good idea, but I have not tried it yet. Maybe one cache line isn't enough for
you? Have you tried all the prefetch instructions from the 3DNow! and MMX
extensions? Have you played with the memcpy example from AMD's optimization
guide (chapter 5, Memory Copy: Step 8) to see whether prefetching has any
effect there?

Another idea may be to use MOVNTQ for writing hash entries.

Cheers,
Gerd

from: AMD Extensions to the 3DNow!™ and MMX™ Instruction Sets Manual

AMD Extensions to the MMX™ Instruction Set:
================================================================================
PREFETCHNTA mem8   0Fh 18h /0   Move data closer to the processor using the NTA reference.
PREFETCHT0  mem8   0Fh 18h /1   Move data closer to the processor using the T0 reference.
PREFETCHT1  mem8   0Fh 18h /2   Move data closer to the processor using the T1 reference.
PREFETCHT2  mem8   0Fh 18h /3   Move data closer to the processor using the T2 reference.

The operation of the prefetch instructions is processor implementation
dependent. The instructions can be ignored or changed by a processor
implementation, though they will not change program behavior. The cache line
size is also implementation dependent, having a minimum size of 32 bytes.
================================================================================

from: 3DNow!™ Technology Manual

3DNow!™ Instruction Set:
================================================================================
PREFETCH(W) mem8   0Fh 0Dh   Prefetch processor cache line into L1 data cache (Dcache)

Privilege:            none
Registers Affected:   none
Flags Affected:       none
Exceptions Generated: none

The PREFETCH instruction loads a processor cache line into the data cache. The
address of this line is specified by the mem8 value. For the AMD processor, the
line size is 32 bytes. In all future processors, the size of the line that is
loaded by the PREFETCH instruction will be at least 32 bytes. The PREFETCH
instruction loads a cache line even if the mem8 address is not aligned with the
start of the line (although some implementations, including the AMD-K6 family
of processors, may perform the cache fill starting from the cache miss or mem8
address). If a cache hit occurs (the line is already in the Dcache) or a memory
fault is detected, no bus cycle is initiated and the instruction is treated as
a NOP.
================================================================================

from: AMD Athlon™ Processor x86 Code Optimization Guide
(chapter 5, Cache and Memory Optimizations)
================================================================================
Memory Copy: Step 8

The MOVNTQ instruction in the previous example improves the speed of writing
the data. The Prefetch Instruction example uses a prefetch instruction to
improve the performance on reading the data.
Prefetching cannot increase the total read bandwidth, but it can get the
processor started on loading the data to the cache before the data is needed.

Example Code: Prefetch Instruction (prefetchnta)
(bandwidth: ~1250 Mbytes/sec, improvement: 12%)

    mov  esi,[src]              // source array
    mov  edi,[dst]              // destination array
    mov  ecx,[len]              // number of QWORDS (8 bytes)
    lea  esi,[esi+ecx*8]
    lea  edi,[edi+ecx*8]
    neg  ecx
    emms
copyloop:
    prefetchnta [esi+ecx*8+512]
    movq   mm0,qword ptr [esi+ecx*8]
    movq   mm1,qword ptr [esi+ecx*8+8]
    movq   mm2,qword ptr [esi+ecx*8+16]
    movq   mm3,qword ptr [esi+ecx*8+24]
    movq   mm4,qword ptr [esi+ecx*8+32]
    movq   mm5,qword ptr [esi+ecx*8+40]
    movq   mm6,qword ptr [esi+ecx*8+48]
    movq   mm7,qword ptr [esi+ecx*8+56]
    movntq qword ptr [edi+ecx*8],mm0
    movntq qword ptr [edi+ecx*8+8],mm1
    movntq qword ptr [edi+ecx*8+16],mm2
    movntq qword ptr [edi+ecx*8+24],mm3
    movntq qword ptr [edi+ecx*8+32],mm4
    movntq qword ptr [edi+ecx*8+40],mm5
    movntq qword ptr [edi+ecx*8+48],mm6
    movntq qword ptr [edi+ecx*8+56],mm7
    add  ecx,8
    jnz  copyloop
    sfence
    emms

....

Use the PREFETCH 3DNow!™ Instruction

In some cases, using the PREFETCH or PREFETCHW instruction on processors with
hardware prefetch may incur a reduction in performance. In these cases,
hardware prefetch can be disabled using one of the methods described in the
preceding paragraph or the PREFETCH instruction can be removed. The engineer
needs to weigh the measured gains obtained on non-hardware prefetch enabled
processors by using the PREFETCH instruction, versus any loss in performance on
processors with the hardware prefetcher.

PREFETCH/W versus PREFETCHNTA/T0/T1/T2

The PREFETCHNTA/T0/T1/T2 instructions in the MMX extensions are processor
implementation dependent. If the developer needs to maintain compatibility with
the 25 million AMD-K6®-2 and AMD-K6-III processors already sold, use the 3DNow!
PREFETCH/W instructions instead of the various prefetch instructions that are
new MMX extensions.
================================================================================
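
For anyone who wants to experiment with the idea from the quoted question, here is a minimal C sketch of a hash table whose buckets are aligned to one cache line, so that a single prefetch covers every entry the later probe will read. Everything in it is an assumption made for illustration: the names (TTEntry, tt_init, tt_prefetch, tt_probe, tt_store), the 16-byte entry layout, the 64-byte line size, and the use of the GCC/Clang builtin __builtin_prefetch instead of hand-written PREFETCH/PREFETCHNTA. It is not taken from any particular engine. As the manual excerpts above say, the prefetch is only a hint that the processor may ignore, so any gain has to be measured.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* assumption: 64-byte lines; the manuals only promise >= 32 */

/* Hypothetical 16-byte entry, so four entries fill one assumed cache line. */
typedef struct {
    uint64_t key;     /* full hash key, compared on probe; 0 marks an empty slot */
    int16_t  score;
    uint16_t move;
    uint8_t  depth;
    uint8_t  flags;
    uint16_t unused;
} TTEntry;

#define BUCKET_SIZE (CACHE_LINE / sizeof(TTEntry))

static void    *tt_raw;    /* pointer returned by calloc, kept for free() */
static TTEntry *tt_table;  /* tt_raw rounded up to a cache-line boundary */
static uint64_t tt_mask;   /* number of buckets - 1 */

/* Allocate 2^log2_buckets zeroed, line-aligned buckets; returns 0 on success. */
int tt_init(unsigned log2_buckets)
{
    assert(sizeof(TTEntry) == 16);
    size_t buckets = (size_t)1 << log2_buckets;
    free(tt_raw);
    tt_raw = calloc(1, buckets * CACHE_LINE + CACHE_LINE);
    if (!tt_raw) return -1;
    uintptr_t p = ((uintptr_t)tt_raw + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);
    tt_table = (TTEntry *)p;
    tt_mask = buckets - 1;
    return 0;
}

static TTEntry *tt_bucket(uint64_t key)
{
    return tt_table + (key & tt_mask) * BUCKET_SIZE;
}

/* Hint only: call as soon as the key of the new position is known. */
void tt_prefetch(uint64_t key)
{
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(tt_bucket(key), 0 /* read */, 3 /* high temporal locality */);
#else
    (void)key;  /* no portable prefetch available: do nothing */
#endif
}

/* Return the entry stored for key, or NULL on a miss. */
TTEntry *tt_probe(uint64_t key)
{
    TTEntry *b = tt_bucket(key);
    if (key == 0) return NULL;
    for (size_t i = 0; i < BUCKET_SIZE; i++)
        if (b[i].key == key) return &b[i];
    return NULL;
}

/* Overwrite the slot with the same key, else the shallowest slot in the bucket. */
void tt_store(uint64_t key, int16_t score, uint16_t move, uint8_t depth, uint8_t flags)
{
    TTEntry *b = tt_bucket(key), *victim = &b[0];
    for (size_t i = 0; i < BUCKET_SIZE; i++) {
        if (b[i].key == key) { victim = &b[i]; break; }
        if (b[i].depth < victim->depth) victim = &b[i];
    }
    victim->key = key;   victim->score = score; victim->move = move;
    victim->depth = depth; victim->flags = flags;
}
```

In a search loop the intended order would be: make the move, update the key, call tt_prefetch(key), do the legality and is_check() work, and only then call tt_probe(key). Whether that leaves enough time to hide the memory latency, and whether the hint survives a TLB miss, depends on the processor.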
Last modified: Thu, 15 Apr 21 08:11:13 -0700