Computer Chess Club Archives

Subject: Re: Reducing transposition table latency

Author: Gerd Isenberg
Date: 02:05:09 09/28/02

On September 27, 2002 at 23:29:58, Anthony Cozzie wrote:

>Recently, I profiled my chess engine, and one function in particular stood out.
>The transposition probe function takes about 7% of the CPU time, or about 350
>cycles/call.  All it does is access the transposition table, but the random
>nature of the accesses means that it usually misses in the cache AND the TLB,
>thus requiring 2 memory accesses at 100+ cycles each.
>
>In my engine, the search function generates the next move, makes the next move,
>checks if it is legal, checks if the opponent is in check, and recurses, so
>there are two calls to is_check() between when the transposition key is
>available and when the key is used.  I tried inserting a prefetch instruction [I
>run an Athlon] with absolutely no effect.  I even tried following the prefetch
>with a long loop to make SURE it would have enough time to access the memory,
>with no results.  Lastly I tried a MOV instruction, also with no result.  Am I
>just doing something wrong here?
>
>Has anyone else tried something similar with better results?

Good idea, but I haven't tried it yet.

Maybe one cache line isn't enough for you?
Have you tried all the prefetch instructions from the 3DNow! and MMX extensions?
Have you played with the memcpy example from AMD's optimization guide (chapter
5, Memory Copy: Step 8) to see whether prefetching has any effect there?
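
For reference, the prefetch-then-probe pattern under discussion could look roughly like the sketch below. It is only an illustration under assumptions: the 16-byte entry layout and the names TTEntry, tt_init, tt_prefetch, tt_store and tt_probe are hypothetical, and __builtin_prefetch is the GCC/Clang builtin rather than anything from Anthony's engine. The compiler is free to map it to whichever prefetch instruction the target offers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 16-byte entry; the real field layout is an assumption. */
typedef struct {
    uint64_t key;   /* full hash key, used to verify a hit */
    uint64_t data;  /* packed move/score/depth, layout left open */
} TTEntry;

static TTEntry *tt_table = NULL;
static uint64_t tt_mask = 0;

/* Allocate a zeroed table. 'entries' must be a power of two.
   Returns 0 on success, -1 if the allocation fails. */
static int tt_init(size_t entries)
{
    assert(entries != 0 && (entries & (entries - 1)) == 0);
    free(tt_table);
    tt_table = calloc(entries, sizeof *tt_table);
    if (tt_table == NULL)
        return -1;
    tt_mask = (uint64_t)entries - 1;
    return 0;
}

/* Hint only: ask the CPU to start loading the entry's cache line.
   Arguments: address, 0 = prefetch for reading, 3 = keep in all cache levels. */
static void tt_prefetch(uint64_t key)
{
    __builtin_prefetch(&tt_table[key & tt_mask], 0, 3);
}

/* Always-replace store, the simplest possible policy. */
static void tt_store(uint64_t key, uint64_t data)
{
    TTEntry *e = &tt_table[key & tt_mask];
    e->key = key;
    e->data = data;
}

/* Returns 1 and writes *data on a hit, 0 on a miss.
   Note: in this sketch a key of 0 matches an empty slot. */
static int tt_probe(uint64_t key, uint64_t *data)
{
    const TTEntry *e = &tt_table[key & tt_mask];
    if (e->key != key)
        return 0;
    *data = e->data;
    return 1;
}
```

The intended use is to call tt_prefetch(key) as soon as the new key is known (right after the move is made), run the legality and is_check() work, and only then call tt_probe(key, &data). As the manual text quoted below says, an implementation may ignore the hint, so any gain has to be measured.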

Another idea may be to use MOVNTQ for writing hash-entries.
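
A rough sketch of what a non-temporal hash store might look like in C follows. Note the substitution: MOVNTQ is the 64-bit MMX form, while this sketch uses the SSE2 intrinsic _mm_stream_si128 (MOVNTDQ) to write a whole 16-byte entry at once, so it targets processors with SSE2 and falls back to plain stores elsewhere. The names NtEntry and tt_store_nt and the entry layout are hypothetical, and whether bypassing the cache helps or hurts a real engine is untested here.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical 16-byte entry, forced to 16-byte alignment so that one
   128-bit non-temporal store covers it exactly. */
typedef struct {
    _Alignas(16) uint64_t key;
    uint64_t data;
} NtEntry;

/* Write one entry with a non-temporal (cache-bypassing) store when SSE2
   is available, otherwise with ordinary stores. */
static void tt_store_nt(NtEntry *slot, uint64_t key, uint64_t data)
{
    assert(((uintptr_t)slot & 15u) == 0);
#if defined(__SSE2__)
    /* _mm_set_epi64x(high, low): key lands in the low 8 bytes, data above it. */
    __m128i v = _mm_set_epi64x((long long)data, (long long)key);
    _mm_stream_si128((__m128i *)slot, v);
    /* Make the streamed store globally visible before continuing. */
    _mm_sfence();
#else
    slot->key = key;
    slot->data = data;
#endif
}
```

The fence after every store is the conservative choice. The memory copy example quoted below issues a single sfence after the whole loop instead, and an engine could similarly defer it if nothing depends on the write being visible right away.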

Cheers,
Gerd

from:
AMD Extensions to the
3DNow!™ and MMX™ Instruction Sets Manual

AMD Extensions to the MMX™ Instruction Set:
================================================================================
PREFETCHNTA mem8   0Fh 18h /0   Move data closer to the processor using the
                                NTA reference.
PREFETCHT0  mem8   0Fh 18h /1   Move data closer to the processor using the
                                T0 reference.
PREFETCHT1  mem8   0Fh 18h /2   Move data closer to the processor using the
                                T1 reference.
PREFETCHT2  mem8   0Fh 18h /3   Move data closer to the processor using the
                                T2 reference.

The operation of the prefetch instructions is processor implementation
dependent. The instructions can be ignored or changed by a processor
implementation, though they will not change program behavior. The cache line
size is also implementation dependent having a minimum size of 32 bytes.
================================================================================

from:
3DNow!™ Technology Manual
3DNow!™ Instruction Set:
================================================================================
PREFETCH(W) mem8 0F 0Dh Prefetch processor cache line into L1 data cache
(Dcache)
Privilege: none
Registers Affected: none
Flags Affected: none
Exceptions Generated: none

The PREFETCH instruction loads a processor cache line into the data cache. The
address of this line is specified by the mem8 value. For the AMD processor, the
line size is 32 bytes. In all future processors, the size of the line that is
loaded by the PREFETCH instruction will be at least 32-bytes. The PREFETCH
instruction loads a cache line even if the mem8 address is not aligned with the
start of the line (although some implementations, including the AMD-K6 family of
processors, may perform the cache fill starting from the cache miss or mem8
address). If a cache hit occurs (the line is already in the Dcache) or a memory
fault is detected, no bus cycle is initiated and the instruction is treated as a
NOP.
================================================================================
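
On the question of whether one cache line is enough: with the 32-byte line quoted above, a single prefetch covers an entry only if the entry does not straddle a line boundary. The small check below is a sketch of that arithmetic. The helper names lines_touched and entry_lines are made up for illustration, and it assumes the table base itself is aligned to the line size.

```c
#include <assert.h>
#include <stddef.h>

/* Number of cache lines touched by 'size' bytes starting at byte
   'offset' from a line-aligned base. */
static size_t lines_touched(size_t offset, size_t size, size_t line)
{
    assert(size > 0 && line > 0);
    return (offset + size - 1) / line - offset / line + 1;
}

/* Same question for entry number 'index' in a packed array of entries. */
static size_t entry_lines(size_t index, size_t entry_size, size_t line)
{
    return lines_touched(index * entry_size, entry_size, line);
}
```

For example, 16-byte entries never cross a 32-byte line, but with packed 12-byte entries the one at index 2 occupies bytes 24..35 and touches two lines, so a single prefetch would cover only part of it.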

from:
AMD Athlon™ Processor x86 Code Optimization Guide
(chapter 5, Cache and Memory Optimizations)
================================================================================
Memory Copy: Step 8

The MOVNTQ instruction in the previous example improves the
speed of writing the data. The Prefetch Instruction example
uses a prefetch instruction to improve the performance on
reading the data. Prefetching cannot increase the total read
bandwidth, but it can get the processor started on loading the
data to the cache before the data is needed.

Example Code: Prefetch Instruction (prefetchnta)
(bandwidth: ~1250 Mbytes/sec improvement: 12%)

 mov esi, [src]            // source array
 mov edi, [dst]            // destination array
 mov ecx, [len]            // number of QWORDS (8 bytes)
 lea esi, [esi+ecx*8]
 lea edi, [edi+ecx*8]
 neg ecx
 emms
copyloop:
 prefetchnta [esi+ecx*8+512]
 movq mm0, qword ptr [esi+ecx*8]
 movq mm1, qword ptr [esi+ecx*8+8]
 movq mm2, qword ptr [esi+ecx*8+16]
 movq mm3, qword ptr [esi+ecx*8+24]
 movq mm4, qword ptr [esi+ecx*8+32]
 movq mm5, qword ptr [esi+ecx*8+40]
 movq mm6, qword ptr [esi+ecx*8+48]
 movq mm7, qword ptr [esi+ecx*8+56]
 movntq qword ptr [edi+ecx*8], mm0
 movntq qword ptr [edi+ecx*8+8], mm1
 movntq qword ptr [edi+ecx*8+16], mm2
 movntq qword ptr [edi+ecx*8+24], mm3
 movntq qword ptr [edi+ecx*8+32], mm4
 movntq qword ptr [edi+ecx*8+40], mm5
 movntq qword ptr [edi+ecx*8+48], mm6
 movntq qword ptr [edi+ecx*8+56], mm7
 add ecx, 8
 jnz copyloop
 sfence
 emms

....
Use the PREFETCH 3DNow!™ Instruction

In some cases, using the PREFETCH or PREFETCHW
instruction on processors with hardware prefetch may incur a
reduction in performance. In these cases, hardware prefetch
can be disabled using one of the methods described in the
preceding paragraph or the PREFETCH instruction can be
removed. The engineer needs to weigh the measured gains
obtained on non-hardware prefetch enabled processors by using
the PREFETCH instruction, versus any loss in performance on
processors with the hardware prefetcher.

PREFETCH/W versus
PREFETCHNTA/T0/T1/T2
The PREFETCHNTA/T0/T1/T2 instructions in the MMX
extensions are processor implementation dependent. If the
developer needs to maintain compatibility with the 25 million
AMD-K6®-2 and AMD-K6-III processors already sold, use the
3DNow! PREFETCH/W instructions instead of the various
prefetch instructions that are new MMX extensions.
================================================================================