Author: Gerd Isenberg
Date: 13:33:24 12/06/02
Go up one level in this thread
AMD Athlon Processor x86 Code Optimization Guide TM Implementation of Write Combining (pg 231...) Write-Combining Definitions and Abbreviations This appendix uses the following definitions and abbreviations: ■UC—Uncacheable memory type ■WC—Write-combining memory type ■WT—Writethrough memory type ■WP—Write-protected memory type ■WB—Writeback memory type ■One Byte—8 bits ■One Word—16 bits ■Longword—32 bits (same as a x86 doubleword) ■Quadword—64 bits or 2 longwords ■Octaword—128 bits or 2 quadwords ■Cache Block—64 bytes or 4 octawords or 8 quadwords What is Write Combining? Write combining is the merging of multiple memory write cycles that target locations within the address range of a write buffer. The AMDAthlon processor combines multiple memory-write cycles to a 64-byte buffer whenever the memory address is within a WC or WT memory type region. The processor continues to combine writes to this buffer without writing the data to the system, as long as certain rules apply (see Table 9 on page 234 for more information). Programming Details The steps required for programming write combining on the AMD Athlon processor are as follows: 1. Verify the presence of an AMD Athlon processor by using the CPUID instruction to check for the instruction family code and vendor identification of the processor. Standard function 0 on AMD processors returns a vendor identification string of “AuthenticAMD” in registers EBX, EDX, and ECX. Standard function 1 returns the processor signature in register EAX, where EAX[11–8] contains the instruction family code. For the AMD Athlon processor, the instruction family code is six. 2. In addition, the presence of the MTRRs is indicated by bit 12 and the presence of the PAT extension is indicated by bit 16 of the extended features bits returned in the EDX register by CPUID function 8000_0001h. See the AMD Processor Recognition Application Note, order# 20734 for more details on the CPUID instruction. 3. Write combining is controlled by the MTRRs and PAT. Write combining should be enabled for the appropriate memory ranges. The AMD Athlon processor MTRRs and PAT are compatible with the Pentium ® II. Write-Combining Operations In order to improve system performance, the AMD Athlon processor aggressively combines multiple memory-write cycles of any data size that address locations within a 64-byte write buffer that is aligned to a cache-line boundary. The data sizes can be bytes, words, longwords, or quadwords. WC memory type writes can be combined in any order up to a full 64-byte sized write buffer. WT memory type writes can only be combined up to a fully aligned quadword in the 64-byte buffer, and must be combined contiguously in ascending order. Combining may be opened at any byte boundary in a quadword, but is closed by a write that is either not “contiguous and ascending” or fills byte 7. All other memory types for stores that go through the write buffer (UC and WP) cannot be combined. Combining is able to continue until interrupted by one of the conditions listed in Table 9 on page 234. When combining is interrupted, one or more bus commands are issued to the system for that write buffer, as described by Table 10 on page 235. On December 06, 2002 at 14:04:07, Gerd Isenberg wrote: >Hi All, > >Some brainstorming, inspired by the recent postings of Miguel A. Ballicora, to >serialize a bitboard on block, producing a terminated string of target squares. > >http://www.talkchess.com/forums/1/message.html?268658 > >Another point is the kind of traversing bitboards during move generation, >already discussed recently by the bitboard connection. The nested while loops >over piece sets in the outer and attacking target squares in the inner loop. > >With Floodfill/Kogge-Stone Attack generation the attacked getter routines need a >bitboard (isolated by b&-b, due i need disjoint target sets, even if there are >possibly smarter approaches by Steffan Westcott) as parameter, rather than a >square index as integer. Is it necessary to do bitscans at all? > >As Sune and Russell already mentioned, one need to address the zobrist and >history tables. If one don't use from/to square coordinates, one need a hell >fast hash function (at least faster than two bitscans) that map two isolated >bitboards (difference?) to a quite short value range, in the manner of Walter >Faxon's magig LSB_64. > >http://www.talkchess.com/forums/1/message.html?268531 > >I tried that mod5811, but two parallized mmx-bitscans are clearly faster. > >Miguel's algorithm already acts as a kind of move-generator, if we provide some >approriate set of target squares and some preininialized move structure template >(most lokely 32 bits with 6 empty bits in a row). But Miguel needs two nested >file-rank loops, a lot of conditional stuff. The runtime behaviour may be >dependent on the bit positions. > >-- > >How about this MMX,3DNow! (Athlon only) approach, which is able to do a lot in >parallel and avoids most, if not all conditional jumps. The drawback are >possible multiple writes to the same address. > >What is more expensive, a conditional jump with the risk of branch misprediction >or a additional write with the same index during writing an array? > > >Following steps: >-- >0. some register initialations > // mm0 is the bitboards to serialize > // mm7 5f:7f constant to subtract bit0-offset > // ecx move list int32-pointer > >1. Isolating the bits in both 32-bit dwords. > // mm0 is the bitboards to serialize >doNscans: > pxor mm1, mm1 ; 0 > psubd mm1, mm0 ; -bb 32bitwise > pand mm1, mm0 ; bb & -bb > pxor mm0, mm1 ; reset isolated bits > >2. Two parallel bitscans with base two logarithms via PI2FD. > If no bit is left anymore in one dword we produce > a negative result there, where most upper bits set. > > PI2FD mm1, mm1 ; 0, 3f8..,400.. > pslld mm1, 1 ; clear sign bit > psrld mm1, 24 ; shift 3f8... to 7f (bit 0) > psubd mm1, mm7 ; - 5f:7f, considers +32 in highword > >3. Both dwords in mm1 contain signed square coordinates if >= 0 each. > Both dwords are moved to memory via register pointer ecx unconditionally. > But the pointer is incremented by (sq >= 0) rather than one. > > PMOVMSKB eax,mm1 ; get sign bits 123,567 !vector path instr.! > // time to (p)or some other information (source square) into mm1 > not al ; complement sign bits > mov edx, eax > and al, 0x04 ; 0x04*(ldword >= 0) considers sizeof(int32) > and dl, 0x40 ; 0x40*(hdword >= 0) > movd [ecx], mm1 ; write ldword > add ecx, eax ; moves += (ldword >= 0); > punpckhdq mm1, mm1 ; hdword, like >>= 32 > shr edx, 4 ; 0x04*(hdword >= 0) considers sizeof(int32) > movd [ecx], mm1 ; write hdword > add ecx, edx ; moves += (hdword >= 0); > >4. [optional performance step - loop unrolling] > To gain more parallelism und and to solve the dependency chains, > repeat the above sequence one to three times (2-4 in total), > with mm2...mm5 instead of mm1, and do some clever instruction mixing. > >5. If any bits left in mm0, goto 1.) > > pxor mm1, mm1 ; 0 > pcmpeqd mm1, mm0 ; == 0 > PMOVMSKB eax,mm1 ; get dwords to nibble > cmp eax, 0xff ; all zero ? > jne doNscans > >This step is even not necessary in a specialized version of this routine for >rook and bishop captures, if step 4 containes 3 unrolled repetitions. >This test may also be done at the beginning of the loop. > >6. Implicite popcount (return (ecx - initial pointer)/sizeof(int) ), so no >terminator necessary. > >-- > >There are two sources of parallelism. The first one the parallel SIMD processing >of two dwords per mmx-register. The performance decreases, if both 32bit dwords >have different popcounts. That implies even more dummy writes. > >The second one is the use of multiple mmx registers after resetting the previous >isolated one. But this sucks if the target sets are sparely populated. The >relative performance is obviosly depending on the populations count. > >Best case with four mmx-registers is finding eight squares per iteration and no >dummy write. Worst case (assume != 0) is only finding one square, but seven >dummy writes. No idea about the penaly of multiple writes to the same address. >Isn't it depending on cache strategy (write through, or dirty tag)? > >Due to parallel processing of low and high dwords, the generated target square >order may alternate between the two board halfs. > >Of course, i will try it in IsiChess. Bitboard parameter may be passed through >mmx-register from a previous mmx-attack getter. But in IsiChess during capture >generation, target sets are often rarely populated, due to generating captures >to one kind of piece per time. Have to think about... > >Cheers, >Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.