Computer Chess Club Archives

Search

Terms
Messages

Subject: What kind of memory do i have on the heap or stack?

Author: Gerd Isenberg
Date: 13:33:24 12/06/02
AMD Athlon Processor
x86 Code Optimization
Guide
TM

Implementation of
Write Combining (pg 231...)

Write-Combining Definitions and Abbreviations
This appendix uses the following definitions and abbreviations:
&#9632;UC—Uncacheable memory type
&#9632;WC—Write-combining memory type
&#9632;WT—Writethrough memory type
&#9632;WP—Write-protected memory type
&#9632;WB—Writeback memory type
&#9632;One Byte—8 bits
&#9632;One Word—16 bits
&#9632;Longword—32 bits (same as a x86 doubleword)
&#9632;Quadword—64 bits or 2 longwords
&#9632;Octaword—128 bits or 2 quadwords
&#9632;Cache Block—64 bytes or 4 octawords or 8 quadwords

What is Write Combining?

Write combining is the merging of multiple memory write
cycles that target locations within the address range of a write
buffer. The AMDAthlon processor combines multiple
memory-write cycles to a 64-byte buffer whenever the memory
address is within a WC or WT memory type region. The
processor continues to combine writes to this buffer without
writing the data to the system, as long as certain rules apply
(see Table 9 on page 234 for more information).

Programming Details

The steps required for programming write combining on the
AMD Athlon processor are as follows:

1. Verify the presence of an AMD Athlon processor by using
the CPUID instruction to check for the instruction family
code and vendor identification of the processor. Standard
function 0 on AMD processors returns a vendor
identification string of “AuthenticAMD” in registers EBX,
EDX, and ECX. Standard function 1 returns the processor
signature in register EAX, where EAX[11–8] contains the
instruction family code. For the AMD Athlon processor, the
instruction family code is six.

2. In addition, the presence of the MTRRs is indicated by bit
12 and the presence of the PAT extension is indicated by bit
16 of the extended features bits returned in the EDX
register by CPUID function 8000_0001h. See the AMD
Processor Recognition Application Note, order# 20734 for
more details on the CPUID instruction.

3. Write combining is controlled by the MTRRs and PAT.
Write combining should be enabled for the appropriate
memory ranges. The AMD Athlon processor MTRRs and
PAT are compatible with the Pentium ® II.


Write-Combining Operations

In order to improve system performance, the AMD Athlon
processor aggressively combines multiple memory-write cycles
of any data size that address locations within a 64-byte write
buffer that is aligned to a cache-line boundary. The data sizes
can be bytes, words, longwords, or quadwords.
WC memory type writes can be combined in any order up to a
full 64-byte sized write buffer.
WT memory type writes can only be combined up to a fully
aligned quadword in the 64-byte buffer, and must be combined
contiguously in ascending order. Combining may be opened at
any byte boundary in a quadword, but is closed by a write that is
either not “contiguous and ascending” or fills byte 7.
All other memory types for stores that go through the write
buffer (UC and WP) cannot be combined.
Combining is able to continue until interrupted by one of the
conditions listed in Table 9 on page 234. When combining is
interrupted, one or more bus commands are issued to the
system for that write buffer, as described by Table 10 on
page 235.


On December 06, 2002 at 14:04:07, Gerd Isenberg wrote:

>Hi All,
>
>Some brainstorming, inspired by the recent postings of Miguel A. Ballicora, to
>serialize a bitboard on block, producing a terminated string of target squares.
>
>http://www.talkchess.com/forums/1/message.html?268658
>
>Another point is the kind of traversing bitboards during move generation,
>already discussed recently by the bitboard connection. The nested while loops
>over piece sets in the outer and attacking target squares in the inner loop.
>
>With Floodfill/Kogge-Stone Attack generation the attacked getter routines need a
>bitboard (isolated by b&-b, due i need disjoint target sets, even if there are
>possibly smarter approaches by Steffan Westcott) as parameter, rather than a
>square index as integer. Is it necessary to do bitscans at all?
>
>As Sune and Russell already mentioned, one need to address the zobrist and
>history tables. If one don't use from/to square coordinates, one need a hell
>fast hash function (at least faster than two bitscans) that map two isolated
>bitboards (difference?) to a quite short value range, in the manner of Walter
>Faxon's magig LSB_64.
>
>http://www.talkchess.com/forums/1/message.html?268531
>
>I tried that mod5811, but two parallized mmx-bitscans are clearly faster.
>
>Miguel's algorithm already acts as a kind of move-generator, if we provide some
>approriate set of target squares and some preininialized move structure template
>(most lokely 32 bits with 6 empty bits in a row). But Miguel needs two nested
>file-rank loops, a lot of conditional stuff. The runtime behaviour may be
>dependent on the bit positions.
>
>--
>
>How about this MMX,3DNow! (Athlon only) approach, which is able to do a lot in
>parallel and avoids most, if not all conditional jumps. The drawback are
>possible multiple writes to the same address.
>
>What is more expensive, a conditional jump with the risk of branch misprediction
>or a additional write with the same index during writing an array?
>
>
>Following steps:
>--
>0. some register initialations
>        // mm0 is the bitboards to serialize
>	// mm7 5f:7f constant to subtract bit0-offset
>        // ecx move list int32-pointer
>
>1. Isolating the bits in both 32-bit dwords.
>        // mm0 is the bitboards to serialize
>doNscans:
>	pxor	mm1, mm1	; 0
>	psubd	mm1, mm0	; -bb 32bitwise
>	pand	mm1, mm0	; bb & -bb
>	pxor	mm0, mm1	; reset isolated bits
>
>2. Two parallel bitscans with base two logarithms via PI2FD.
>   If no bit is left anymore in one dword we produce
>   a negative result there, where most upper bits set.
>
>	PI2FD	mm1, mm1	; 0, 3f8..,400..
>	pslld	mm1, 1		; clear sign bit
>	psrld	mm1, 24		; shift 3f8... to 7f (bit 0)
>	psubd	mm1, mm7	; - 5f:7f, considers +32 in highword
>
>3. Both dwords in mm1 contain signed square coordinates if >= 0 each.
>   Both dwords are moved to memory via register pointer ecx unconditionally.
>   But the pointer is incremented by (sq >= 0) rather than one.
>
>	PMOVMSKB eax,mm1	; get sign bits 123,567 !vector path instr.!
>        // time to (p)or some other information (source square) into mm1
>	not	al		; complement sign bits
>	mov	edx, eax
>	and	al, 0x04	; 0x04*(ldword >= 0) considers sizeof(int32)
>	and	dl, 0x40	; 0x40*(hdword >= 0)
>	movd	[ecx], mm1	; write ldword
>	add	ecx, eax	; moves += (ldword >= 0);
>	punpckhdq mm1, mm1	; hdword, like >>= 32
>	shr	edx, 4		; 0x04*(hdword >= 0) considers sizeof(int32)
>	movd	[ecx], mm1	; write hdword
>	add	ecx, edx	; moves += (hdword >= 0);
>
>4. [optional performance step - loop unrolling]
>   To gain more parallelism und and to solve the dependency chains,
>   repeat the above sequence one to three times (2-4 in total),
>   with mm2...mm5 instead of mm1, and do some clever instruction mixing.
>
>5. If any bits left in mm0, goto 1.)
>
>	pxor	mm1, mm1	; 0
>	pcmpeqd	mm1, mm0	; == 0
>	PMOVMSKB eax,mm1	; get dwords to nibble
>	cmp	eax, 0xff	; all zero ?
>	jne	doNscans
>
>This step is even not necessary in a specialized version of this routine for
>rook and bishop captures, if step 4 containes 3 unrolled repetitions.
>This test may also be done at the beginning of the loop.
>
>6. Implicite popcount (return (ecx - initial pointer)/sizeof(int) ), so no
>terminator necessary.
>
>--
>
>There are two sources of parallelism. The first one the parallel SIMD processing
>of two dwords per mmx-register. The performance decreases, if both 32bit dwords
>have different popcounts. That implies even more dummy writes.
>
>The second one is the use of multiple mmx registers after resetting the previous
>isolated one. But this sucks if the target sets are sparely populated. The
>relative performance is obviosly depending on the populations count.
>
>Best case with four mmx-registers is finding eight squares per iteration and no
>dummy write. Worst case (assume != 0) is only finding one square, but seven
>dummy writes. No idea about the penaly of multiple writes to the same address.
>Isn't it depending on cache strategy (write through, or dirty tag)?
>
>Due to parallel processing of low and high dwords, the generated target square
>order may alternate between the two board halfs.
>
>Of course, i will try it in IsiChess. Bitboard parameter may be passed through
>mmx-register from a previous mmx-attack getter. But in IsiChess during capture
>generation, target sets are often rarely populated, due to generating captures
>to one kind of piece per time. Have to think about...
>
>Cheers,
>Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.