Author: Gerd Isenberg
Date: 07:54:35 07/22/04
Go up one level in this thread
On July 22, 2004 at 02:19:23, Walter Faxon wrote: >Finding the maximum value (only) in an array: improvements summary. > >(1) As per Bob Hyatt, test while the data is being added to the array. > >(2) As per Daniel Clausen, a small improvement is to start with an endpoint >value from the array. > >(3) Run the loop index downwards towards zero, avoiding the extra comparison >against the array dimension; instead you rely on the "negative" flag being >raised when the index is decremented below zero. (This can also be combined >with unrolling, below.) > >(4) Some chips prefer the use of pointers to indexes, especially depending on >the data sizing. > >(5) Tony Werten's idea is to add a sentinel value to avoid testing the index >unless a new high value is found. In random-ordered arrays this really doesn't >happen that often! But sometimes there is no unique sentinel value handy, or >any place to put it. (A small bug in Werten's code was fixed by Ricardo >Gibert.) > >(6) Ricardo Gibert cleverly avoids using an explicit sentinel by searching in >both directions, with the last maximum value found acting as a sentinel. The >only likely slowdown here is when there are many equal values, which produces >more index tests than Werten's code. > >(7) Some advanced chips have a so-called "conditional move" instruction that can >replace the maximum value without requiring a (slow) conditional branch. In >these cases you'll instead need to terminate the loop using the index value. (A >higher-level language like C requires you to rely on the compiler to implement >conditional moves.) (Can be combined with unrolling, below.) > >(8) Mathieu Page provides a bit hack which obtains the maximum of two signed >integers without using a (slow) conditional branch. But it's slower than a >conditional move if that's available, and less general (signed integers only). > >(9) "Unroll" the loop, having one loop index test control multiple data tests; >you'll need a switch() or some such to deal with odd cases unless the array size >is fixed or an exact multiple of the degree of unrolling. For arrays known to >be very small you can unroll all the way, avoiding index tests altogether except >perhaps for an initial switch() on a variable array size. > >(10) Alvaro Begue's suggestion of unrolling using multiple maximum values might >at first appear illogical since instead comparing everything against a single >value means fewer overall assignments, and you don't have to combine the >results. However, using multiple values also means fewer data dependencies in >the unrolled loop -- very important on current hardware. But it also requires >more registers... > >YMMV! YUM! > >-- Walter Hi Walter, since my focus is actually on SSE2 i like to add some code as well. Unfortunately there is only a 8*16bit PMAXSW, Packed Maximum Signed Words, or a 16*byte PMAXUB, Packed Maximum Unsigned Bytes, but no 32-bit max instruction. So lets use the formula: max(y,x) = x - ( ( (x-y)>>(WORDBITS-1) ) & (x-y)) One xmm register has four 32-bit ints. xmm0 := x3...x0 xmm1 := y3...y0 movdqa xmm2, xmm0 ; x psubd xmm2, xmm1 ; x-y movdqa xmm3, xmm2 ; x-y pshad xmm2, 31 ; (x-y)>>31 signed arithmetic shift pand xmm2, xmm3 psubd xmm0, xmm2 ; max(x,y) That should work for P4 as well for AMD64 in 32-or 64-bit mode. I suggest using four other xmm registers to handle another pair. Let say the int[100] array is an aligned __m128i[25] array: movdqa xmm0, [arr] movdqa xmm4, [arr+1*16] movdqa xmm2, xmm0 movdqa xmm6, xmm4 psubd xmm2, [arr+2*16] psubd xmm6, [arr+3*16] movdqa xmm3, xmm2 movdqa xmm7, xmm6 pshad xmm2, 31 pshad xmm6, 31 pand xmm2, xmm3 pand xmm6, xmm7 psubd xmm0, xmm2 psubd xmm4, xmm6 ... ; nine similar blocks movdqa xmm2, xmm0 movdqa xmm6, xmm4 psubd xmm2, [arr+22*16] psubd xmm6, [arr+23*16] movdqa xmm3, xmm2 movdqa xmm7, xmm6 pshad xmm2, 31 pshad xmm6, 31 pand xmm2, xmm3 pand xmm6, xmm7 psubd xmm0, xmm2 psubd xmm4, xmm6 movdqa xmm2, xmm0 psubd xmm2, [arr+24*16] movdqa xmm3, xmm2 pshad xmm2, 31 pand xmm2, xmm3 psubd xmm0, xmm2 // final vector max movdqa xmm2, xmm0 psubd xmm2, xmm4 movdqa xmm3, xmm2 pshad xmm2, 31 pand xmm2, xmm3 psubd xmm0, xmm2 // horizontal max movdqa xmm3, xmm0 movdqa xmm2, xmm0 psrlq xmm3, 32 psubd xmm2, xmm3 movdqa xmm3, xmm2 pshad xmm2, 31 pand xmm2, xmm3 psubd xmm0, xmm2 movdqa xmm3, xmm0 movdqa xmm2, xmm0 psrldq xmm3, 8 psubd xmm2, xmm3 movdqa xmm3, xmm2 pshad xmm2, 31 pand xmm2, xmm3 psubd xmm0, xmm2 ; max is in low dword of xmm0 It may really pay off in future AMD64 cores, if SSE2 become at least two times faster. Cheers, Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.