Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: A silly question about C (Max) -- Summary

Author: Gerd Isenberg

Date: 07:54:35 07/22/04

Go up one level in this thread


On July 22, 2004 at 02:19:23, Walter Faxon wrote:

>Finding the maximum value (only) in an array:  improvements summary.
>
>(1) As per Bob Hyatt, test while the data is being added to the array.
>
>(2) As per Daniel Clausen, a small improvement is to start with an endpoint
>value from the array.
>
>(3) Run the loop index downwards towards zero, avoiding the extra comparison
>against the array dimension; instead you rely on the "negative" flag being
>raised when the index is decremented below zero.  (This can also be combined
>with unrolling, below.)
>
>(4) Some chips prefer the use of pointers to indexes, especially depending on
>the data sizing.
>
>(5) Tony Werten's idea is to add a sentinel value to avoid testing the index
>unless a new high value is found.  In random-ordered arrays this really doesn't
>happen that often!  But sometimes there is no unique sentinel value handy, or
>any place to put it.  (A small bug in Werten's code was fixed by Ricardo
>Gibert.)
>
>(6) Ricardo Gibert cleverly avoids using an explicit sentinel by searching in
>both directions, with the last maximum value found acting as a sentinel.  The
>only likely slowdown here is when there are many equal values, which produces
>more index tests than Werten's code.
>
>(7) Some advanced chips have a so-called "conditional move" instruction that can
>replace the maximum value without requiring a (slow) conditional branch.  In
>these cases you'll instead need to terminate the loop using the index value.  (A
>higher-level language like C requires you to rely on the compiler to implement
>conditional moves.)  (Can be combined with unrolling, below.)
>
>(8) Mathieu Page provides a bit hack which obtains the maximum of two signed
>integers without using a (slow) conditional branch.  But it's slower than a
>conditional move if that's available, and less general (signed integers only).
>
>(9) "Unroll" the loop, having one loop index test control multiple data tests;
>you'll need a switch() or some such to deal with odd cases unless the array size
>is fixed or an exact multiple of the degree of unrolling.  For arrays known to
>be very small you can unroll all the way, avoiding index tests altogether except
>perhaps for an initial switch() on a variable array size.
>
>(10) Alvaro Begue's suggestion of unrolling using multiple maximum values might
>at first appear illogical since instead comparing everything against a single
>value means fewer overall assignments, and you don't have to combine the
>results.  However, using multiple values also means fewer data dependencies in
>the unrolled loop -- very important on current hardware.  But it also requires
>more registers...
>
>YMMV!  YUM!
>
>-- Walter

Hi Walter,

since my focus is actually on SSE2 i like to add some code as well.
Unfortunately there is only a 8*16bit PMAXSW, Packed Maximum Signed Words, or a
16*byte PMAXUB, Packed Maximum Unsigned Bytes, but no 32-bit max instruction.

So lets use the formula:

max(y,x) = x - ( ( (x-y)>>(WORDBITS-1) ) & (x-y))

One xmm register has four 32-bit ints.

   xmm0 := x3...x0
   xmm1 := y3...y0

   movdqa xmm2, xmm0  ; x
   psubd  xmm2, xmm1  ; x-y
   movdqa xmm3, xmm2  ; x-y
   pshad  xmm2, 31    ; (x-y)>>31 signed arithmetic shift
   pand   xmm2, xmm3
   psubd  xmm0, xmm2  ; max(x,y)

That should work for P4 as well for AMD64 in 32-or 64-bit mode.
I suggest using four other xmm registers to handle another pair.
Let say the int[100] array is an aligned __m128i[25] array:

   movdqa xmm0, [arr]
   movdqa xmm4, [arr+1*16]

   movdqa xmm2, xmm0
   movdqa xmm6, xmm4
   psubd  xmm2, [arr+2*16]
   psubd  xmm6, [arr+3*16]
   movdqa xmm3, xmm2
   movdqa xmm7, xmm6
   pshad  xmm2, 31
   pshad  xmm6, 31
   pand   xmm2, xmm3
   pand   xmm6, xmm7
   psubd  xmm0, xmm2
   psubd  xmm4, xmm6

   ... ; nine similar blocks

   movdqa xmm2, xmm0
   movdqa xmm6, xmm4
   psubd  xmm2, [arr+22*16]
   psubd  xmm6, [arr+23*16]
   movdqa xmm3, xmm2
   movdqa xmm7, xmm6
   pshad  xmm2, 31
   pshad  xmm6, 31
   pand   xmm2, xmm3
   pand   xmm6, xmm7
   psubd  xmm0, xmm2
   psubd  xmm4, xmm6

   movdqa xmm2, xmm0
   psubd  xmm2, [arr+24*16]
   movdqa xmm3, xmm2
   pshad  xmm2, 31
   pand   xmm2, xmm3
   psubd  xmm0, xmm2

   // final vector max
   movdqa xmm2, xmm0
   psubd  xmm2, xmm4
   movdqa xmm3, xmm2
   pshad  xmm2, 31
   pand   xmm2, xmm3
   psubd  xmm0, xmm2

   // horizontal max
   movdqa xmm3, xmm0
   movdqa xmm2, xmm0
   psrlq  xmm3, 32
   psubd  xmm2, xmm3
   movdqa xmm3, xmm2
   pshad  xmm2, 31
   pand   xmm2, xmm3
   psubd  xmm0, xmm2

   movdqa xmm3, xmm0
   movdqa xmm2, xmm0
   psrldq  xmm3, 8
   psubd  xmm2, xmm3
   movdqa xmm3, xmm2
   pshad  xmm2, 31
   pand   xmm2, xmm3
   psubd  xmm0, xmm2  ; max is in low dword of xmm0

It may really pay off in future AMD64 cores,
if SSE2 become at least two times faster.

Cheers,
Gerd



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.