Computer Chess Club Archives



Subject: why loop unrolling isn't as useful on x86 as it once was

Author: Wylie Garvin

Date: 04:29:20 02/07/02


Hi,

  Back when Pentiums and Pentium MMXs ruled the world, it made sense to unroll
loops.  Unrolling could reduce the loop overhead (since branch prediction,
especially on the non-MMX Pentiums, sucked, and earlier processors didn't have
it at all).  It could help find pairs for otherwise non-pairable instructions,
which was important (remember, Pentiums had a U pipe and a V pipe).  Plus the
end of one iteration could be overlapped with the beginning of the next, to
increase pairing opportunities and avoid AGIs (address generation interlocks);
this technique is called convolution.
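  To make that concrete, here is a minimal C sketch of the kind of unrolling I
mean (the function names and the 4x factor are just made up for illustration;
real code also needs a cleanup loop when n isn't a multiple of 4):

```c
#include <stddef.h>

/* Plain loop: one increment, compare and branch per element. */
void scale(int *a, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}

/* Unrolled 4x: the loop overhead is paid once per four elements, and the
 * four independent statements gave the old Pentium's U and V pipes
 * something to pair.  Assumes n is a multiple of 4. */
void scale4(int *a, size_t n, int k)
{
    for (size_t i = 0; i < n; i += 4) {
        a[i]     *= k;
        a[i + 1] *= k;
        a[i + 2] *= k;
        a[i + 3] *= k;
    }
}
```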

  But with modern processors (from the Pentium II onwards), unrolling makes
much less sense.  The processor uses register renaming and out-of-order
execution.  Scheduling instructions in a 4-1-1 pattern for the decoders still
matters (at least before the Pentium 4), but unrolling usually doesn't make
that any easier.  The processor's out-of-order execution means that convolution
happens automagically.  And the branch prediction of modern Pentiums is superb,
so the loop overhead of any non-trivial loop is only a small percentage of the
loop's cost.

  On the other hand, unrolling a loop increases (architecturally visible)
register pressure.  This may interfere with proper instruction scheduling, or
(more likely) cause the compiler to spill one or more temporaries.  The spilled
temporaries will live in the L1 cache, but even a single spilled register can
end up costing *more* than the loop overhead saved by unrolling, which is
likely less than 2 cycles per iteration.
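  As an illustration (the function is hypothetical, and whether the compiler
actually spills depends on the compiler and target), unroll a dot product 8x
with eight accumulators and you're already asking for more registers than
32-bit x86 architecturally has:

```c
#include <stddef.h>

/* Eight accumulators plus two pointers and a counter: that's eleven live
 * values, but 32-bit x86 has only eight architecturally visible integer
 * registers (and esp/ebp are usually spoken for).  Something gets spilled
 * to the stack; it lives in L1, but each spill/reload can cost more than
 * the cycle or two of loop overhead the unrolling saved.
 * Assumes n is a multiple of 8. */
long dot8(const int *a, const int *b, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    long s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    for (size_t i = 0; i < n; i += 8) {
        s0 += (long)a[i]     * b[i];
        s1 += (long)a[i + 1] * b[i + 1];
        s2 += (long)a[i + 2] * b[i + 2];
        s3 += (long)a[i + 3] * b[i + 3];
        s4 += (long)a[i + 4] * b[i + 4];
        s5 += (long)a[i + 5] * b[i + 5];
        s6 += (long)a[i + 6] * b[i + 6];
        s7 += (long)a[i + 7] * b[i + 7];
    }
    return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
}
```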

  The only case where it makes *sense* to unroll a loop when targeting a
Pentium II/III or other modern x86 CPU is when the loop has a long dependence
chain in it (a long sequence of instructions, or a sequence containing
long-latency instructions like divides, where each instruction depends on the
result of the previous one).  In that case, the CPU has to wait for each
instruction's result to be ready before executing the next one.  If you unroll
the loop 2 times (which you can even do by hand, in your C code), you may give
the CPU enough independent work to keep it busy while those long-latency
instructions are completing.  For example, on P2/P3s a multiply has a latency
of about 4 cycles (if I remember correctly ;), but its *throughput* is 1 cycle
(i.e. the CPU can issue a new multiply each cycle and be working on up to 4 of
them simultaneously).
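  In C, breaking the chain looks something like this (hypothetical function
names; note that for floating point the compiler won't do this reassociation
for you without something like -ffast-math, since it can change rounding):

```c
#include <stddef.h>

/* Serial version: every multiply needs the previous product, so each one
 * waits out the full multiply latency. */
double prod(const double *x, size_t n)
{
    double p = 1.0;
    for (size_t i = 0; i < n; i++)
        p *= x[i];
    return p;
}

/* Unrolled 2x into two independent chains: while one multiply is in
 * flight, the CPU can issue the other, hiding most of the latency.
 * Assumes n is even. */
double prod2(const double *x, size_t n)
{
    double p0 = 1.0, p1 = 1.0;
    for (size_t i = 0; i < n; i += 2) {
        p0 *= x[i];
        p1 *= x[i + 1];
    }
    return p0 * p1;
}
```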

  Anyway--premature optimization is the root of all evil!  You shouldn't fool
around with loop unrolling until you have profiled your program and *proved*
that some function with a loop in it is responsible for >10% of CPU usage.  And
then you will *know* that optimizing that little piece is worth it, and you
might write it in assembly or optimize the compiler's assembly or tweak its
switches or find a better algorithm or whatever.  Though I should eat my own
words, since my current hobby is writing a chess engine entirely in assembly.
It will take me 100 times longer than just writing it in C, and be *at most*
2-3x faster.  So far, it doesn't contain a single unrolled loop...  =)

wylie



Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.