Author: Wylie Garvin
Date: 04:29:20 02/07/02
Hi,

Back when Pentiums and Pentium MMXs ruled the world, it made sense to unroll loops. Unrolling could reduce the loop overhead (branch prediction, especially on the non-MMX Pentiums, sucked ass, and earlier processors didn't have it at all). It could help find pairing partners for instructions that would otherwise go unpaired, which was important (remember, Pentiums had a U pipe and a V pipe). Plus, the end of one iteration could be overlapped with the beginning of the next to increase pairing opportunities and avoid AGIs (address generation interlock stalls); this is called convolution.

But with modern processors (from the Pentium II onwards), unrolling makes much less sense. The processor uses register renaming and out-of-order execution. Scheduling instructions in a 4-1-1 pattern for decoding is important (at least before the Pentium 4), but unrolling usually doesn't make that easier. Out-of-order execution means that convolution happens automagically. And the branch prediction of modern Pentiums is superb, so the loop overhead of any non-trivial loop represents only a small percentage of the loop's cost.

On the other hand, unrolling a loop increases (architecturally visible) register pressure. This may interfere with proper instruction scheduling, or (more likely) cause the compiler to spill one or more temporaries. The spilled temporaries will live in the L1 cache, but even a single spilled register can end up costing *more* than the loop overhead saved by unrolling. In fact, the loop overhead saved by unrolling is likely less than 2 cycles.

The only case where it makes *sense* to unroll a loop when targeting a Pentium 2/3 or other modern x86 CPU is when the loop has a long dependence chain in it: a long sequence of instructions, or a sequence containing long-latency instructions like divides, where each instruction depends on the result of the previous one. In that case, the CPU has to wait for each result to be ready before it can execute the next instruction. If you unroll the loop 2 times (which you can even do by hand, in your C code), you may give the CPU enough independent work to keep it busy while those long-latency instructions complete. For example, on P2/P3s a multiply has a latency of like 4 cycles (if I remember correctly ;), but the *throughput* is 1 per cycle (i.e. it can issue a new multiply each cycle and be working on up to 4 of them simultaneously).

Anyway--premature optimization is the root of all evil! You shouldn't fool around with loop unrolling until you have profiled your program and *proved* that some function with a loop in it is responsible for >10% of CPU usage. Then you will *know* that optimizing that little piece is worth it, and you might write it in assembly, hand-tune the compiler's assembly output, tweak its switches, find a better algorithm, or whatever.

Though I should eat my words, since my current hobby is writing a chess engine entirely in assembly, which will take me 100 times longer than just writing it in C and be *at most* 2-3x faster. So far, it doesn't contain a single unrolled loop.. =)

wylie
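A minimal sketch of the by-hand 2x unroll described above, for the long-dependence-chain case: the array-product loop below is made up purely for illustration (it is not code from this post), with the multiply chain standing in for any chain of long-latency instructions.

    /* Straight loop: each multiply needs the previous result, so every
       iteration pays the full multiply latency (roughly 4 cycles on a
       P2/P3, per the figure mentioned above). */
    unsigned product(const unsigned *a, int n)
    {
        int i;
        unsigned p = 1;
        for (i = 0; i < n; i++)
            p *= a[i];
        return p;
    }

    /* Unrolled 2x with two independent partial products: the two chains
       can overlap in the out-of-order core, so the loop runs roughly
       twice as fast until the one-multiply-per-cycle throughput limit
       is reached. */
    unsigned product2(const unsigned *a, int n)
    {
        int i;
        unsigned p0 = 1, p1 = 1;
        for (i = 0; i + 1 < n; i += 2) {
            p0 *= a[i];
            p1 *= a[i + 1];
        }
        if (i < n)          /* leftover element when n is odd */
            p0 *= a[i];
        return p0 * p1;
    }

The point is that the two partial products don't depend on each other, so the CPU can interleave the two multiply chains; this is the kind of rewrite the post says you can do by hand in your C code.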