Recent Articles



































Compiler optimization



         


Compiler optimization is used to improve the efficiency (in terms of running time or resource usage) of the executables output by a compiler. These techniques allow programmers to write source code in a straightforward manner, expressing their intentions clearly, while allowing the computer to make choices about implementation details that lead to efficient execution. Contrary to what the term might imply, this rarely results in executables that are perfectly "optimal" by any measure, only executables that are much improved compared to direct translation of the programmer's original source.

[Top]

Problems with Optimization

Further problems with optimizing compilers are:

This is where so-called "post pass" optimizers come in. These tools take the executable output by an "optimizing" compiler and optimize it even further (see http://www.absint.com/aipop for an example of such tool). As opposed to compilers which optimize intermediate representations of programs, post pass optimizers work on the Assembly language level. Post pass compilers are also limited by this, however, in that much of the information found in the original source code is lost.

Early in the history of compilers, compiler optimizations were not as good as hand-written ones. As compiler technologies have improved, good compilers can often generate better code than human programmers — and good post pass optimizers can improve highly hand-optimized code even further. For the RISC CPU architecture, and even more so for VLIW hardware, compiler optimization is the key for obtaining an optimal code, because the RISC instruction set is so compact that it is hard for a human to manually schedule or combine small instructions to get efficient results. Indeed, these architectures were designed to rely on compiler writers for adequate performance.

[Top]

Types of Optimizations

Techniques in optimization can be broken up along various dimensions:

Local techniques tend to be easier to implement, but result in lesser gains, while global techniques make the opposite tradeoff.

These dimensions aren't completely orthogonal. It is often the case that machine dependent optimizations are local, for example.

The following is an instance of a local machine dependent optimization. To set a register to 0, the obvious way is to use the constant 0 in an instruction that sets a register value to a constant. A less obvious way is to XOR a register with itself. It is up to the compiler to know which instruction variant to use. On many RISC machines, both instructions would be equally appropriate, since they would both be the same length and take the same time. On many other microprocessors such as the Intel x86 family, it turns out that the XOR variant is shorter and probably faster, as there will be no need to decode an immediate operand, nor use the internal "immediate operand register".

[Top]

Factors affecting optimization

The machine itself

Many of the choices about which optimizations can and should be done depend on the characteristics of the target machine. It is sometimes possible to parameterize some of these machine dependent factors, so that a single piece of compiler code can be used to optimize different machines just by altering the machine description parameters. See for example GCC for a compiler that exemplifies this approach.

The architecture of the target CPU
Compilers can schedule, or reorder, instructions so that pipeline stalls occur less frequently.
Here again, instructions have to be scheduled so that the various functional units are fully fed with instructions to execute.
The architecture of the machine
Extract more information from code 
The more information the compiler has, the better it can optimize.
[Top]

Optimization techniques

[Top]

Loop Optimizations

Some optimization techniques primarily designed to operate on loops include:

strength reduction, and also may allow the index variable's definitions to become dead code. This information is also useful for bounds-checking elimination and dependence analysis, among other things.
locality of reference, both of the data being accessed in the loop and the code in the loop's body.
loop fusion or loop combining 
Another technique which attempts to reduce loop overhead. When two adjacent loops would iterate the same number of times (whether or not that number is known at compile time), their bodies can be combined as long as they make no reference to each other's data.
locality of reference, depending on the array's layout.
loop-invariant code motion 
If a quantity is computed inside a loop during every iteration, and its value is the same for each iteration, it can vastly improve efficiency to hoist it outside the loop and compute its value just once before the loop begins. This is particularly important with the address-calculation expressions generated by loops over arrays.
dependencies and thus enable other optimizations.
loop splitting 
Loop splitting attempts to simplify a loop or eliminate dependencies by breaking it into multiple loops which have the same bodies but iterate over different contiguous portions of the index range. A useful special case is loop peeling, which can simplify a loop with a problematic first iteration by performing that iteration separately before entering the loop.
loop unswitching 
Unswitching moves a conditional inside a loop outside of it by duplicating the loop's body, and placing a version of it inside each of the if and else clauses of the conditional.
[Top]

Data-flow Optimizations

Data flow optimizations primarily depend on how certain properties of data are propagated by control edges in the control flow graph. Some of these include:

common subexpression elimination 
In the expression "(a+b)-(a+b)/4", "common subexpression" refers to the duplicated "(a+b)". Compilers implementing this technique realize that "(a+b)" won't change, and as such, only calculate its value once.
constant folding and propagation 
replacing expressions consisting of constants (e.g. "3 + 5") which their final value ("8") rather than doing the calculation in run-time. Used in most modern languages.
[Top]

SSA-based Optimizations

These optimizations are intended to be done after transforming the program into a special form called static single assignment (see SSA form), in which every variable is assigned in only one place. Although some function without SSA, they are most effective with SSA. Many optimizations listed in other sections also benefit with no special changes, such as register allocation.

global value numbering: GVN eliminates redundancy by constructing a value graph of the program, and then determining which values are computed by equivalent expressions. GVN is able to identify some redundancy that common subexpression elimination cannot, and vice versa.

constant propagation, constant folding, and dead code elimination until there is no change, but is much more efficient. This optimization symbolically executes the program, simultaneously propagating constant values and eliminating portions of the control flow graph that this makes unreachable.

[Top]

Back End Optimizations

register allocation: The most frequently used variables should be kept in processor registers for fastest access. To find which variables to put in registers a interference-graph is created. Each variable is a vertex and when two variables are used at the same time (have an intersecting liverange) they have an edge between them. This graph is colored using for example Chaitin's algorithm using the same number of colors as there are registers. If the coloring fails one variable is "spilled" to memory and the coloring is retried.

CISC architectures and those with many addressing modes, offer several different ways of performing a particular operation, using entirely different sequences of instructions. The job of the instruction selector is to do a good job overall of choosing which instructions to implement which operators in the low-level intermediate representation with. For example, on many processors in the 68000 family, complex addressing modes can be used in statements like "lea 25(a1,d5*4), a0", allowing a single instruction to perform a significant amount of arithmetic with less storage.

pipelined processors, which avoids stalls or bubbles in the pipeline by clustering instructions with no dependencies together, while being careful to preserve the original semantics.

[Top]

Functional Language Optimizations

Although many of these also apply to non-functional languages, they either originate in, are most easily implemented in, or are particularly critical in functional languages such as Lisp and ML.

removing recursion 
Recursion is often expensive, as a function call consumes stack space and involves some overhead related to parameter passing and flushing the instruction cache. Tail recursive algorithms can be converted to iteration, which does not have call overhead and uses a constant amount of stack space, through a process called tail recursion elimination.
[Top]

Other Optimizations

Please help separate and categorize these further and create detailed pages for them, especially the more complex ones, or link to one where one exists.

Java, enforce bounds-checking of all array accesses. This is a severe performance bottleneck on certain applications such as scientific code. Bounds-checking elimination allows the compiler to safely remove bounds-checking in many situations where it can determine that the index must fall within valid bounds, for example if it is a simple loop variable.
Branch offset optimization (machine independent): Choose the shortest branch displacement that reaches target
code-block reordering: Code-block reordering alters the order of the basic blocks in a program in order to reduce conditional branches and improve locality of reference.
dead code elimination: Removes instructions that will not affect the behaviour of the program, for example definitions which have no uses, called dead code. This reduces code size and eliminates unnecessary computation.
factoring out of invariants 
If an expression is carried out both when a condition is met and otherwise, it can be written just once outside of the conditional statement. Similarly, if certain types of expressions (e.g. the assignment of a constant into a variable) appear inside a loop, they can be moved out of it because their effect will be the same no matter if they're executed many times or just once. Also known as total redundancy elimination. A more powerful optimization is partial redundancy elimination (PRE).
in-line expansion 
When some code invokes a procedure, it is possible to directly insert the body of the procedure inside the calling code rather than transferring control to it. This saves the overhead related to procedure calls, as well as providing great opportunity for many different parameter-specific optimizations, but comes at the cost of space; the procedure body is duplicated each time the procedure is called inline. Generally, inlining is useful in performance-critical code that makes a large number of calls to small procedures.
jump threading: In this pass, conditional jumps in the code that branch to identical or inverse tests are detected, and can be "threaded" through a second conditional test.
strength reduction 
A general term encompassing optimizations that replace complex or difficult or expensive operations with simpler ones. Induction variable analysis can accomplish this with variables inside a loop which depend on the loop variable. Several lazy evaluation, but can be used only when the tests are not dependent on one another. Short-circuiting semantics can make this difficult.
unreachable code elimination: Unreachable code refers to code which can never be executed. Unreachable code elimination prevents the compiler from emitting instructions for such code, decreasing code size.
[Top]




  View Live Article   This article is from Wikipedia. All text is available under the terms of the GNU Free Documentation License