Branch prediction



         


In computer architecture, a branch predictor is the part of a processor that determines whether a branch in the instruction flow of a program is likely to be taken or not. This is called branch prediction. Branch predictors are crucial in today's modern, superscalar processors for achieving high performance. They allow processors to fetch and execute instructions without waiting for a branch to be resolved.

All pipelined processors do branch prediction of some form, because they must guess the address of the next instruction to fetch before the current instruction has been executed. Earlier microprogrammed CPUs did not have to do branch prediction.

[Top]

Trivial prediction

The early implementations of SPARC and MIPS (two of the first commercial RISC architectures) did trivial branch prediction: they always predicted that a branch would not be taken, so they always fetched the next sequential instruction. Only when a branch or jump was evaluated did the instruction fetch pointer get set to a nonsequential address.

Both CPUs evaluated branches in the decode stage and had a single cycle instruction fetch. As a result, the branch target recurrence was two cycles long, and the machine would always fetch the instruction immediately after any taken branch. Both architectures defined branch delay slots in order to utilize these fetched instructions.

[Top]

Next line prediction

Some superscalar processors, (MIPS R8000, Alpha EV6 and EV8), fetched with each line of instructions a pointer to the next line. This next line predictor is not directly comparable to the other predictors listed here, because it handles both the branch target prediction as well as the branch direction prediction.

When a next line predictor points to aligned groups of 2, 4 or 8 instructions, the branch target will usually not be the first instruction fetched, and so the initial instructions fetched are wasted. Assuming for simplicity a uniform distribution of branch targets, 0.5, 1.5, and 3.5 instructions fetched are discarded, respectively.

Since the branch itself will generally not be the last instruction in an aligned group, instructions after the taken branch (or it's delay slot) will be discarded. Once again assuming a uniform distribution of branch instruction placements, 0.5, 1.5, and 3.5 instructions fetched are discarded.

The discarded instructions at the branch and destination lines add up to nearly a complete fetch cycle, even for a single-cycle next-line predictor.

[Top]

Bimodal branch prediction

A bimodal branch predictor has a table of two-bit saturating counters, indexed with the least significant bits of the instruction addresses. Unlike the instruction cache, bimodal predictor entries typically do not have tags, and so a particular counter may correspond to two branches, in which case it is likely to be less accurate. Each counter has one of four states:

When a branch is evaluated, the corresponding counter is updated. Branches evaluated as not taken decrement the state towards strongly not taken, and branches evaluated as taken increment the state towards strongly taken. The primary benefit of this two bit saturating counter scheme is that loop closing branches are always predicted taken. A one-bit scheme (like the R8000), mispredicts both the first and last branch of a loop. A two-bit scheme mispredicts just the last branch. Similarly, on heavily biased branches which almost always go one way, a one-bit scheme mispredicts twice for each odd branch, and a two-bit scheme mispredicts once.

On the SPEC'89 benchmarks, very large bimodal predictors saturate at 93.5% correct, once every branch maps to a unique counter.

Because the bimodal counter table is indexed with the instruction address bits, a superscalar processor can split the table into separate SRAMs for each instruction fetched, and fetch a prediction for every instruction in parallel with fetching the instruction, so that the branch prediction is available as soon as the branch is decoded.

[Top]

Local branch prediction

Bimodal branch prediction mispredicts the exit of every loop. For loops which tend to have the same loop count every time (and for many other branches with repetitive behavior), we can do much better.

Local branch predictors keep two tables. The first table is the local branch history table. It is indexed by the low-order bits of the branch instruction's address, and it records the taken/not-taken history of the n most recent executions of each branch.

The other table is the pattern history table. Like the bimodal predictor, this table contains bimodal counters; however, its index is generated from the branch history in the first table. To predict a branch, the branch history is looked up, and that history is then used to look up a bimodal counter which makes a prediction.

On the SPEC'89 benchmarks, very large local predictors saturate at 97.1% correct.

Local prediction is slower than bimodal prediction because it requires two sequential table lookups for each prediction. A fast implementation would use a separate bimodal counter array for each instruction fetched, so that the second array access can proceed in parallel with instruction fetch. These arrays are not redundant, as each counter is intended to store the behavior of a single branch.

[Top]

Global branch prediction

Global branch predictors make use of the fact that the behavior of many branches is strongly correlated with the history of other recently taken branches. We can keep a single shift register updated with the recent history of every branch executed, and use this value to index into a table of bimodal counters. This scheme, by itself, is only better than the bimodal scheme for large table sizes, and is never as good as local prediction.

If, instead, we index the table of bimodal counters with the recent history concatenated with a few bits of the branch instruction's address, we get the gselect predictor. Gselect does better than local prediction for small table sizes, and local prediction is only slightly better for table storage larger than 1KB.

We can do slightly better than gselect by XORing the branch instruction address with the global history, rather than concatenating. The result is gshare, which is a little better than gselect for tables larger than 256 bytes.

On the SPEC'89 benchmarks, very large gshare predictors saturate at 96.6% correct, which is just a little worse than large local predictors.

Gselect and gshare are easier to make fast than local prediction, because they require a single table lookup per branch. As with bimodal prediction, the table can be split so that parallel lookups can be made for each instruction fetched, so that the table lookup can proceed in parallel with instruction load.

[Top]

Combined branch prediction






  View Live Article   This article is from Wikipedia. All text is available under the terms of the GNU Free Documentation License