Control Hazards
This is lecture from my old class notes; it is more in line with my research
point of view and less consistent with your text, but it is a good alternate
introduction to branch prediction
Control Hazards
Instructions that disrupt the sequential flow of control present problems
for pipelines. The effects of these instructions can't be exactly determined
until late in the pipeline, so instruction fetch can't continue unless we
do something special. The following types of instructions can introduce
control hazards:
- Unconditional branches.
- Conditional branches.
- Indirect branches.
- Procedure calls.
- Procedure returns.
Let's see how control hazards are manifested with an example:
ld r2, 0(r4) // r2 := memory at r4
ld r3, 4(r4) // r3 := memory at r4+4
sub r1, r2, r3 // r1 := r2 - r3
beqz r1, L1 // if r1 is not 0, goto L1
ldi r1, 1 // r1 := 1
L1: not r1, r1 // r1 := not r1
st r1, 0(r5) // store r1 to memory at r5
This code has the effect of comparing two memory locations and storing
the result of that comparison (1 for equal, 0 for not equal) to another
location. If the beqz branch is taken, then a 1 is stored;
otherwise, a 0 is stored. Let's assume we have a five-stage pipeline
as in the last class.
Cycles----->
________________________ Instructions
|_IF_|_ID_|_EX_|_MM_|_WB_|_____ ld r2, 0(r4)
|_IF_|_ID_|_EX_|_MM_|_WB_|_________ ld r3, 4(r4)
|_IF_|_EX |_ID_|_EX_|_MM_|_WB_|____ sub r1, r2, r3
|_IF_|_EX |_ID_|_EX_|_MM_|_WB_|____ beqz r1, L1
|_??_|_?? |_??_|_??_|_??_|_??_|____ ???
|_??_|_?? |_??_|_??_|_??_|_??_|???
beqz and sub
instructions. Since the beqz instruction is a control instruction,
there are two problems here:
- When the beqz instruction is in the decode stage, the
sub instruction is in the execute stage. The branch can't
read the output of the sub until it has been written to the
register file; if it reads it early, it will read the wrong value.
How do we fix this?
- More importantly, one cycle after the beqz instruction is
fetched, what instruction should be fetched next? We don't know, because
that depends on the outcome of the branch instruction. At this point,
we don't even know that the branch instruction depends on the previous
instruction, since it hasn't been decoded yet, so bypass can't help us.
Even if we did know the condition, we still don't know where to fetch
from if the branch is taken because the effective address computation for
branches doesn't happen until the EX stage.
This little example illustrates the problem with branches in pipelines.
Two unresolved issues in the pipeline cause control hazards: branch outcome
(e.g., taken or not taken) and branch target (i.e. computing the next
value for the PC register).
Solutions for Control Hazards
The following are solutions that have been proposed for mitigating aspects
of control hazards:
- Pipeline stall cycles. Freeze the pipeline until the branch outcome and
target are known, then proceed with fetch. Thus, every branch instruction
incurs a penalty equal to the number of stall cycles. This solution is
unsatisfactory if the instruction mix contains many branch instructions,
and/or the pipeline is very deep.
- Branch delay slots. The ISA is constructed such that one or more
instructions sequentially following a conditional branch instruction are
executed whether or not the branch is taken. The compiler or assembly
language writer must fill these branch delay slots with useful
instructions or NOPs (no-operation opcodes). This solution doesn't extend
well to deeper pipelines, and becomes architectural baggage that the ISA
must carry into future implementations.
- Branch prediction. The outcome and target of conditional branches are
predicted using some heuristic. Instructions are speculatively fetched and
executed down the predicted path, but results are not written back to the
register file until the branch is executed and the prediction is verified.
When a branch is predicted, the processor enters a speculative
mode in which results are written to another register file that
mirrors the architected register file. Another pipeline stage called the
commit stage is introduced to handle writing verified speculatively
obtained results back into the "real" register file. Branch predictors can't
be 100% accurate, so there is still a penalty for branches that is based on
the branch misprediction rate.
- Indirect branch prediction. Branches such as virtual method calls,
computed gotos and jumps through tables of pointers can be predicted
using various techniques.
- Return address stack (RAS). Procedure returns are a form of indirect
jump that can be perfectly predicted with a stack as long as the call
depth doesn't exceed the stack depth. Return addresses are pushed onto
the stack at a call and popped off at a return.
Branch Prediction Strategies
Let's take a look at some strategies that have been proposed for conditional
branch prediction. There are two issues: outcome prediction and target
prediction. The most important aspect of a branch prediction strategy is
accuracy. Higher levels of sophistication in branch prediction strategies
lead to higher accuracy.
Branch Direction Prediction
Here are some ideas for branch outcome prediction
(or "branch direction prediction"), the problem of predicting whether a
conditional branch will be taken or not taken.
- Static branch prediction. Static branch prediction means that the
prediction is the same every time a branch is fetched. There are three
levels of static branch prediction strategies:
- Always taken/Always not taken. Every branch is predicted to go
a particular way. Implementing an "always not taken" strategy is
the simplest thing to do, since it eliminates the need for target
prediction. Instructions are fetched and speculatively executed
down the not-taken path. The accuracy of this technique is low;
however, it can be improved with compiler cooperation.
- Heuristics depending on the instruction. For instance, a
popular strategy is "backward taken/forward not taken." Backward
branches are assumed to be loop back edges that are frequently
taken, while forward branches are assumed to be frequently not taken.
This strategy is more accurate.
- Hint bits in the ISA. Some ISAs allow conditional branch
instructions to include hint bits that let the microarchitecture know
which direction the branch is likely to be taken. These hints bits
can be set through smart compiler heuristics, or through training
in a profiling run of the program. This technique achieves about
90% accuracy in the SPEC CPU integer benchmarks.
- Dynamic branch prediction. Every time a branch is executed, a machine
algorithm gives a prediction for the branch. Dynamic branch prediction
can predict branches that static branch prediction can't, for instance,
loop back edges with fixed trip counts. Let's look at some increasingly
complex dynamic branch predictors to see how they achieve high accuracy:
- Bimodal (or "Smith") predictors. A table of bits, indexed
by branch PC, is kept. Each time a branch executes, the corresponding
bit is set to 1 if the branch was taken, 0 otherwise. The next time
the branch needs to be predicted, the corresponding bit is taken
as the prediction. This is like the hint bits in the ISA for
static prediction, but the hint bits are learned on-line, without
a separate profiling run. Note that this system will predict highly
biased branches well, but for instance a loop back edge with a low
trip count will be mispredicted twice: once at the loop exit, and once
the very next time the back edge is encountered.
- Bimodal prediction with hyseresis. Instead of keeping a table
of bits, we keep a table of two-bit counters that saturate at 3 and
0. When a branch is taken, the corredponding counter is incremented.
When the branch is not taken, the counter is decremented. When a branch
needs to be predicted, the high bit of the corredponding counter is
taken as the prediction. The low bit provides hysteresis,
i.e., takes into account the history of the branch beyond just the
last time the branch was executed. So a loop back edge will incur only
one misprediction at a loop exit.
- Two-level adaptive branch prediction. A history shift register
is kept, either for all branches (global) or for each branch (local).
The length of this register is called the history length.
As branches are executed, their outcomes (1 for taken, 0 for not
taken) are shifted into this register. The register is used to
index a table of two-bit saturating counters to do prediction
with hysteresis. Once this scheme has gone through a sufficient
(small) training period, it can perfectly predict loop back edges
with small fixed trip counts. This scheme is highly susceptible
to aliasing in which two unrelated branches contend for
the same counter, resulting in reduced accuracy. Many ways to
address aliasing have been proposed. Two-level branch prediction
is present in all modern general-purpose microprocessors. How many
table entrie are required for a history length of N?
- Hybrid branch prediction. Two or more predictors are combined
to provide a more accurate estimate. For instance, local and
global two-level predictors can be combined, or two global predictors
with different history lengths.
- Other predictors. Neural predictors, predictors based
on decision trees, or finite state machines learned off-line,
combinations of static and dynamic prediction, etc.
Let's look at an example of how two-level prediction works. We'll assume a
history length of 4. Consider the following C code:
int A[100][3];
int main () {
int i, j;
for (i=0; i<100; i++) for (j=0; j<3; j++) A[i][j] = 0;
}
The history of the inner loop looks like this:
110110110110110110...
Let's assume the 4-bit history register for that branch is initialized to
all bits 0. As we go through the loops, the history would shift through
these combinations:
history outcome
0000 1
0001 1
0011 0
0110 1
1101 1
1011 0
0110 1
1101 1
1011 0
0110 1
1101 1
1011 0
...
So the table of 16 counters for that branch would look like this:
0000 01 // only on warm-up
0001 01 // only on warm-up
0010 00 // never happens
0011 00 // never happens
0100 00 // never happens
0101 00 // never happens
0110 11 // predict the back edge taken
0111 00 // never happens
1000 00 // never happens
1001 00 // never happens
1010 00 // never happens
1011 00 // predict the back edge not taken
1100 00 // never happens
1101 11 // predict the back edge taken
1110 00 // never happens
1111 00 // never happens
What about the outer loop? To learn that loop's behavior perfectly, we
would need a history length of at least 100. That would be impossible with
two-level prediction. However, it doesn't hurt us much if we mispredict
that branch once in the entire run of the function; we will predict it
correctly 99% of the time.
Branch Target Prediction
Predicting branch targets can be just as important as predicting branch
directions. For jumps and taken branches, or any other time when the
flow of control is no longer sequential, the address of the next instruction
must be given to the fetch stage as quickly as possible. Some techniques
for predicting targets:
- Return address stack (we have seen).
- Branch target buffer (BTB). Let's think of conditional branches only
for now. A BTB keeps a table of branch targets indexed by the branch PC.
When a branch is taken, the computed target is associated with the branch PC
in the BTB. When the branch needs to be predicted, the predicted target
is taken from the BTB. Note that the BTB only stores taken targets;
non-taken branches simply proceed to the next instruction, like any other
non-control instruction.
The table keeps tags, much like a cache, to ensure that the branch
PC and target address are not just coincidentally aliased to one another.
The tags may be some or all of the bits of the branch PC. BTBs can also
be set-associative, that is, the BTB can be a table of sets of entries,
to get around aliasing problems. Note that this scheme can be directly
extended to non-conditional branches that have only one target.
- Indirect branch prediction. Predicting the targets of indirect branches
such as virtual method calls can be done using a mechanism similar to
two-level adaptive branch prediction. Instead of indexing a table of two-bit
counters, we can index a table of BTB entries with a combination of history
and address. By storing a possibly different address with each history,
we can store multiple targets and get good correlation between history
and target. Note that BTBs that use full branch addresses as tags don't
need to have their targets verified, while indirect branch predictors do.