Lecture 3: Pipelining and Instruction-Level Parallelism
Many things that we study in computer structures are of the following
form: Here's a great idea, but there are all these problems in actually
implementing it so let's spend 1% of the time admiring the great idea and
99% of the time figuring out how to get it to work. Pipelining is a great
example of this.
Motivating Example
The goal with pipelining is to speed things up. More specifically, we want
to increase the throughput of the system, where throughput is defined as
the amount of work performed per unit time. Let's look at a very simple
example of how pipelining does this:
The Laundry Example
Suppose the work we want to do is laundry. There are three main steps in
this process:
- (W) Wash the clothes in the washing machine (takes 35 minutes)
- (D) Dry the clothes in the dryer (takes 45 minutes)
- (F) Fold and store the clothes (takes 20 minutes)
We have a lot of clothes to wash, so not all clothes will fit into the
washing machine. Thus, we will have to divide the work into several loads.
One way to wash all the clothes is sequentially:
W-D-F-W-D-F-W-D-F-W-D-F-W-D-F.....
Suppose that the washer can accommodate 20 items of clothing in one load,
and that we have 200 items total to be washed. How long will it take to
wash them? We will have to do 10 loads, and each one will have a latency
of 100 minutes. Thus, it will take 1000 minutes to wash all the clothes.
This works out to an average of 0.2 items of clothing per minute, or 12
items per hour.
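The sequential-laundry arithmetic above can be checked with a short sketch (all the numbers come straight from the text):

```python
# Sequential laundry: every load runs wash, dry, and fold to completion
# before the next load starts.
import math

WASH, DRY, FOLD = 35, 45, 20   # minutes per step, per load
LOAD_SIZE = 20                 # items of clothing per load
TOTAL_ITEMS = 200

loads = math.ceil(TOTAL_ITEMS / LOAD_SIZE)       # 10 loads
latency_per_load = WASH + DRY + FOLD             # 100 minutes per load
sequential_time = loads * latency_per_load       # 1000 minutes total
throughput = TOTAL_ITEMS / sequential_time * 60  # 12 items per hour
```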
Observe that, while the dryer is drying clothes, the washer is idle.
Similarly, while we are folding the laundry, both the dryer and washer
are idle. Right after we wash the first load, we place it in the dryer
and can immediately start another load. When the dryer is finished, we
can immediately take out the clothes, place them on the folding table,
and then put the (now finished) load from the washer into the dryer.
The process looks like this:
W-W-W-W-W-W ...
  D-D-D-D-D-D ...
    F-F-F-F-F-F ...
The washing, drying, and folding proceed in parallel. How long will it
take to wash all 200 items? The longest phase is drying at 45 minutes,
and we will have to dry 10 loads, so the drying time will be 450 minutes.
We have to add the latency for the first wash (35 minutes) and the last
fold (20 minutes), during which no drying is occurring, which accounts for
another 55 minutes. So the total time is 505 minutes, or an average of
0.396 items of clothing per minute, or 23.76 items per hour. This rate
represents a speedup of almost a factor of two over the previous case.
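The pipelined schedule can be checked the same way: the dryer, as the slowest stage, sets the rate, and we add the first wash and the last fold at either end:

```python
# Pipelined laundry: the dryer is the bottleneck stage, so total time is
# first wash + one dry per load + last fold.
WASH, DRY, FOLD = 35, 45, 20   # minutes per step, per load
LOADS, TOTAL_ITEMS = 10, 200

pipelined_time = WASH + LOADS * DRY + FOLD       # 505 minutes
throughput = TOTAL_ITEMS / pipelined_time * 60   # ~23.76 items per hour
speedup = 1000 / pipelined_time                  # ~1.98x over sequential
```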
Note that we haven't actually sped up the laundering of a single item:
it still takes the same amount of time for a single shirt to enter and
leave the system. So, if you really must have that favorite blue shirt
in one hour, you are still out of luck. But the throughput of the whole
system has been improved.
This system is a three-stage pipeline. Pipelining is a very old trick to
increase performance for tasks that can be divided into independent stages.
It's known in manufacturing as using an assembly line. It's how cars
are made, for instance. Even though the latency of a single task pushed
through the entire pipeline may become worse due to overhead, the throughput
of the system is increased because the pipeline stages are overlapping,
i.e. they are running in parallel.
Pipelined CPUs
Pipelining has been applied to CPUs since the late 1950s, beginning with
the IBM 7030 'Stretch' processor. Once transistor budgets grew large enough
to fit a pipelined design on a single chip, around the early 1980s, pipelined
microprocessors began
to appear. Let's consider a simple five-stage pipeline for a RISC
microprocessor:
- IF: Instruction fetch. Fetch the instruction from memory at the address
in the program counter (PC), then update the PC.
- ID: Instruction decode. Read the source registers named in the
instruction from the register file. If the instruction is a jump,
add the sign-extended PC-relative offset to the program counter.
- EX: Execute the ALU instruction, or generate the effective address
for a memory operation. Feed the ALU (arithmetic/logic unit) the register
operands read in the previous stage and produce a result.
- MM: If the instruction is a load or store, access the memory through
the effective address generated in the previous stage.
- WB: Write register values generated in the EX or MM stages back to
the register file.
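The overlap of these stages can be sketched with a toy model (a deliberate simplification that ignores hazards entirely): with nothing in the way, instruction i occupies stage s during cycle i + s.

```python
# Toy model of the classic five-stage pipeline with no hazards:
# instruction i runs stage s in cycle i + s, so in steady state all
# five stages are busy with five different instructions.
STAGES = ["IF", "ID", "EX", "MM", "WB"]

def schedule(n_instructions):
    """Map each instruction to the cycle in which it runs each stage."""
    return [{stage: i + s for s, stage in enumerate(STAGES)}
            for i in range(n_instructions)]

sched = schedule(4)
# Instruction 0 writes back in cycle 4; n instructions finish in n + 4 cycles.
```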
How does this help make things go faster? By helping to increase
the clock frequency.
Clock Frequency
- The clock. A CPU, like many other kinds of digital circuits, marches
through tasks to the beat of the clock. Every time the clock ticks,
a new set of events occurs: some results are generated, some values are
transmitted across busses, etc.
- The clock frequency. The clock frequency tells us how often the clock
can tick. The faster the clock frequency is, the higher the throughput
of the microprocessor will be. Clock frequency is measured in cycles
per second (hertz), or, these days, billions of cycles per second (GHz).
- The clock period. The clock period is simply the inverse of the
clock frequency. It's measured in seconds per clock cycle, or, these days,
picoseconds per clock. It tells us the maximum gate delay that any pipeline
stage may have, including latch delay for the special registers that buffer
results from one stage to the next.
- Gate depth and delay. The clock frequency depends on the depth of
the circuits being clocked. If signals must propagate serially through many
logic gates in a single clock cycle, the clock will be slower than if
there are only a few gates in series.
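As a worked example of the relationship between gate depth, clock period, and clock frequency (every number here is made up for illustration, not a measurement of any real design):

```python
# Hypothetical numbers: each gate adds 10 ps of delay, the deepest path
# through a stage is 25 gates, and the pipeline latch adds 30 ps.
GATE_DELAY_PS = 10   # assumed delay per logic gate
DEPTH = 25           # gates in series on the critical path
LATCH_PS = 30        # pipeline-register (latch) overhead

period_ps = DEPTH * GATE_DELAY_PS + LATCH_PS   # 280 ps per cycle
freq_ghz = 1000 / period_ps                    # ~3.57 GHz
```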
Pipelining Increases the Clock Frequency
Imagine the circuitry of a simple processor. It must have gates that
do all five stages of the five-stage pipeline I mentioned, even if it
isn't pipelined. The clock signal must flow from the beginning of the
circuit through the maximum-depth path of the circuit before the clock can
tick again. If we divide the circuitry into five independent and balanced
stages, the length of this path is divided by five, so the clock frequency
can be multiplied by five. If we can find a way to divide the work into
ten stages, then the clock can be multiplied by ten. This is the way it
works ideally; in practice improvement is more modest. Some barriers to
this "perfect" clock improvement are:
- Finding balance. It's difficult to divide the work of executing
instructions into n stages that each have exactly the same gate
delay, namely 1/n of the original design's delay. The clock
frequency is limited by the delay of the deepest stage.
- Latch delay. Pipeline implementation includes latches or
pipeline registers between each stage that communicate results
from one stage to the next. As pipelines become deeper and clock rates
increase, the delay of these latches becomes a significant component of
the clock period.
- Power. As the clock rate increases, the number of switching events
per second in the processor increases. Each switching event consumes a
certain amount of energy and releases a certain amount of heat. We need to
make sure that at any instant, the power supply is capable of supplying the
energy and the heat being generated can be efficiently dissipated through
the package and out of the equipment. Improvements in cooling and power
supplies (e.g. batteries) come much more slowly than improvements in clock
rate, so power and energy limit clock rates in today's processors.
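The first two barriers can be made concrete with a small sketch (all delays here are illustrative): the clock period is set by the deepest stage plus latch delay, so an n-way split rarely delivers an n-fold clock improvement.

```python
# Why a 5-stage split does not give a 5x clock: imbalance and latch delay.
# Delays are in picoseconds and purely illustrative.
UNPIPELINED_DELAY = 1000                    # delay through the whole datapath
LATCH_DELAY = 30                            # per pipeline register
stage_delays = [230, 180, 220, 170, 200]    # an imperfectly balanced split

period = max(stage_delays) + LATCH_DELAY    # 260 ps, not the ideal 200 ps
clock_speedup = UNPIPELINED_DELAY / period  # ~3.85x, short of the ideal 5x
```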
Why Pipelining Works: Instruction-Level Parallelism
Pipelines work because many instructions executing sequentially are doing
independent tasks that can be done in parallel. This kind of parallelism
is called instruction-level parallelism (ILP). In the simple
pipeline we have seen, ILP can be hard to come by; however, there are many
tricks people have invented for squeezing more ILP out of the instruction
stream, like instruction reordering and speculation.
Obstacles to Pipelining: Hazards
We have seen a few physical limitations to pipelining. However, the
three main difficulties with pipelining have to do with the nature of the
instruction stream being executed. These hazards can prevent
a pipeline stage from correctly carrying out its purpose.
- Structural hazards. These occur when instructions contend for the
same resources in the CPU. For instance, if the register file has only
one write port, but for some reason the instruction stream has generated
two writes to the register file in a single cycle, one of the offending
pipeline stages will have to wait. Structural hazards can often be solved
by throwing more hardware at the problem, with the penalty of increased
gate count, complexity, and possibly delay.
- Data hazards. This happens when an instruction in the pipeline
depends on data from another instruction that is also in the pipeline.
For instance, consider these two instructions:
i add r1, r2, r3 // r1 := r2 + r3
i+1 add r4, r1, r5 // r4 := r1 + r5
r1 is needed by instruction i+1, but the value of r1 is modified by
instruction i and won't be written back to the register file before
instruction i+1 reads its operands. There are many techniques
for solving this problem. Forwarding (or bypass) is
the main technique.
- Control hazards. This happens when a control-flow transfer instruction
depends on results that are not ready yet. For instance, every conditional
branch presents a control flow hazard, since the condition isn't available
in time to fetch the next instruction from the right place.
Structural hazards can sometimes be solved by throwing more hardware at
the problem, e.g., adding another functional unit, but it is usually not
that simple because adding extra resources increases the amount of area
and communication required.
Control hazards and other issues related to instruction fetch are so
important that the entire next lecture will be devoted to them.
The rest of the lecture will be devoted to data hazards.
Dependences
As an introduction to data hazards, we will see the different ways that
instructions can be dependent on one another. Note that dependences
are a property of programs. Not all dependences will affect the pipeline;
we are really interested in dependence in a small window of instructions
for pipelines. However, the compiler uses dependence information over a
much larger region to produce more efficient code.
- Data dependence. Also called true dependence or
flow dependence. This is where data needed by one instruction is
produced by a previous instruction, or where data needed by one instruction
flows through a chain of dependent instructions from some source.
- Name dependences. This type of dependence occurs when
two instructions use the same register or memory location, but there is
no flow of data between the instructions. For instance, two instructions
in the pipeline may both use r1 for unrelated temporary
computations. There are two types of name dependences:
- Anti-dependence. This occurs when one instruction
writes a register that an earlier instruction reads. For
instance:
add r1, r2, r3 // r1 := r2 + r3
add r3, r4, r5 // r3 := r4 + r5
There is an anti-dependence between the two instructions because the
second instruction writes a register r3 that is used by
the first instruction. The processor must guarantee that the first
instruction reads the correct value before the second instruction
overwrites it.
- Output dependence. This occurs when two instructions
both write the same register. The processor must guarantee that
the register ends up with the value from the second instruction.
For instance:
add r3, r2, r1 // r3 := r2 + r1
st r3, 0(r6) // store r3 to memory address r6
add r3, r4, r5 // r3 := r4 + r5
Here, r3 is being used for two different purposes.
We must ensure that, at the end of this code, the register file
entry for r3 is updated with the result from the second
add instruction.
When these dependences occur in such a way that they are exposed to the
pipeline, three different types of data hazards may occur:
- RAW, or read-after-write. An instruction tries to read an operand
before a previous instruction has a chance to write it. This is caused
by a true dependence.
- WAW, or write-after-write. An instruction tries to write an operand
before a previous instruction has a chance to write it. Once the previous
instruction writes it, the operand is left with the previous and now wrong
value. This is caused by an output dependence.
- WAR, or write-after-read. An instruction writes to an operand before
it can be read by a previous instruction, so the previous instruction
incorrectly gets the new value. This is caused by an anti-dependence.
(What about RAR?)
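The three hazard types can be identified mechanically from the read and write sets of a pair of instructions, which is how a pipeline's hazard-detection logic works in principle. A minimal sketch (the register names are just the examples from the text):

```python
# Classify the hazards between an earlier and a later instruction from
# the sets of registers each one reads and writes.
def hazards(earlier_writes, earlier_reads, later_writes, later_reads):
    found = set()
    if earlier_writes & later_reads:
        found.add("RAW")   # true (flow) dependence
    if earlier_reads & later_writes:
        found.add("WAR")   # anti-dependence
    if earlier_writes & later_writes:
        found.add("WAW")   # output dependence
    return found

# The RAW example from the text: add r1,r2,r3 followed by add r4,r1,r5.
raw = hazards({"r1"}, {"r2", "r3"}, {"r4"}, {"r1", "r5"})
```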
Solutions to Data Hazards
The following solutions have been proposed and used for solving the problem
of data hazards:
- Pipeline stall cycles. Can resolve any type of hazard. Freeze the
pipeline up to the dependent stage until the hazard is resolved. Example:
add r1, r2, r3
add r4, r1, r5
Cycles----->
 ________________________                 Instructions
|_IF_|_ID_|_EX_|_MM_|_WB_|                add r1, r2, r3
     |_IF_|_x_x_x_x_|_ID_|_EX_|_MM_|_WB_| add r4, r1, r5
           stall cycles
Once all the dependences of an instruction are satisfied, we issue
the instruction, i.e., we allow it to proceed.
- Forwarding (bypass). If the data is available elsewhere in
the pipeline, then there is no need to stall. When the dependence is
detected, the data is forwarded directly to the consuming pipeline stage.
This reduces stall cycles, and sometimes eliminates them entirely.
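Stall counts with and without forwarding can be sketched in a small model. Two assumptions here are mine rather than the text's: the register file cannot write and read the same register in one cycle, and forwarding provides an EX-to-EX bypass (so the exact counts may differ from a diagram drawn under other assumptions).

```python
# Stall cycles for a consumer issued `distance` instructions after its
# producer in the five-stage pipeline (distance=1 means back-to-back).
STAGE = {"IF": 0, "ID": 1, "EX": 2, "MM": 3, "WB": 4}

def stalls(distance, forwarding):
    if forwarding:
        # EX-to-EX bypass: result usable the cycle after the producer's EX,
        # needed when the consumer enters EX.
        ready = STAGE["EX"] + 1
        needed = distance + STAGE["EX"]
    else:
        # Value usable only the cycle after the producer's WB, needed when
        # the consumer reads registers in ID.
        ready = STAGE["WB"] + 1
        needed = distance + STAGE["ID"]
    return max(0, ready - needed)

no_fwd = stalls(1, forwarding=False)   # back-to-back pair must wait for WB
with_fwd = stalls(1, forwarding=True)  # bypass removes the stalls
```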
Consider the following C function:
int a, b, c, d, e, f, g;
void foo (void) {
d = e * f;
a = b * c;
g = a + d;
}
It may be compiled to something like this:
foo:
movl e,%edx // d := e * f
imull f,%edx
movl %edx,d
movl b,%eax // a := b * c
imull c,%eax
movl %eax,a
addl %edx,%eax // g := a + d
movl %eax,g
ret
What are some of the data dependences?
How would some of these dependences manifest themselves as data hazards?
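One way to start exploring these questions is to scan instruction pairs mechanically. The sketch below models each instruction in the compiled sequence by the locations it reads and writes (registers and the globals alike). Note that this naive pairwise scan ignores intervening redefinitions, so it over-reports some dependences; it is a starting point, not a full analysis.

```python
# Each entry: (instruction text, set of locations read, set of locations
# written), following the compiled sequence for foo() above.
instrs = [
    ("movl e,%edx",    {"e"},          {"edx"}),
    ("imull f,%edx",   {"f", "edx"},   {"edx"}),
    ("movl %edx,d",    {"edx"},        {"d"}),
    ("movl b,%eax",    {"b"},          {"eax"}),
    ("imull c,%eax",   {"c", "eax"},   {"eax"}),
    ("movl %eax,a",    {"eax"},        {"a"}),
    ("addl %edx,%eax", {"edx", "eax"}, {"eax"}),
    ("movl %eax,g",    {"eax"},        {"g"}),
]

deps = []
for i, (_, reads_i, writes_i) in enumerate(instrs):
    for j in range(i + 1, len(instrs)):
        _, reads_j, writes_j = instrs[j]
        if writes_i & reads_j:
            deps.append(("RAW", i, j))   # true dependence
        if reads_i & writes_j:
            deps.append(("WAR", i, j))   # anti-dependence
        if writes_i & writes_j:
            deps.append(("WAW", i, j))   # output dependence
```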