Lecture 3: Pipelining and Instruction-Level Parallelism

Many things that we study in computer structures are of the following form: Here's a great idea, but there are all these problems in actually implementing it so let's spend 1% of the time admiring the great idea and 99% of the time figuring out how to get it to work. Pipelining is a great example of this.

Motivating Example

The goal with pipelining is to speed things up. More specifically, we want to increase the throughput of the system, where throughput is defined as the amount of work performed per unit time. Let's look at a very simple example of how pipelining does this:

The Laundry Example

Suppose the work we want to do is laundry. There are three main steps in this process: washing (35 minutes), drying (45 minutes), and folding (20 minutes). We have a lot of clothes to wash, so not all clothes will fit into the washing machine. Thus, we will have to divide the work into several loads. One way to wash all the clothes is sequentially:
W-D-F-W-D-F-W-D-F-W-D-F-W-D-F.....
Suppose that the washer can accommodate 20 items of clothing in one load, and that we have 200 items total to be washed. How long will it take to wash them? We will have to do 10 loads, and each one will have a latency of 100 minutes. Thus, it will take 1000 minutes to wash all the clothes. This works out to an average of 0.2 items of clothing per minute, or 12 items per hour.

Observe that, while the dryer is drying clothes, the washer is idle. Similarly, while we are folding the laundry, both the dryer and the washer are idle. We can do better by overlapping the loads: right after we wash the first load, we place it in the dryer and can immediately start another load in the washer. When the dryer is finished, we can immediately take out the clothes, place them on the folding table, and then put the (now finished) load from the washer into the dryer. The process looks like this:

W-W-W-W-W-W
 D-D-D-D-D-D
  F-F-F-F-F-F
The washing, drying, and folding proceed in parallel. How long will it take to wash all 200 items? The longest phase is drying at 45 minutes, and we will have to dry 10 loads, so the drying time will be 450 minutes. We have to add the latency for the first wash (35 minutes) and the last fold (20 minutes), during which no drying is occurring, which accounts for another 55 minutes. So the total time is 505 minutes, or an average of 0.396 items of clothing per minute, or 23.76 items per hour. This rate represents a speedup of almost a factor of two over the previous case.

Note that we haven't actually sped up the laundering of a single item: it still takes the same amount of time for a single shirt to enter and leave the system. So, if you really must have that favorite blue shirt in one hour, you are still out of luck. But the throughput of the whole system has been improved.
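
If you want to check the arithmetic, here is a small C program that recomputes both schedules. The stage times and load count are the ones from the example above.

#include <stdio.h>

/* Recompute the laundry example: 10 loads through a wash (35 min),
   dry (45 min), fold (20 min) pipeline, sequentially and overlapped. */
int main(void) {
        int wash = 35, dry = 45, fold = 20;     /* minutes per stage */
        int loads = 10, items = 200;

        int sequential = loads * (wash + dry + fold);   /* 1000 min */
        /* Pipelined: the slowest stage (drying) dominates; add the fill
           time of the first wash and the drain time of the last fold. */
        int pipelined = wash + loads * dry + fold;      /* 505 min */

        printf("sequential: %d min, %.2f items/hour\n",
               sequential, 60.0 * items / sequential);
        printf("pipelined:  %d min, %.2f items/hour\n",
               pipelined, 60.0 * items / pipelined);
        printf("speedup:    %.2f\n", (double) sequential / pipelined);
        return 0;
}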

This system is a three-stage pipeline. Pipelining is a very old trick to increase performance for tasks that can be divided into independent stages. It's known in manufacturing as using an assembly line. It's how cars are made, for instance. Even though the latency of a single task pushed through the entire pipeline may become worse due to overhead, the throughput of the system is increased because the pipeline stages are overlapping, i.e. they are running in parallel.

Pipelined CPUs

Pipelining has been applied to CPUs since the IBM 7030 'Stretch', a project of the late 1950s. Once hardware budgets began to allow enough transistors on a single chip in the early 1980s, pipelined microprocessors began to appear. Let's consider a simple five-stage pipeline for a RISC microprocessor:
  1. IF: Instruction fetch. Fetch the instruction from memory at the address in the program counter (PC), and update the PC.
  2. ID: Instruction decode. Read the source registers named in the instruction from the register file. If the instruction is a jump, add the sign-extended PC-relative offset to the program counter.
  3. EX: Execute. For an ALU instruction, feed the ALU (arithmetic/logic unit) the register operands read in the previous stage and produce a result; for a memory operation, generate the effective address.
  4. MM: If the instruction is a load or store, access memory at the effective address generated in the previous stage.
  5. WB: Write values generated in the EX or MM stages back to the register file.
How does this help make things go faster? By helping to increase the clock frequency.
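
To see how instructions overlap in these stages, here is a small C sketch that prints the classic pipeline diagram. It assumes ideal conditions: one instruction enters the pipeline per cycle, with no stalls.

#include <stdio.h>

/* Print which stage each instruction occupies in each cycle.
   Instruction i enters IF in cycle i, so with no stalls it is in
   stage (c - i) during cycle c. */
int main(void) {
        const char *stage[] = { "IF", "ID", "EX", "MM", "WB" };
        int n = 4;                      /* number of instructions */

        for (int i = 0; i < n; i++) {
                printf("insn %d: ", i);
                for (int c = 0; c < n + 5 - 1; c++) {
                        int s = c - i;  /* stage index in cycle c */
                        printf("%-3s", (s >= 0 && s < 5) ? stage[s] : ".");
                }
                printf("\n");
        }
        return 0;
}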

Clock Frequency

Pipelining Increases the Clock Frequency

Imagine the circuitry of a simple processor. It must have gates that do all five stages of the five-stage pipeline I mentioned, even if it isn't pipelined. Signals must propagate from the beginning of the circuit through the maximum-depth path of the circuit before the clock can tick again. If we divide the circuitry into five independent and balanced stages, the length of this path is divided by five, so the clock frequency can be multiplied by five. If we can find a way to divide the work into ten stages, then the clock can be multiplied by ten. This is the way it works ideally; in practice the improvement is more modest. Some barriers to this "perfect" clock improvement are:
  1. The stages cannot be divided perfectly evenly, so the slowest stage determines the clock period.
  2. The pipeline registers (latches) between stages add their own delay to every cycle.
  3. Clock skew: the clock does not arrive at every latch at exactly the same instant, which eats further into the cycle.
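
Here is a back-of-the-envelope calculation showing how these barriers eat into the ideal speedup. All the delay numbers are made up for illustration.

#include <stdio.h>

/* An unpipelined datapath with 10ns of logic, split into five stages
   that cannot be perfectly balanced, plus latch overhead per stage. */
int main(void) {
        double unpipelined = 10.0;      /* ns of combinational logic */
        double slowest_stage = 2.4;     /* imbalance: more than 10/5 = 2.0 */
        double latch = 0.3;             /* pipeline register overhead */

        double ideal  = unpipelined / 5.0;      /* balanced, no latches */
        double actual = slowest_stage + latch;  /* real clock period */

        printf("ideal clock:   %.2f GHz\n", 1.0 / ideal);
        printf("actual clock:  %.2f GHz\n", 1.0 / actual);
        printf("clock speedup: %.2fx instead of 5x\n", unpipelined / actual);
        return 0;
}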

Why Pipelining Works: Instruction-Level Parallelism

Pipelines work because many instructions executing sequentially are doing independent tasks that can be done in parallel. This kind of parallelism is called instruction-level parallelism (ILP). In the simple pipeline we have seen, ILP can be hard to come by; however, there are many tricks people have invented for squeezing more ILP out of the instruction stream, like instruction reordering and speculation.
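
As a tiny source-level illustration of ILP, compare these two C functions. The first is a serial chain of dependences, where each addition needs the previous result; the second exposes two independent additions that the pipeline can overlap.

#include <stdio.h>

/* Serial chain: no ILP, each add waits for the one before it. */
int sum_chain(int a, int b, int c, int d) {
        int t = a + b;
        t = t + c;
        t = t + d;
        return t;
}

/* Tree: t0 and t1 are independent and can occupy the EX stage in
   back-to-back cycles. */
int sum_tree(int a, int b, int c, int d) {
        int t0 = a + b;
        int t1 = c + d;
        return t0 + t1;
}

int main(void) {
        printf("%d %d\n", sum_chain(1, 2, 3, 4), sum_tree(1, 2, 3, 4));
        return 0;
}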

Obstacles to Pipelining: Hazards

We have seen a few physical limitations to pipelining. However, the three main difficulties with pipelining have to do with the nature of the instruction stream being executed: structural hazards, in which two instructions need the same hardware resource at the same time; data hazards, in which an instruction needs a value that a previous instruction has not yet produced; and control hazards, in which the pipeline does not yet know which instruction to fetch next. These hazards can prevent a pipeline stage from correctly carrying out its purpose. Structural hazards can sometimes be solved by throwing more hardware at the problem, e.g., adding another functional unit, but it is usually not that simple because adding extra resources increases the amount of area and communication required.
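
As a concrete (and hypothetical) instance of a structural hazard, suppose instructions and data share a single memory port in our five-stage pipeline. The following C sketch finds the cycles in which a load or store's MM access collides with a later instruction's IF access; the instruction mix is invented for illustration.

#include <stdio.h>

/* Instruction i is fetched (IF) in cycle i and, if it is a load or
   store, touches memory again (MM) in cycle i+3.  With one memory
   port, that MM access collides with instruction (i+3)'s fetch. */
int main(void) {
        int is_mem[] = { 1, 0, 0, 1, 0, 0 };    /* loads/stores in the stream */
        int n = 6;

        for (int i = 0; i < n; i++) {
                if (!is_mem[i])
                        continue;
                int cycle = i + 3;              /* MM stage of instruction i */
                if (cycle < n)                  /* instruction `cycle` is in IF now */
                        printf("cycle %d: insn %d (MM) vs insn %d (IF)\n",
                               cycle, i, cycle);
        }
        return 0;
}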

Control hazards and other issues related to instruction fetch are so important that the entire next lecture will be devoted to them.

The rest of the lecture will be devoted to data hazards.

Dependences

As an introduction to data hazards, we will see the different ways that instructions can be dependent on one another. Note that dependences are a property of programs. Not all dependences will affect the pipeline; for pipelines, we are really interested in dependences within a small window of instructions. However, the compiler uses dependence information over a much larger region to produce more efficient code. When these dependences occur in such a way that they are exposed to the pipeline, three different types of data hazards may occur:
  1. RAW (read after write): an instruction tries to read a value before a previous instruction has written it. This is the hazard caused by a true dependence.
  2. WAR (write after read): an instruction tries to write a value before a previous instruction has read it. This is the hazard caused by an antidependence.
  3. WAW (write after write): an instruction tries to write a value before a previous instruction has written it, leaving the wrong value behind. This is the hazard caused by an output dependence.
(What about RAR? Two reads of the same value can happen in either order without changing the result, so read after read is not a hazard.)
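
Here is a contrived C fragment with all three kinds of dependences annotated; in a real machine, x and y would be registers rather than variables.

/* RAW, WAR, and WAW on the variable x. */
void deps(void) {
        int x, y, a = 1, b = 2, c = 3;
        x = a + b;      /* write x                                  */
        y = x + c;      /* read x:  RAW (true dependence) on x      */
        x = b + c;      /* write x after y read it: WAR on x, and a */
                        /* second write to x: WAW                   */
        (void) y; (void) x;
}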

Solutions to Data Hazards

The following solutions have been proposed and used for solving the problem of data hazards:
  1. Stalling: freeze the pipeline (insert bubbles) until the needed value is ready. This always works, but it costs cycles.
  2. Forwarding (also called bypassing): route a result from the pipeline register where it is produced directly to the stage that needs it, without waiting for it to be written back to the register file.
  3. Compiler scheduling: reorder independent instructions so that dependent instructions are far enough apart not to cause hazards.
Consider the following C function:
int a, b, c, d, e, f, g;

void foo (void) {
        d = e * f;
        a = b * c;
        g = a + d;
}
It may be compiled to something like this:
foo:
        movl e,%edx     // d := e * f
        imull f,%edx
        movl %edx,d

        movl b,%eax     // a := b * c
        imull c,%eax
        movl %eax,a

        addl %edx,%eax  // g := a + d
        movl %eax,g

        ret
What are some of the data dependences? How would some of these dependences manifest themselves as data hazards?
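
As a sketch of how hardware might detect the need to forward, here is a minimal C model. It assumes a MIPS-like three-operand instruction format rather than the two-operand x86 code above, and the names are invented for illustration: forward from the pipeline register where a result is produced whenever the producer's destination matches one of the consumer's sources.

#include <stdio.h>

/* A hypothetical three-operand instruction: dest := src1 op src2. */
struct insn { int dest, src1, src2; };

/* RAW detection: the consumer reads a register the producer writes. */
int needs_forwarding(struct insn producer, struct insn consumer) {
        return producer.dest == consumer.src1
            || producer.dest == consumer.src2;
}

int main(void) {
        struct insn mul = { 1, 2, 3 };  /* r1 := r2 * r3             */
        struct insn add = { 4, 1, 5 };  /* r4 := r1 + r5: reads r1   */
        printf("forward? %s\n", needs_forwarding(mul, add) ? "yes" : "no");
        return 0;
}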