Lecture 3: Pipelining and Instruction-Level Parallelism
Many things that we study in computer structures are of the following
form: Here's a great idea, but there are all these problems in actually
implementing it so let's spend 1% of the time admiring the great idea and
99% of the time figuring out how to get it to work. Pipelining is a great
example of this.
Motivating Example
The goal with pipelining is to speed things up. More specifically, we want
to increase the throughput of the system, where throughput is defined as
the amount of work performed per unit time. Let's look at a very simple
example of how pipelining does this:
The Laundry Example
Suppose the work we want to do is laundry. There are three main steps in
this process:
- (W) Wash the clothes in the washing machine (takes 35 minutes)
- (D) Dry the clothes in the dryer (takes 45 minutes)
- (F) Fold and store the clothes (takes 20 minutes)
We have a lot of clothes to wash, so not all clothes will fit into the
washing machine. Thus, we will have to divide the work into several loads.
One way to wash all the clothes is sequentially:
W-D-F-W-D-F-W-D-F-W-D-F-W-D-F.....
Suppose that the washer can accommodate 20 items of clothing in one load,
and that we have 200 items total to be washed. How long will it take to
wash them? We will have to do 10 loads, and each one will have a latency
of 100 minutes. Thus, it will take 1000 minutes to wash all the clothes.
This works out to an average of 0.2 items of clothing per minute, or 12
items per hour.
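The sequential-laundry arithmetic above can be checked with a short sketch (all the numbers come straight from the text):

```python
# Sequential laundry: every load runs wash, dry, and fold to completion
# before the next load starts.
import math

WASH, DRY, FOLD = 35, 45, 20   # minutes per step, per load
LOAD_SIZE = 20                 # items of clothing per load
TOTAL_ITEMS = 200

loads = math.ceil(TOTAL_ITEMS / LOAD_SIZE)       # 10 loads
latency_per_load = WASH + DRY + FOLD             # 100 minutes per load
sequential_time = loads * latency_per_load       # 1000 minutes total
throughput = TOTAL_ITEMS / sequential_time * 60  # 12 items per hour
```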
Observe that, while the dryer is drying clothes, the washer is idle.
Similarly, while we are folding the laundry, both the dryer and washer
are idle. Right after we wash the first load, we place it in the dryer
and can immediately start another load. When the dryer is finished, we
can immediately take out the clothes, place them on the folding table,
and then put the (now finished) load from the washer into the dryer.
The process looks like this:
W-W-W-W-W-W ...
  D-D-D-D-D-D ...
    F-F-F-F-F-F ...
The washing, drying, and folding proceed in parallel. How long will it
take to wash all 200 items? The longest phase is drying at 45 minutes,
and we will have to dry 10 loads, so the drying time will be 450 minutes.
We have to add the latency for the first wash (35 minutes) and the last
fold (20 minutes), during which no drying is occurring, which accounts for
another 55 minutes. So the total time is 505 minutes, or an average of
0.396 items of clothing per minute, or 23.76 items per hour. This rate
represents a speedup of almost a factor of two over the previous case.
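The pipelined schedule can be checked the same way: the dryer, as the slowest stage, sets the rate, and we add the first wash and the last fold at either end:

```python
# Pipelined laundry: the dryer is the bottleneck stage, so total time is
# first wash + one dry per load + last fold.
WASH, DRY, FOLD = 35, 45, 20   # minutes per step, per load
LOADS, TOTAL_ITEMS = 10, 200

pipelined_time = WASH + LOADS * DRY + FOLD       # 505 minutes
throughput = TOTAL_ITEMS / pipelined_time * 60   # ~23.76 items per hour
speedup = 1000 / pipelined_time                  # ~1.98x over sequential
```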
Note that we haven't actually sped up the laundering of a single item:
it still takes the same amount of time for a single shirt to enter and
leave the system. So, if you really must have that favorite blue shirt
in one hour, you are still out of luck. But the throughput of the whole
system has been improved.
This system is a three-stage pipeline. Pipelining is a very old trick to
increase performance for tasks that can be divided into independent stages.
It's known in manufacturing as using an assembly line. It's how cars
are made, for instance. Even though the latency of a single task pushed
through the entire pipeline may become worse due to overhead, the throughput
of the system is increased because the pipeline stages are overlapping,
i.e. they are running in parallel.
Pipelined CPUs
Pipelining has been applied to CPUs since the late 1950s, beginning with
the IBM 7030 'Stretch' processor. Once transistor budgets grew large enough
to fit a pipelined design on a single chip, around the early 1980s, pipelined
microprocessors began
to appear. Let's consider a simple five-stage pipeline for a RISC
microprocessor:
- IF: Instruction fetch. Fetch the instruction from memory at the address
in the program counter (PC), then update the PC.
- ID: Instruction decode. Read the source registers named in the
instruction from the register file. If the instruction is a jump,
add the sign-extended PC-relative offset to the program counter.
- EX: Execute the ALU instruction, or generate the effective address
for a memory operation. Feed the ALU (arithmetic/logic unit) the register
operands read in the previous stage and produce a result.
- MM: If the instruction is a load or store, access the memory through
the effective address generated in the previous stage.
- WB: Write register values generated in the EX or MM stages back to
the register file.
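The overlap of these stages can be sketched with a toy model (a deliberate simplification that ignores hazards entirely): with nothing in the way, instruction i occupies stage s during cycle i + s.

```python
# Toy model of the classic five-stage pipeline with no hazards:
# instruction i runs stage s in cycle i + s, so in steady state all
# five stages are busy with five different instructions.
STAGES = ["IF", "ID", "EX", "MM", "WB"]

def schedule(n_instructions):
    """Map each instruction to the cycle in which it runs each stage."""
    return [{stage: i + s for s, stage in enumerate(STAGES)}
            for i in range(n_instructions)]

sched = schedule(4)
# Instruction 0 writes back in cycle 4; n instructions finish in n + 4 cycles.
```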
How does this help make things go faster? By helping to increase
the clock frequency.
Clock Frequency
- The clock. A CPU, like many other kinds of digital circuits, marches
through tasks to the beat of the clock. Every time the clock ticks,
a new set of events occurs: some results are generated, some values are
transmitted across busses, etc.
- The clock frequency. The clock frequency tells us how often the clock
can tick. The faster the clock frequency is, the higher the throughput
of the microprocessor will be. Clock frequency is measured in cycles
per second (hertz), or, these days, billions of cycles per second (GHz).
- The clock period. The clock period is simply the inverse of the
clock frequency. It's measured in seconds per clock cycle, or, these days,
picoseconds per clock. It tells us the maximum gate delay that any pipeline
stage may have, including latch delay for the special registers that buffer
results from one stage to the next.
- Gate depth and delay. The clock frequency depends on the depth of
the circuits being clocked. If signals must propagate serially through many
logic gates in a single clock cycle, the clock will be slower than if
there are only a few gates in series.
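As a worked example of the relationship between gate depth, clock period, and clock frequency (every number here is made up for illustration, not a measurement of any real design):

```python
# Hypothetical numbers: each gate adds 10 ps of delay, the deepest path
# through a stage is 25 gates, and the pipeline latch adds 30 ps.
GATE_DELAY_PS = 10   # assumed delay per logic gate
DEPTH = 25           # gates in series on the critical path
LATCH_PS = 30        # pipeline-register (latch) overhead

period_ps = DEPTH * GATE_DELAY_PS + LATCH_PS   # 280 ps per cycle
freq_ghz = 1000 / period_ps                    # ~3.57 GHz
```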
Pipelining Increases the Clock Frequency
Imagine the circuitry of a simple processor. It must have gates that
do all five stages of the five-stage pipeline I mentioned, even if it
isn't pipelined. The clock signal must flow from the beginning of the
circuit through the maximum-depth path of the circuit before the clock can
tick again. If we divide the circuitry into five independent and balanced
stages, the length of this path is divided by five, so the clock frequency
can be multiplied by five. If we can find a way to divide the work into
ten stages, then the clock can be multiplied by ten. This is the way it
works ideally; in practice improvement is more modest. Some barriers to
this "perfect" clock improvement are:
- Finding balance. It's difficult to divide the work of executing
instructions into n stages that each have exactly the same gate
delay, namely 1/n of the original design's delay. The clock
frequency is limited by the delay of the deepest stage.
- Latch delay. Pipeline implementation includes latches or
pipeline registers between each stage that communicate results
from one stage to the next. As pipelines become deeper and clock rates
increase, the delay of these latches becomes a significant component of
the clock period.
- Power. As the clock rate increases, the number of switching events
per second in the processor increases. Each switching event consumes a
certain amount of energy and releases a certain amount of heat. We need to
make sure that at any instant, the power supply is capable of supplying the
energy and the heat being generated can be efficiently dissipated through
the package and out of the equipment. Improvements in cooling and power
supplies (e.g. batteries) come much more slowly than improvements in clock
rate, so power and energy limit clock rates in today's processors.
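The first two barriers can be made concrete with a small sketch (all delays here are illustrative): the clock period is set by the deepest stage plus latch delay, so an n-way split rarely delivers an n-fold clock improvement.

```python
# Why a 5-stage split does not give a 5x clock: imbalance and latch delay.
# Delays are in picoseconds and purely illustrative.
UNPIPELINED_DELAY = 1000                    # delay through the whole datapath
LATCH_DELAY = 30                            # per pipeline register
stage_delays = [230, 180, 220, 170, 200]    # an imperfectly balanced split

period = max(stage_delays) + LATCH_DELAY    # 260 ps, not the ideal 200 ps
clock_speedup = UNPIPELINED_DELAY / period  # ~3.85x, short of the ideal 5x
```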
Why Pipelining Works: Instruction-Level Parallelism
Pipelines work because many instructions executing sequentially are doing
independent tasks that can be done in parallel. This kind of parallelism
is called instruction-level parallelism (ILP). In the simple
pipeline we have seen, ILP can be hard to come by; however, there are many
tricks people have invented for squeezing more ILP out of the instruction
stream, like instruction reordering and speculation.
Obstacles to Pipelining: Hazards
We have seen a few physical limitations to pipelining. However, the
three main difficulties with pipelining have to do with the nature of the
instruction stream being executed. These hazards can prevent
a pipeline stage from correctly carrying out its purpose.
- Structural hazards. These occur when instructions contend for the
same resources in the CPU. For instance, if the register file has only
one write port, but for some reason the instruction stream has generated
two writes to the register file in a single cycle, one of the offending
pipeline stages will have to wait. Structural hazards can often be solved
by throwing more hardware at the problem, with the penalty of increased
gate count, complexity, and possibly delay.
- Data hazards. This happens when an instruction in the pipeline
depends on data from another instruction that is also in the pipeline.
For instance, consider these two instructions:
i add r1, r2, r3 // r1 := r2 + r3
i+1 add r4, r1, r5 // r4 := r1 + r5
r1 is needed by instruction i+1, but the value of r1 is modified by
instruction i and won't be written back to the register file before
instruction i+1 reads its operands. There are many techniques
for solving this problem. Forwarding (or bypass) is
the main technique.
- Control hazards. This happens when a control-flow transfer instruction
depends on results that are not ready yet. For instance, every conditional
branch presents a control flow hazard, since the condition isn't available
in time to fetch the next instruction from the right place.
Structural hazards can sometimes be solved by throwing more hardware at
the problem, e.g., adding another functional unit, but it is usually not
that simple because adding extra resources increases the amount of area
and communication required.
Control hazards and other issues related to instruction fetch are so
important that the entire next lecture will be devoted to them.
The rest of the lecture will be devoted to data hazards.
Dependences
As an introduction to data hazards, we will see the different ways that
instructions can be dependent on one another. Note that dependences
are a property of programs. Not all dependences will affect the pipeline;
we are really interested in dependence in a small window of instructions
for pipelines. However, the compiler uses dependence information over a
much larger region to produce more efficient code.
- Data dependence. Also called true dependence or
flow dependence. This is where data needed by one instruction is
produced by a previous instruction, or where data needed by one instruction
flows through a chain of dependent instructions from some source.
- Name dependences. This type of dependence occurs when
two instructions use the same register or memory location, but there is
no flow of data between the instructions. For instance, two instructions
in the pipeline may both use r1 for unrelated temporary
computations. There are two types of name dependences:
- Anti-dependence. This occurs when one instruction
writes a register that an earlier instruction reads. For
instance:
add r1, r2, r3 // r1 := r2 + r3
add r3, r4, r5 // r3 := r4 + r5
There is an anti-dependence between the two instructions because the
second instruction writes a register r3 that is used by
the first instruction. The processor must guarantee that the first
instruction reads the correct value before the second instruction
overwrites it.
- Output dependence. This occurs when two instructions
both write the same register. The processor must guarantee that
the register ends up with the value from the second instruction.
For instance:
add r3, r2, r1 // r3 := r2 + r1
st r3, 0(r6) // store r3 to memory address r6
add r3, r4, r5 // r3 := r4 + r5
Here, r3 is being used for two different purposes.
We must ensure that, at the end of this code, the register file
entry for r3 is updated with the result from the second
add instruction.
When these dependences occur in such a way that they are exposed to the
pipeline, three different types of data hazards may occur:
- RAW, or read-after-write. An instruction tries to read an operand
before a previous instruction has a chance to write it. This is caused
by a true dependence.
- WAW, or write-after-write. An instruction tries to write an operand
before a previous instruction has a chance to write it. Once the previous
instruction writes it, the operand is left with the previous and now wrong
value. This is caused by an output dependence.
- WAR, or write-after-read. An instruction writes to an operand before
it can be read by a previous instruction, so the previous instruction
incorrectly gets the new value. This is caused by an anti-dependence.
(What about RAR?)
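The three hazard types can be identified mechanically from the read and write sets of a pair of instructions, which is how a pipeline's hazard-detection logic works in principle. A minimal sketch (the register names are just the examples from the text):

```python
# Classify the hazards between an earlier and a later instruction from
# the sets of registers each one reads and writes.
def hazards(earlier_writes, earlier_reads, later_writes, later_reads):
    found = set()
    if earlier_writes & later_reads:
        found.add("RAW")   # true (flow) dependence
    if earlier_reads & later_writes:
        found.add("WAR")   # anti-dependence
    if earlier_writes & later_writes:
        found.add("WAW")   # output dependence
    return found

# The RAW example from the text: add r1,r2,r3 followed by add r4,r1,r5.
raw = hazards({"r1"}, {"r2", "r3"}, {"r4"}, {"r1", "r5"})
```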
Solutions to Data Hazards
The following solutions have been proposed and used for solving the problem
of data hazards:
- Pipeline stall cycles. Can resolve any type of hazard. Freeze the
pipeline up to the dependent stage until the hazard is resolved. Example:
add r1, r2, r3
add r4, r1, r5
Cycles----->
 ________________________                 Instructions
|_IF_|_ID_|_EX_|_MM_|_WB_|                add r1, r2, r3
     |_IF_|_x_x_x_x_|_ID_|_EX_|_MM_|_WB_| add r4, r1, r5
           stall cycles
Once all the dependences of an instruction are satisfied, we issue
the instruction, i.e., we allow it to proceed.
- Forwarding (bypass). If the data is available elsewhere in
the pipeline, then there is no need to stall. When the dependence is
detected, the data is forwarded directly to the consuming pipeline stage.
This reduces stall cycles, and sometimes eliminates them entirely.
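Stall counts with and without forwarding can be sketched in a small model. Two assumptions here are mine rather than the text's: the register file cannot write and read the same register in one cycle, and forwarding provides an EX-to-EX bypass (so the exact counts may differ from a diagram drawn under other assumptions).

```python
# Stall cycles for a consumer issued `distance` instructions after its
# producer in the five-stage pipeline (distance=1 means back-to-back).
STAGE = {"IF": 0, "ID": 1, "EX": 2, "MM": 3, "WB": 4}

def stalls(distance, forwarding):
    if forwarding:
        # EX-to-EX bypass: result usable the cycle after the producer's EX,
        # needed when the consumer enters EX.
        ready = STAGE["EX"] + 1
        needed = distance + STAGE["EX"]
    else:
        # Value usable only the cycle after the producer's WB, needed when
        # the consumer reads registers in ID.
        ready = STAGE["WB"] + 1
        needed = distance + STAGE["ID"]
    return max(0, ready - needed)

no_fwd = stalls(1, forwarding=False)   # back-to-back pair must wait for WB
with_fwd = stalls(1, forwarding=True)  # bypass removes the stalls
```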
Consider the following C function:
int a, b, c, d, e, f, g;
void foo (void) {
d = e * f;
a = b * c;
g = a + d;
}
It may be compiled to something like this:
foo:
movl e,%edx // d := e * f
imull f,%edx
movl %edx,d
movl b,%eax // a := b * c
imull c,%eax
movl %eax,a
addl %edx,%eax // g := a + d
movl %eax,g
ret
What are some of the data dependences?
How would some of these dependences manifest themselves as data hazards?
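One way to start exploring these questions is to scan instruction pairs mechanically. The sketch below models each instruction in the compiled sequence by the locations it reads and writes (registers and the globals alike). Note that this naive pairwise scan ignores intervening redefinitions, so it over-reports some dependences; it is a starting point, not a full analysis.

```python
# Each entry: (instruction text, set of locations read, set of locations
# written), following the compiled sequence for foo() above.
instrs = [
    ("movl e,%edx",    {"e"},          {"edx"}),
    ("imull f,%edx",   {"f", "edx"},   {"edx"}),
    ("movl %edx,d",    {"edx"},        {"d"}),
    ("movl b,%eax",    {"b"},          {"eax"}),
    ("imull c,%eax",   {"c", "eax"},   {"eax"}),
    ("movl %eax,a",    {"eax"},        {"a"}),
    ("addl %edx,%eax", {"edx", "eax"}, {"eax"}),
    ("movl %eax,g",    {"eax"},        {"g"}),
]

deps = []
for i, (_, reads_i, writes_i) in enumerate(instrs):
    for j in range(i + 1, len(instrs)):
        _, reads_j, writes_j = instrs[j]
        if writes_i & reads_j:
            deps.append(("RAW", i, j))   # true dependence
        if reads_i & writes_j:
            deps.append(("WAR", i, j))   # anti-dependence
        if writes_i & writes_j:
            deps.append(("WAW", i, j))   # output dependence
```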