Lecture 5: Out-of-order Execution

Review of the Model

The following diagram roughly represents the general model you should have in your mind of the main components of the computer:





Instruction Scheduling

Before we talk about out-of-order execution, let's remember how execution proceeds in our standard pipeline. Instructions are fetched, decoded, executed, etc. The decode stage is where we find out about structural, data, and control hazards. The hardware does what it can to minimize the impact of these hazards, but it is really up to the compiler to schedule dependent instructions far apart from each other to avoid hazards. For instance, the compiler should "hoist" loads as early as possible so that dependent instructions don't stall on RAW hazards. Also, the compiler should avoid scheduling instructions close together that will compete for some limited resource, such as a multiplier unit, to avoid a structural hazard.

Unfortunately, the compiler is limited by several factors: it cannot see dynamic information such as which loads will miss in the cache, it has only a limited number of architectural registers to work with, and it cannot always move instructions across branches. So, it might be better to have the job of instruction scheduling done in the microarchitecture, or shared between the microarchitecture and the compiler.
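The payoff of static scheduling can be pictured with a toy model. The sketch below is an illustration only: the 2-cycle load-use delay and the instruction names are made up, and the model ignores everything except that delay. It compares a schedule with the dependent add right after the load against one where an independent instruction is hoisted in between.

```python
# Toy model of static (compile-time) instruction scheduling on a
# single-issue pipeline. The 2-cycle load-use delay and the instruction
# names are invented for this illustration.

LOAD_USE_DELAY = 2  # cycles before a load's result can be consumed

def stalls(schedule, producers):
    """Return the cycle at which each instruction issues, stalling any
    instruction whose source value (from a load) is not yet ready."""
    issue_cycle = {}
    cycle = 0
    for instr in schedule:
        for src in producers.get(instr, []):
            ready = issue_cycle[src] + LOAD_USE_DELAY
            cycle = max(cycle, ready)   # stall until the operand is ready
        issue_cycle[instr] = cycle
        cycle += 1
    return issue_cycle

producers = {"add": ["ld"]}   # "add" consumes the load's result

print(stalls(["ld", "add", "mul"], producers))  # dependent add right after ld
print(stalls(["ld", "mul", "add"], producers))  # independent mul hoisted between
```

In the first schedule the add waits out the load-use delay and pushes everything later; in the second, the independent mul fills the delay slot and the whole sequence finishes a cycle earlier.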

Out-of-order Execution

The pipelines we have studied so far have been statically scheduled, in-order pipelines. That is, instructions are executed in program order. If a hazard causes stall cycles, then all instructions behind the offending instruction are stalled until the hazard is gone. As we have seen, forwarding, branch prediction, and other techniques can reduce the number of stall cycles, but sometimes a stall is unavoidable. For instance, consider the following code:
	ld	r1, 0(r2)	// load r1 from memory at r2
	add	r2, r1, r3	// r2 := r1 + r3
	add	r4, r3, r5	// r4 := r3 + r5
Suppose that r3 and r5 are ready in the register file. Suppose also that the load instruction misses in the L1 data cache, so the load unit takes about 20 cycles to bring the data from the L2 cache. While the load unit is working, the pipeline is stalled. Notice, however, that the second add instruction doesn't depend on the value of r1; it could issue and execute, but our in-order pipeline prevents that because of the stall. The functional unit that should be adding r3 and r5 together instead sits idle, waiting for the load to complete so the add instruction can be decoded and issued.

Out-of-order execution, or dynamic scheduling, is a technique used to recover some of that wasted execution bandwidth. With out-of-order execution (OoO for short), the processor still issues instructions in program order, but each instruction then enters a new pipeline stage called "read operands," during which any instruction whose operands are available may move to the execution stage, regardless of its order in the program. The term issue can be redefined at this point to mean "issue and read operands."
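The example above can be made concrete with a rough timing sketch. This is a toy model, not a real pipeline: it tracks only when each instruction can start executing, assumes the 20-cycle load miss from the text, treats r3 and r5 as already available, and ignores issue bandwidth in the out-of-order case.

```python
# Toy comparison of in-order vs. out-of-order execution start times for
# the three-instruction example (ld r1; add r2,r1,r3; add r4,r3,r5).
# Latencies are simplified; r3 and r5 are assumed ready, so only r1 matters.

LOAD_LATENCY = 20   # L1 miss serviced from L2, as in the text
ADD_LATENCY = 1

instrs = [
    ("ld",   "r1", set()),     # ld  r1, 0(r2)   -- address regs assumed ready
    ("add1", "r2", {"r1"}),    # add r2, r1, r3  -- depends on the load
    ("add2", "r4", set()),     # add r4, r3, r5  -- independent of the load
]

def finish_times(in_order):
    ready = {}          # register -> cycle its value becomes available
    done = {}
    prev_start = -1
    for name, dest, srcs in instrs:
        start = max((ready[s] for s in srcs), default=0)
        if in_order:
            # a single-issue, in-order machine cannot start an instruction
            # before the one ahead of it has started
            start = max(start, prev_start + 1)
        lat = LOAD_LATENCY if name == "ld" else ADD_LATENCY
        done[name] = start + lat
        ready[dest] = done[name]
        prev_start = start
    return done

print(finish_times(in_order=True))
print(finish_times(in_order=False))
```

In the in-order run, the independent add2 finishes around cycle 22 because it is stuck behind the stalled add1; out of order, it finishes almost immediately while the load miss is still outstanding.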

Implementation of Out-of-order Execution

To implement an OoO processor, the pipeline has to be enhanced to keep track of the extra complexity. For instance, now that we can reorder instructions, we can have WAR and WAW hazards that we didn't have to worry about with an in-order pipeline. This diagram illustrates the basic idea:





Several tricks are used to manage this complexity: registers are renamed to eliminate WAR and WAW hazards, reservation stations buffer instructions waiting for their operands, and a common data bus (CDB) broadcasts results to waiting instructions. The new pipeline is divided into three phases, each of which could take a number of clock cycles:
    (This stuff is all from Chapter 2)
  1. Issue. Take the next instruction from the instruction queue. If a reservation station for the right kind of functional unit is free, issue the instruction to that station, copying the values of any source operands that are already available in the registers. If an operand is not yet available, record instead the tag of the reservation station that will produce it; this renames the register, eliminating WAR and WAW hazards. If no suitable reservation station is free, stall on the structural hazard.
  2. Execute. At the reservation station for this instruction, the following actions may be taken: while any operand is still missing, monitor the CDB and latch the value when it is broadcast; once all operands are present and the functional unit is free, execute the operation. Loads and stores first compute their effective address.
  3. Write result. Once the result of an executed instruction becomes available, broadcast it over the CDB. Reservation stations that are waiting for the result of this instruction may make forward progress, and the result is written back to the register file. During this phase, stores to memory are also executed.
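The three phases can be sketched as a tiny simulator. This is a heavily simplified illustration, not a real design: there is one reservation station per instruction, every operation is a stub "add," and the register-status table qi maps each register to the tag of its producer (0 meaning the value is ready in the register file).

```python
# Minimal sketch of Tomasulo-style issue / execute / write-result,
# with invented names and a stub ALU (every op just adds its operands).

def tomasulo(program, regs):
    """program: list of (dest, src1, src2) register names."""
    qi = {r: 0 for r in regs}       # register status: producing tag, 0 = ready
    stations = {}                   # tag -> reservation-station record

    # --- Issue (in program order): read ready operands, else record tags ---
    for tag, (dest, s1, s2) in enumerate(program, start=1):
        stations[tag] = {
            "qj": qi[s1], "vj": regs[s1] if qi[s1] == 0 else None,
            "qk": qi[s2], "vk": regs[s2] if qi[s2] == 0 else None,
            "dest": dest,
        }
        qi[dest] = tag              # rename: later readers wait on this tag

    # --- Execute + Write result: fire any station whose operands are ready,
    #     then broadcast (tag, value) so waiting stations fill in ---
    while stations:
        ready = [t for t, rs in stations.items()
                 if rs["qj"] == 0 and rs["qk"] == 0]
        for t in ready:
            rs = stations.pop(t)
            value = rs["vj"] + rs["vk"]      # stub ALU
            if qi[rs["dest"]] == t:          # only the newest writer updates
                regs[rs["dest"]] = value     # the register (handles WAW)
                qi[rs["dest"]] = 0
            for other in stations.values():  # CDB broadcast to consumers
                if other["qj"] == t: other["qj"], other["vj"] = 0, value
                if other["qk"] == t: other["qk"], other["vk"] = 0, value
    return regs

# r3 = r1 + r2; then r4 = r3 + r1 (second instruction waits on the first)
print(tomasulo([("r3", "r1", "r2"), ("r4", "r3", "r1")],
               {"r1": 1, "r2": 2, "r3": 0, "r4": 0}))
```

Note how the second instruction issues immediately with Qj pointing at the first station, then captures the broadcast value before executing.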
Note that this scheme is non-speculative. Branch prediction is used to fetch instructions, but instructions are not executed until all of their dependences are satisfied, including control dependences. So there is no problem with instructions fetched down the wrong path; they are never executed, because they are discarded once the branches they depend on are resolved.

Reservation Stations

You can think of the reservation stations as structures or records in some program. Each functional unit might have several reservation stations forming a sort of queue where instructions sit and wait for their operands to become available and for the functional unit to become available. The components of a reservation station for an instruction whose source operands are Sj and Sk are: Busy, indicating whether the station is in use; Op, the operation to perform; Vj and Vk, the values of the source operands, if they are available; Qj and Qk, the tags of the reservation stations that will produce Sj and Sk, with zero meaning the value is already present in Vj or Vk; and A, which holds effective-address information for loads and stores. In addition, each register in the register file has an entry, Qi, that gives the tag of the reservation station holding the instruction whose result should be stored into that register. If Qi is zero, then the value in the register file is the actual value of that register, i.e., the register is not renamed at that point.
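Taking "structures or records in some program" literally, here is one way the record could look, using the field names from the text (Busy, Op, Vj/Vk, Qj/Qk, A). The class name and the sample values are invented for illustration.

```python
# A reservation station as a record, with the fields described above.
# Tag 0 means "the operand value is already in Vj/Vk"; any other tag
# names the station that will produce it.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str                    # e.g. "Add1", "Load2"
    busy: bool = False
    op: Optional[str] = None     # operation to perform (Load, Add, Mul, ...)
    vj: Optional[float] = None   # value of first source, if available
    vk: Optional[float] = None   # value of second source, if available
    qj: int = 0                  # tag of station producing Vj (0 = ready)
    qk: int = 0                  # tag of station producing Vk (0 = ready)
    a: Optional[int] = None      # effective-address info for loads/stores

    def ready(self) -> bool:
        """Both operands present: the station can request its functional unit."""
        return self.busy and self.qj == 0 and self.qk == 0

# Qi: one entry per architectural register; 0 means "not renamed".
qi = {"f0": 0, "f2": 0, "f4": 0}

rs = ReservationStation(name="Add1", busy=True, op="Sub", vk=7.5, qj=2)
print(rs.ready())   # still waiting on tag 2 to deliver Vj
```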

Example

Let's look at an example from the book, on page 99. Consider the following code:
	ld	f6,34(r2)	// f6 := memory at r2 + 34
	ld	f2,45(r3)	// f2 := memory at r3 + 45
	mul	f0,f2,f4	// f0 := f2 * f4
	sub	f8,f2,f6	// f8 := f2 - f6
	div	f10,f0,f6	// f10 := f0 / f6
	add	f6,f8,f2	// f6 := f8 + f2
Let's look at what the reservation stations will look like once the first load has completed. The second load has done its effective address computation, but is still waiting to use the memory unit. Rather than using numbers for the reservation station tags, we'll use a combination of names and numbers, e.g., Add3.

Here is how the reservation stations would look:

Name	Busy	Op	Vj	Vk			Qj	Qk	A
Load1	no
Load2	yes	Load						45 + Regs[r3]
Add1	yes	Sub		Mem[34+Regs[r2]]	Load2	0
Add2	yes	Add				Add1	Load2
Add3	no
Mult1	yes	Mul		Regs[f4]		Load2	0
Mult2	yes	Div		Mem[34+Regs[r2]]	Mult1	0
And here is how the Qi field for the floating point register file would look:
Field	F0	F2	F4	F6	F8	F10	...
Qi	Mult1	Load2		Add2	Add1	Mult2
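To see how this state evolves, consider what happens when Load2 finally writes its result: every station (and register) tagged Load2 latches the value off the CDB. The sketch below mirrors the tables above, using strings like "Mem[45+Regs[r3]]" as symbolic stand-ins for the actual loaded values; the broadcast function is an illustration, not a real design.

```python
# Reservation stations and register status, as in the example tables above.
stations = {
    "Add1":  {"busy": True, "op": "Sub", "vj": None,
              "vk": "Mem[34+Regs[r2]]", "qj": "Load2", "qk": 0},
    "Add2":  {"busy": True, "op": "Add", "vj": None, "vk": None,
              "qj": "Add1", "qk": "Load2"},
    "Mult1": {"busy": True, "op": "Mul", "vj": None,
              "vk": "Regs[f4]", "qj": "Load2", "qk": 0},
    "Mult2": {"busy": True, "op": "Div", "vj": None,
              "vk": "Mem[34+Regs[r2]]", "qj": "Mult1", "qk": 0},
}
qi = {"f0": "Mult1", "f2": "Load2", "f4": 0, "f6": "Add2",
      "f8": "Add1", "f10": "Mult2"}

def broadcast(tag, value):
    """Write result: deliver (tag, value) over the CDB to all consumers."""
    for rs in stations.values():
        if rs["qj"] == tag: rs["qj"], rs["vj"] = 0, value
        if rs["qk"] == tag: rs["qk"], rs["vk"] = 0, value
    for reg, producer in qi.items():
        if producer == tag:
            qi[reg] = 0          # register now holds the real value

broadcast("Load2", "Mem[45+Regs[r3]]")
print(stations["Add1"]["qj"], stations["Add2"]["qk"], qi["f2"])
```

After the broadcast, Add1 and Mult1 have both operands and can compete for their functional units; Add2 still waits on Add1, and Mult2 on Mult1, exactly as the dependence chains in the code dictate.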