Lecture 6: Wide Issue and Speculation

IPC

The goal is high performance. The means are high IPC and high clock rates. We get high clock rates through pipelining (as well as advances in process technology). However, pipelining hurts IPC because of pipeline hazards. To address this, we must find more parallelism. Even in the ideal case, the best IPC we can hope for in a single-issue processor is 1.0. We will see that by issuing multiple instructions every clock cycle, we can exceed that limit. We will also see that speculating across control dependences has the potential to increase parallelism and thus IPC.
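The relationship between IPC, clock rate, and execution time can be sketched with a small calculation. The function name and the specific numbers below are my own illustrative assumptions, not figures from the lecture:

```python
def exec_time(instructions, ipc, clock_hz):
    """Iron law of performance: time = instructions / (IPC * clock rate)."""
    return instructions / (ipc * clock_hz)

# Hypothetical workload: 1 billion instructions at a 3 GHz clock.
# A single-issue pipeline with hazards might sustain IPC ~0.8; a wide-issue
# machine might sustain IPC ~2.0 on the same code at the same clock rate.
t_single = exec_time(1e9, 0.8, 3e9)
t_wide = exec_time(1e9, 2.0, 3e9)
print(round(t_single / t_wide, 3))  # speedup of 2.5
```

Note that the speedup comes entirely from the IPC ratio (2.0 / 0.8) when the clock rate is held fixed.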

Wide Issue

Your book calls this technique multiple issue. Put simply, it means issuing more than one instruction in a clock cycle. There are many flavors of wide-issue processors; the table below, reproduced from your book on page 115, taxonomizes them.
 
Superscalar (static)
  Issue structure: dynamic
  Hazard detection: hardware
  Scheduling: static
  Distinguishing characteristic: in-order execution
  Examples: Sun UltraSPARC II/III, embedded MIPS and ARM/Intel XScale

Superscalar (dynamic)
  Issue structure: dynamic
  Hazard detection: hardware
  Scheduling: dynamic
  Distinguishing characteristic: some out-of-order execution
  Examples: IBM Power2

Superscalar (speculative)
  Issue structure: dynamic
  Hazard detection: hardware
  Scheduling: dynamic with speculation
  Distinguishing characteristic: out-of-order execution with speculation
  Examples: Intel Pentium 4, Intel Core, MIPS R12K, Compaq Alpha EV6, IBM Power5

VLIW/LIW
  Issue structure: static
  Hazard detection: software
  Scheduling: static
  Distinguishing characteristic: no hazards between issue packets
  Examples: i860

EPIC
  Issue structure: mostly static
  Hazard detection: mostly software
  Scheduling: mostly static
  Distinguishing characteristic: explicit dependences marked by compiler
  Examples: Intel Itanium, Intel Itanium2
The main idea is to fetch, decode, issue, and hopefully execute more than one instruction per clock cycle. In this way, we can increase IPC above 1.0. However, issuing multiple instructions per cycle adds complexity to the microarchitecture. It may have to be more deeply pipelined to sustain the same clock rate as a single-issue version. All manufacturers of general-purpose microprocessors have decided that the extra performance is worth the extra complexity. The table above lists the major flavors of n-issue processors.
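To make the dependence problem concrete, here is a sketch of the check an in-order dual-issue machine must make before pairing two instructions in one cycle. The tuple encoding and function name are my own simplification:

```python
# Each instruction is (dest, src1, src2), with register names as strings.
# Instruction b is younger than a; both would issue in the same cycle.
def can_dual_issue(a, b):
    dest_a = a[0]
    srcs_b = (b[1], b[2])
    # RAW: b reads what a writes, so b cannot issue alongside a in-order.
    if dest_a in srcs_b:
        return False
    # WAW: both write the same register; the writes must stay ordered.
    if dest_a == b[0]:
        return False
    # WAR within a packet is harmless in this simplification, since both
    # instructions read their operands at issue time.
    return True

print(can_dual_issue(("r1", "r2", "r3"), ("r4", "r5", "r6")))  # True
print(can_dual_issue(("r1", "r2", "r3"), ("r4", "r1", "r6")))  # False (RAW on r1)
```

A static superscalar does this check in hardware every cycle; a VLIW relies on the compiler to have ruled out such pairings ahead of time.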

Static vs. Dynamic

These ideas can be divided into two camps: static (compiler-driven) scheduling and dynamic (hardware-driven) scheduling. We have discussed instruction scheduling before, but wide issue is where it becomes critically important. We have to make sure that instructions issued in the same cycle have all their dependences satisfied, including dependences on earlier instructions and dependences on each other. The two camps highlight the central trade-off in wide-issue instruction scheduling:
  1. The compiler knows enough about the past and future to do a reasonable job of scheduling. With static scheduling, the compiler does a lot of work to figure out the schedule once. This work is amortized over every execution of the scheduled code, which for production systems can mean that the scheduling is essentially free.
  2. On the other hand, the microarchitecture potentially knows everything about the past, and can do a reasonable job of predicting the future, so it can do a better job of scheduling. In particular, the microarchitecture can deal with problems such as aliasing that are very difficult to deal with in the compiler (sometimes undecidable). However, now the scheduling work is being done all the time, on-line. This seems like a great waste of effort compared with static scheduling.
The fact that every type of wide-issue processor has existing examples shows that we as a community haven't yet decided how we want to do scheduling.
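As a toy illustration of the static camp, the sketch below moves an independent instruction into a load-use gap at compile time, the kind of transformation a static scheduler does once and amortizes over every execution. The instruction encoding and names are my own assumptions:

```python
# Each instruction is (op, dest, srcs), with srcs a tuple of register names.
def fill_load_delay(code):
    """If a load is immediately followed by its use, try to move a later
    independent instruction between them to hide the load-use latency."""
    out = list(code)
    for i in range(len(out) - 2):
        op, dest, _ = out[i]
        use_op, use_dest, use_srcs = out[i + 1]
        cand_op, cand_dest, cand_srcs = out[i + 2]
        if op == "load" and dest in use_srcs:
            independent = (dest not in cand_srcs           # no RAW on the load
                           and cand_dest not in use_srcs   # no anti-dependence
                           and use_dest not in cand_srcs   # no RAW on the use
                           and cand_dest != use_dest)      # no WAW
            if independent:
                out[i + 1], out[i + 2] = out[i + 2], out[i + 1]
    return out

sched = fill_load_delay([("load", "r1", ("r0",)),
                         ("add", "r2", ("r1", "r3")),
                         ("sub", "r4", ("r5", "r6"))])
print([ins[0] for ins in sched])  # ['load', 'sub', 'add']
```

Real compilers use far more general list-scheduling algorithms over a dependence graph; this one-instruction lookahead is only meant to show the flavor of the analysis.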

Hardware Speculation

We have seen how branch prediction can speed up instruction fetch and we have seen hints about how branch prediction can allow speculation. Here, we'll go into more detail about hardware speculation. Later on, we'll see how software can aid in speculation.

The basic idea is to treat branch predictions as if they are correct, and speculatively execute the resulting instructions. The speculations are verified, and if there is a branch misprediction something special happens to get rid of the mis-speculated instructions.

We do all this in the context of dynamic scheduling, i.e., out-of-order execution. However, there is now an extra phase in the algorithm, the commit phase. Instructions are committed when we are sure they were supposed to execute, i.e., when we know there is no misprediction. Results are held in the reorder buffer (ROB) until they are ready to be committed. The ROB can be thought of as a queue where instructions are dequeued in order: instructions are fetched in-order, processed out-of-order and placed into the ROB, and then "graduate" in-order again from the ROB. The entries in the reorder buffer (along with results at reservation stations) now form the virtual register file that architectural registers are renamed into.
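A minimal sketch of the ROB's queue discipline, assuming a much-simplified design of my own (a real ROB entry also tracks the destination register, exception state, whether the entry is a store, and so on):

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # instructions enter at the tail, in program order

    def issue(self, tag):
        """Allocate an entry at issue time, in fetch order."""
        self.entries.append({"tag": tag, "done": False, "value": None})

    def write_result(self, tag, value):
        """Record a completed result; completion may happen out of order."""
        for e in self.entries:
            if e["tag"] == tag:
                e["done"], e["value"] = True, value

    def commit(self):
        """Retire from the head only; stop at the oldest unfinished instruction."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["tag"])
        return committed

rob = ReorderBuffer()
for tag in ("i1", "i2", "i3"):
    rob.issue(tag)
rob.write_result("i3", 7)  # i3 finishes first...
print(rob.commit())        # ...but nothing commits while i1 is pending: []
rob.write_result("i1", 1)
rob.write_result("i2", 2)
print(rob.commit())        # now all three graduate in order: ['i1', 'i2', 'i3']
```

The key property this models is that completion order and commit order are decoupled: results may arrive in any order, but architectural state is only updated from the head of the queue.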

Here are the four phases:

  1. Issue. Take the next instruction from the instruction queue. If there is a free reservation station and a free ROB slot, issue it to both; otherwise, stall.
  2. Execute. Wait until both operands are available (monitoring the common data bus for results), then execute the operation.
  3. Write result. Broadcast the result on the common data bus to any waiting reservation stations and to the instruction's ROB entry.
  4. Commit. Instructions are processed from the ROB in order. There are three cases:
       - A normal instruction commits by writing its result from the ROB entry to the register file.
       - A store commits by writing its value to memory.
       - A mispredicted branch means the speculation was wrong: the ROB is flushed and execution restarts at the correct successor of the branch.
This is the clean, idealized version of speculative out-of-order execution. In practice, implementations work hard to detect mispredictions as early as possible, and flush only those instructions that were issued after the mis-speculated branch. Still, the cost of a mispredicted branch can be very high, a minimum of 31 cycles on the Pentium 4 and 14 cycles on Intel Core, so a good branch predictor is essential.
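The impact on effective CPI can be estimated with the standard penalty formula. The branch frequency and misprediction rate below are illustrative assumptions of mine; the 31-cycle penalty is the Pentium 4 figure from the text:

```python
def mispredict_cpi_penalty(branch_freq, mispredict_rate, penalty_cycles):
    """Extra cycles per instruction lost to branch mispredictions."""
    return branch_freq * mispredict_rate * penalty_cycles

# Assumed: 20% of instructions are branches, 5% of them mispredict,
# with the Pentium 4's 31-cycle minimum misprediction penalty.
print(round(mispredict_cpi_penalty(0.20, 0.05, 31), 2))  # 0.31
```

Even a 5% misprediction rate costs roughly a third of a cycle per instruction here, which is why predictor accuracy matters so much on deeply pipelined machines.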