Lecture 6: Wide Issue and Speculation

IPC

The goal is high performance. The means are high IPC and high clock rates. We get high clock rates through pipelining (as well as advances in process technology). However, pipelining hurts IPC because of pipeline hazards. To address this, we must find more parallelism. Even in the ideal case, the best IPC we can hope for in a single-issue processor is 1.0. We will see that by issuing multiple instructions every clock cycle, we can exceed that limit. We will also see that speculating across control dependences has the potential to increase parallelism and thus IPC.
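The relationship between IPC, clock rate, and execution time can be sketched with a small calculation. The function name and the specific numbers below are my own illustrative assumptions, not figures from the lecture:

```python
def exec_time(instructions, ipc, clock_hz):
    """Iron law of performance: time = instructions / (IPC * clock rate)."""
    return instructions / (ipc * clock_hz)

# Hypothetical workload: 1 billion instructions at a 3 GHz clock.
# A single-issue pipeline with hazards might sustain IPC ~0.8; a wide-issue
# machine might sustain IPC ~2.0 on the same code at the same clock rate.
t_single = exec_time(1e9, 0.8, 3e9)
t_wide = exec_time(1e9, 2.0, 3e9)
print(round(t_single / t_wide, 3))  # speedup of 2.5
```

Note that the speedup comes entirely from the IPC ratio (2.0 / 0.8) when the clock rate is held fixed.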

Wide Issue

Your book calls this technique multiple issue. Put simply, it means issuing more than one instruction in a clock cycle. There are many flavors of wide-issue processors; the table below, reproduced from your book on page 115, taxonomizes them.
 
Superscalar (static)
  Issue structure: dynamic
  Hazard detection: hardware
  Scheduling: static
  Distinguishing characteristic: in-order execution
  Examples: Sun UltraSPARC II/III, embedded MIPS and ARM/Intel XScale

Superscalar (dynamic)
  Issue structure: dynamic
  Hazard detection: hardware
  Scheduling: dynamic
  Distinguishing characteristic: some out-of-order execution
  Examples: IBM Power2

Superscalar (speculative)
  Issue structure: dynamic
  Hazard detection: hardware
  Scheduling: dynamic with speculation
  Distinguishing characteristic: out-of-order execution with speculation
  Examples: Intel Pentium 4, Intel Core, MIPS R12K, Compaq Alpha EV6, IBM Power5

VLIW/LIW
  Issue structure: static
  Hazard detection: software
  Scheduling: static
  Distinguishing characteristic: no hazards between issue packets
  Examples: i860

EPIC
  Issue structure: mostly static
  Hazard detection: mostly software
  Scheduling: mostly static
  Distinguishing characteristic: explicit dependences marked by compiler
  Examples: Intel Itanium, Intel Itanium2
The main idea is to fetch, decode, issue, and hopefully execute more than one instruction per clock cycle. In this way, we can increase IPC above 1.0. However, issuing multiple instructions per cycle adds complexity to the microarchitecture. It may have to be more deeply pipelined to sustain the same clock rate as a single-issue version. All manufacturers of general-purpose microprocessors have decided that the extra performance is worth the extra complexity. The table above lists the major flavors of n-issue processors.
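To make the dependence problem concrete, here is a sketch of the check an in-order dual-issue machine must make before pairing two instructions in one cycle. The tuple encoding and function name are my own simplification:

```python
# Each instruction is (dest, src1, src2), with register names as strings.
# Instruction b is younger than a; both would issue in the same cycle.
def can_dual_issue(a, b):
    dest_a = a[0]
    srcs_b = (b[1], b[2])
    # RAW: b reads what a writes, so b cannot issue alongside a in-order.
    if dest_a in srcs_b:
        return False
    # WAW: both write the same register; the writes must stay ordered.
    if dest_a == b[0]:
        return False
    # WAR within a packet is harmless in this simplification, since both
    # instructions read their operands at issue time.
    return True

print(can_dual_issue(("r1", "r2", "r3"), ("r4", "r5", "r6")))  # True
print(can_dual_issue(("r1", "r2", "r3"), ("r4", "r1", "r6")))  # False (RAW on r1)
```

A static superscalar does this check in hardware every cycle; a VLIW relies on the compiler to have ruled out such pairings ahead of time.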

Static vs. Dynamic

These ideas can be divided into two camps: static (compiler-driven) scheduling and dynamic (hardware-driven) scheduling. We have discussed instruction scheduling before, but wide issue is where it becomes critically important. We have to make sure that instructions issued in the same cycle have all their dependences satisfied, including dependences on earlier instructions and dependences on each other. The two camps highlight the central trade-off in wide-issue instruction scheduling:
  1. The compiler knows enough about the past and future to do a reasonable job of scheduling. With static scheduling, the compiler does a lot of work to figure out the schedule once. This work is amortized over every execution of the scheduled code, which for production systems can mean that the scheduling is essentially free.
  2. On the other hand, the microarchitecture potentially knows everything about the past, and can do a reasonable job of predicting the future, so it can do a better job of scheduling. In particular, the microarchitecture can deal with problems such as aliasing that are very difficult to deal with in the compiler (sometimes undecidable). However, now the scheduling work is being done all the time, on-line. This seems like a great waste of effort compared with static scheduling.
The fact that every type of wide-issue processor has existing examples shows that we as a community haven't yet decided how we want to do scheduling.
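As a toy illustration of the static camp, the sketch below moves an independent instruction into a load-use gap at compile time, the kind of transformation a static scheduler does once and amortizes over every execution. The instruction encoding and names are my own assumptions:

```python
# Each instruction is (op, dest, srcs), with srcs a tuple of register names.
def fill_load_delay(code):
    """If a load is immediately followed by its use, try to move a later
    independent instruction between them to hide the load-use latency."""
    out = list(code)
    for i in range(len(out) - 2):
        op, dest, _ = out[i]
        use_op, use_dest, use_srcs = out[i + 1]
        cand_op, cand_dest, cand_srcs = out[i + 2]
        if op == "load" and dest in use_srcs:
            independent = (dest not in cand_srcs           # no RAW on the load
                           and cand_dest not in use_srcs   # no anti-dependence
                           and use_dest not in cand_srcs   # no RAW on the use
                           and cand_dest != use_dest)      # no WAW
            if independent:
                out[i + 1], out[i + 2] = out[i + 2], out[i + 1]
    return out

sched = fill_load_delay([("load", "r1", ("r0",)),
                         ("add", "r2", ("r1", "r3")),
                         ("sub", "r4", ("r5", "r6"))])
print([ins[0] for ins in sched])  # ['load', 'sub', 'add']
```

Real compilers use far more general list-scheduling algorithms over a dependence graph; this one-instruction lookahead is only meant to show the flavor of the analysis.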

Hardware Speculation

We have seen how branch prediction can speed up instruction fetch and we have seen hints about how branch prediction can allow speculation. Here, we'll go into more detail about hardware speculation. Later on, we'll see how software can aid in speculation.

The basic idea is to treat branch predictions as if they are correct, and speculatively execute the resulting instructions. The speculations are verified, and if there is a branch misprediction something special happens to get rid of the mis-speculated instructions.

We do all this in the context of dynamic scheduling, i.e., out-of-order execution. However, there is now an extra phase in the algorithm, the commit phase. Instructions are committed when we are sure they were supposed to execute, i.e., when we know there is no misprediction. Results are held in the reorder buffer (ROB) until they are ready to be committed. The ROB can be thought of as a queue where instructions are dequeued in order: instructions are fetched in-order, processed out-of-order and placed into the ROB, and then "graduate" in-order again from the ROB. The entries in the reorder buffer (along with results at reservation stations) now form the virtual register file that architectural registers are renamed into.
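A minimal sketch of the ROB's queue discipline, assuming a much-simplified design of my own (a real ROB entry also tracks the destination register, exception state, whether the entry is a store, and so on):

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # instructions enter at the tail, in program order

    def issue(self, tag):
        """Allocate an entry at issue time, in fetch order."""
        self.entries.append({"tag": tag, "done": False, "value": None})

    def write_result(self, tag, value):
        """Record a completed result; completion may happen out of order."""
        for e in self.entries:
            if e["tag"] == tag:
                e["done"], e["value"] = True, value

    def commit(self):
        """Retire from the head only; stop at the oldest unfinished instruction."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["tag"])
        return committed

rob = ReorderBuffer()
for tag in ("i1", "i2", "i3"):
    rob.issue(tag)
rob.write_result("i3", 7)  # i3 finishes first...
print(rob.commit())        # ...but nothing commits while i1 is pending: []
rob.write_result("i1", 1)
rob.write_result("i2", 2)
print(rob.commit())        # now all three graduate in order: ['i1', 'i2', 'i3']
```

The key property this models is that completion order and commit order are decoupled: results may arrive in any order, but architectural state is only updated from the head of the queue.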

Here are the four phases:

  1. Issue. Take the next instruction from the instruction queue. If there is a free reservation station and a free ROB slot, issue it to both; otherwise, stall.
  2. Execute. Wait until both operands are available (monitoring the common data bus for results), then execute the operation.
  3. Write result. Broadcast the result on the common data bus to any waiting reservation stations and to the instruction's ROB entry.
  4. Commit. Instructions are processed from the ROB in order. There are three cases:
       - A normal instruction commits by writing its result from the ROB entry to the register file.
       - A store commits by writing its value to memory.
       - A mispredicted branch means the speculation was wrong: the ROB is flushed and execution restarts at the correct successor of the branch.
This is the clean, idealized version of speculative out-of-order execution. In practice, implementations work hard to detect mispredictions as early as possible, and flush only those instructions that were issued after the mis-speculated branch. Still, the cost of a mispredicted branch can be very high, a minimum of 31 cycles on the Pentium 4 and 14 cycles on Intel Core, so a good branch predictor is essential.
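The impact on effective CPI can be estimated with the standard penalty formula. The branch frequency and misprediction rate below are illustrative assumptions of mine; the 31-cycle penalty is the Pentium 4 figure from the text:

```python
def mispredict_cpi_penalty(branch_freq, mispredict_rate, penalty_cycles):
    """Extra cycles per instruction lost to branch mispredictions."""
    return branch_freq * mispredict_rate * penalty_cycles

# Assumed: 20% of instructions are branches, 5% of them mispredict,
# with the Pentium 4's 31-cycle minimum misprediction penalty.
print(round(mispredict_cpi_penalty(0.20, 0.05, 31), 2))  # 0.31
```

Even a 5% misprediction rate costs roughly a third of a cycle per instruction here, which is why predictor accuracy matters so much on deeply pipelined machines.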