Lecture 8: More Caches, Virtual Memory

Cache Design

Caches are used to help achieve good performance with slow main memories. If we're interested in good performance, then why not build a huge, fully-associative cache with single-cycle access? Unfortunately, it's not so simple.

Cache Design Considerations

When designing a cache, there are three main things to consider: miss rate, access delay, and the chip area the cache occupies. The first two have a direct effect on performance. The last has an impact on the cost of the processor or the availability of chip area for other purposes. For L1 caches, the most important considerations are miss rate and delay. For an in-order machine, we'd like to have single-cycle access to the L1 cache; otherwise, every load incurs at least one stall cycle. Out-of-order machines can relax this requirement a little, but instruction throughput will suffer for every extra cycle of delay in the cache. Parameters such as cache size, block size, associativity, and banking affect both miss rate and delay, as the example below illustrates.

Tricks for Building Better Caches

Designers have come up with several tricks for decreasing or mitigating the delay of caches, such as pipelining the cache access over multiple stages and dividing the cache into several independently addressed banks; banking shows up in the example that follows.

An Example

Suppose we are building an inorder processor in 130nm technology for which we want to have a very high clock rate so we can sell more processors. We decide to have a small 8KB data cache to minimize delay. We want the cache to have single-cycle access. We can choose between the following configurations:
 
  # Sets per bank   Block size   Associativity   Access time   Cycle time   Miss rate   # Banks
  128               64B          Direct mapped   690 ps        242 ps       10.3%       1
  64                64B          2-way           898 ps        300 ps       7.5%        1
  128               16B          4-way           906 ps        302 ps       6.5%        1
  32                64B          Direct mapped   603 ps        201 ps       10.3%       4
Which cache is the best choice in terms of performance? Unfortunately, the answer is not clear and depends on other information we don't have yet. Let's consider two possibilities:
  1. Clearly, the 4-way set-associative cache has the lowest miss rate, so it will deliver the highest IPC. However, since we have said that we want the cache to have single-cycle access, the clock period can be no less than 906 ps (actually, it will be a little longer because of pipeline register delay). So the maximum clock rate is 1.0 / (906 * 10^-12) = 1.1 GHz. Suppose a perfect cache (i.e., no misses at all) yields an IPC of 1.0 and the cache miss rate is the only factor affecting IPC. Suppose every cache miss incurs a 60 ns penalty, and 20% of all instructions are memory instructions. A 60 ns miss penalty translates into 66 cycles at a 1.1 GHz clock rate. Stall cycles per instruction due to cache misses will be 66 * 0.065 * 0.2 = 0.858, so the new CPI is 1.858 and the new IPC is 1 / 1.858 = 0.538. The number of instructions per second (IPS), which is the most important metric for performance, is 0.538 IPC * 1.1 billion Hz = 592 million instructions per second.
  2. The direct-mapped, single-bank configuration is also pretty fast, allowing a maximum clock rate of 1.0 / (690 * 10^-12), or about 1.5 GHz. But its miss rate is higher, so IPC will be lower. A 60 ns miss penalty translates to 90 cycles at a 1.5 GHz clock rate. The CPI is 1 + 90 * 0.103 * 0.2 = 2.854, so the IPC is 0.350. Thus, the IPS is 0.350 * 1.5 billion = 525 million instructions per second.
Hmm. So the CPU with the higher clock rate actually has slightly worse performance. But still, which one do you think will sell better?
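
Both calculations follow the same recipe (IPS = IPC * clock rate, with the clock period set by the cache access time), so they are easy to check mechanically. The C sketch below reproduces them; the 1.0 base CPI, 60 ns miss penalty, and 20% memory-instruction mix are the assumptions stated above, and the results differ slightly from the figures in the text because the sketch keeps fractional clock rates and penalty cycles instead of rounding.

    #include <stdio.h>

    /* Evaluate one cache configuration under the assumptions stated above. */
    static void evaluate(const char *name, double access_ps, double miss_rate)
    {
        double clock_hz   = 1.0 / (access_ps * 1e-12);  /* single-cycle access sets the clock period */
        double penalty_cy = 60e-9 * clock_hz;           /* 60 ns miss penalty, expressed in cycles */
        double cpi        = 1.0 + penalty_cy * miss_rate * 0.2;  /* 20% of instructions access memory */
        double ips        = (1.0 / cpi) * clock_hz;     /* IPS = IPC * clock rate */
        printf("%-36s %4.2f GHz  CPI %.3f  %3.0f MIPS\n",
               name, clock_hz / 1e9, cpi, ips / 1e6);
    }

    int main(void)
    {
        evaluate("4-way, 128 sets, 16B blocks", 906.0, 0.065);
        evaluate("direct mapped, 128 sets, 64B blocks", 690.0, 0.103);
        return 0;
    }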

Categorizing Memory Hierarchy Misses

Before continuing, let's take a look at the reasons why accesses miss in caches. These reasons are true at any level of the memory hierarchy, from registers to disk. They are referred to as the "3 Cs" of cache misses:
  1. Compulsory. Compulsory misses are misses that could not possibly be avoided, e.g., the first access to an item. Cold-start misses are compulsory misses that happen when a program first starts up. Data has to come all the way through the memory hierarchy before it can be placed in a cache and used by the processor.
  2. Capacity. Capacity misses occur when the cache is smaller than the working set of blocks or pages in the program. The cache cannot contain all of the blocks, so some are evicted only to be brought back in later.
  3. Conflict. Conflict misses are caused by the block placement policy. Direct-mapped caches are the most prone to conflict misses. Even though the working set of blocks may be smaller than the cache, two blocks that map to the same set will repeatedly evict each other and cause misses, as the sketch below illustrates. Fully associative caches are immune to conflict misses. Misses that occur in a set-associative cache but would not have occurred in an equivalent fully-associative cache are conflict misses.
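
To make the conflict case concrete, here is a small C sketch. The 8 KB direct-mapped cache it assumes is hypothetical; the point is that the two arrays are laid out exactly one cache size apart, so a[i] and b[i] always land in the same set and evict each other on every iteration, even though the working set at any moment is tiny. A set-associative or fully associative cache of the same size would not thrash here.

    #include <stdlib.h>

    #define CACHE_SIZE (8 * 1024)   /* hypothetical 8 KB direct-mapped data cache */

    long sum_pair(void)
    {
        char *buf = calloc(2, CACHE_SIZE);
        if (!buf)
            return 0;
        /* Place the two arrays exactly CACHE_SIZE bytes apart, so every
           a[i]/b[i] pair maps to the same set of the direct-mapped cache. */
        char *a = buf;
        char *b = buf + CACHE_SIZE;
        long sum = 0;
        for (int i = 0; i < CACHE_SIZE; i++)
            sum += a[i] + b[i];     /* each access evicts the block the other just brought in */
        free(buf);
        return sum;
    }
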
Note that these 3 Cs extend beyond memory systems to other areas of computer science. Conflict misses are analogous to collisions in hash tables. A database system that tries to keep most of its structures in memory will have compulsory misses when it first starts up. The 3 Cs apply directly to branch predictors as well.

Enhancing Cache Performance

Here we discuss three techniques for improving the performance of caches: nonblocking caches, hardware-based prefetching, and software-based prefetching.

Nonblocking Caches

We have assumed that every cache miss will cause stall cycles. However, this need not be the case with a processor that executes out-of-order. Suppose there is a miss in the L1 cache, but the data is in the L2 cache. A nonblocking L1 cache can handle the miss while simultaneously allowing subsequent data accesses to proceed out-of-order. When the data is finally ready, the corresponding load can execute. If access to the offending item did not form a bottleneck in the flow of data, no stall cycles may have been needed. In any event, stall cycles due to cache misses will be reduced. This optimization increases the complexity of the cache controller, since it may have to handle several memory accesses at once. Nonblocking caches are generally useful only for L1 caches, since the latency between L2 caches and memory is so large that stall cycles will be unavoidable with limited ILP.
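
One way to picture that extra complexity is the bookkeeping a nonblocking cache keeps for misses that are still in flight, commonly organized as a table of miss status holding registers (MSHRs). The C sketch below is purely illustrative; the entry fields and the limit of eight outstanding misses are assumptions, not a description of any particular processor.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_MSHRS 8             /* assumed limit on outstanding misses */

    typedef struct {
        bool     valid;             /* entry is tracking an in-flight miss        */
        uint64_t block_addr;        /* block being fetched from the L2 cache      */
        uint8_t  waiting_tag;       /* hypothetical tag of the load waiting on it */
    } mshr_t;

    static mshr_t mshrs[NUM_MSHRS];

    /* Record a new miss.  Returns true if the miss was accepted, so the
       pipeline can keep issuing independent instructions; returns false
       if every MSHR is busy and the cache must finally block. */
    bool handle_miss(uint64_t block_addr, uint8_t load_tag)
    {
        for (int i = 0; i < NUM_MSHRS; i++) {
            if (!mshrs[i].valid) {
                mshrs[i] = (mshr_t){ true, block_addr, load_tag };
                return true;
            }
        }
        return false;
    }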

Hardware-Based Prefetching

The idea of prefetching is to get an item out of memory and store it in a buffer well before the processor requests the item. The item may be a data cache block or an instruction fetch block. Common examples include sequential prefetching, which fetches the next block after the one that missed, and stride prefetching, which detects loads that walk through memory with a regular stride and fetches ahead of them; a sketch of the latter follows.
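
As an illustration, here is a toy per-PC stride prefetcher in C. The 256-entry table, the indexing by load PC, and the rule of confirming the same stride twice before prefetching are assumptions made for the sketch, not a description of any shipping prefetcher.

    #include <stdint.h>

    #define TABLE_SIZE 256          /* assumed number of tracked load PCs */

    typedef struct {
        uint64_t last_addr;         /* last address this load touched      */
        int64_t  stride;            /* distance between its last two loads */
    } stride_entry_t;

    static stride_entry_t table[TABLE_SIZE];

    /* Called once per load; returns an address to prefetch, or 0 for none. */
    uint64_t observe_load(uint64_t pc, uint64_t addr)
    {
        stride_entry_t *e = &table[pc % TABLE_SIZE];
        int64_t stride = (int64_t)(addr - e->last_addr);
        /* Prefetch one stride ahead only if the same stride repeated. */
        uint64_t prefetch = (stride != 0 && stride == e->stride) ? addr + stride : 0;
        e->stride    = stride;
        e->last_addr = addr;
        return prefetch;
    }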

Software-Based Prefetching

Some ISAs provide prefetch hint instructions that direct the processor to begin a data access from a certain address well before the data is needed. There are two types of prefetch hints: register prefetches, which load the value into a register, and cache prefetches, which bring the data only into the cache. Either kind of hint can be either faulting or nonfaulting, meaning that the access either can or cannot produce a page fault. Nonfaulting prefetches allow prefetching off the end of arrays and make the compiler's job easier.

A common idiom is to specify prefetch hints as loads to register 0, where register 0 is understood to always contain the value zero. This way, the ISA isn't changed and the prefetch instruction has no effect on previous versions of the architecture.

Compilers often use loop unrolling to exploit prefetch instructions. This way, long-latency prefetches can be pipelined across multiple iterations of a loop.
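
Here is a C sketch of that idea, using the GCC/Clang __builtin_prefetch intrinsic as a stand-in for an ISA prefetch hint. The unroll factor of four and the prefetch distance of 64 elements are made-up tuning parameters for illustration, not recommendations.

    /* Sum an array, unrolled by 4, issuing a nonfaulting prefetch well ahead
       of the current position so the prefetch latency overlaps useful work. */
    #define PREFETCH_DISTANCE 64    /* assumed distance, in elements */

    double sum_array(const double *a, long n)
    {
        double sum = 0.0;
        long i;
        for (i = 0; i + 4 <= n; i += 4) {
            /* Nonfaulting, so running past the end of the array is harmless. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1);
            sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        }
        for (; i < n; i++)          /* finish the leftover elements */
            sum += a[i];
        return sum;
    }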

Roth and Sohi studied jump-pointer prefetching for linked data structures. The idea here is to insert extra fields into each node of a linked data structure that point to other nodes that should be prefetched when that node is accessed. There has been considerable work into algorithms to effectively place prefetch instructions.
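
A sketch of the data-structure side of jump-pointer prefetching is below; the field name jump, the use of __builtin_prefetch, and the choice of how far ahead the pointer aims are illustrative assumptions, not the specific mechanism from the Roth and Sohi paper.

    #include <stddef.h>

    struct node {
        int          payload;
        struct node *next;
        struct node *jump;          /* extra field: points several nodes ahead */
    };

    /* Traverse the list, prefetching through the jump pointer so a node we
       will need a few iterations from now is already moving up the memory
       hierarchy when we reach it. */
    long walk(struct node *n)
    {
        long sum = 0;
        while (n != NULL) {
            if (n->jump != NULL)
                __builtin_prefetch(n->jump, 0, 1);
            sum += n->payload;
            n = n->next;
        }
        return sum;
    }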

Virtual Memory

Let's move up a couple of levels in the memory hierarchy. Virtual memory is the idea of using main memory as a cache for a huge address space stored on secondary storage such as hard disks. Since DRAM is expensive relative to hard disks, this makes the same kind of sense as using SRAM caches for DRAM main memories. However, there are important differences between the two schemes: the miss penalty (a page fault serviced from disk) is enormous, misses are handled in software by the operating system rather than in hardware, and the blocks (pages) are much larger than cache blocks.

The Main Idea

The main idea is that each process is provided with its own virtual address space, separate from other processes and potentially larger than the available main memory. Some key concepts: the virtual addresses generated by a program are translated into physical addresses used to access main memory; memory is divided into fixed-size pages, the unit of transfer between disk and main memory; and a reference to a page that is not in main memory causes a page fault, which the operating system services by bringing the page in from disk.

Virtual Memory Organization

Virtual memory can be characterized at a high-level in a similar way to cache organization with the following four questions:
  1. Block placement. Where can a block be placed in main memory? In caches, we had direct mapped, set associative, and fully associative. Fully associative had the lowest miss rates but was the most expensive, so it was rarely used. With virtual memory, however, the cost of a miss is so high that it doesn't matter that fully associative placement takes longer to search, even though the search is done sequentially in software rather than in parallel in hardware. Thus, fully associative is the only sensible option for virtual memory.
  2. Block identification. How is a block found if it is in main memory? This is the problem of address translation. Techniques include page tables indexed by virtual page number, inverted (hashed) page tables, and translation lookaside buffers (TLBs) that cache recently used translations; a simple sketch of translation appears after this list.
  3. Block replacement. Which block should be replaced on a page fault? Almost all operating systems replace the least-recently used block. Again, even though true LRU is expensive to maintain, it is far cheaper than allowing even a small increase in the miss rate.
  4. Write strategy. What happens on a write? The strategy is always write-back with a dirty bit, because the cost of writing-through is too high.
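
The answers to these questions come together in address translation. Below is a minimal C sketch of a lookup in a single-level page table, assuming 4 KB pages and a simplified page-table-entry layout; real systems use multi-level or hashed tables plus a TLB, which are omitted here.

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12                        /* assumed 4 KB pages */
    #define PAGE_SIZE  (1ull << PAGE_SHIFT)

    typedef struct {
        bool     present;    /* the page is currently in main memory                 */
        bool     dirty;      /* set on writes; consulted by the write-back policy    */
        uint64_t frame;      /* physical frame number: any frame (fully associative) */
    } pte_t;

    /* Translate a virtual address; returns false on a page fault, which the
       operating system would handle by choosing a victim page (LRU or an
       approximation), writing it back if dirty, and bringing the page in. */
    bool translate(const pte_t *page_table, uint64_t vaddr, uint64_t *paddr)
    {
        uint64_t vpn    = vaddr >> PAGE_SHIFT;       /* virtual page number    */
        uint64_t offset = vaddr & (PAGE_SIZE - 1);   /* offset within the page */
        const pte_t *pte = &page_table[vpn];

        if (!pte->present)
            return false;                            /* page fault */
        *paddr = (pte->frame << PAGE_SHIFT) | offset;
        return true;
    }
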
For next time, study for the exam.