Lecture 8: More Caches, Virtual Memory
Cache Design
Caches are used to help achieve good performance with slow main memories.
If we're interested in good performance, then why not build a huge,
fully-associative cache with single-cycle access? Unfortunately, it's
not so simple.
Cache Design Considerations
When designing a cache, there are three main things to consider: miss rate, delay, and area.
The first two have a direct effect on performance. The last has an impact
on the cost of the processor or the availability of chip area for other
purposes. For L1 caches, the most important components are miss rate
and delay. For an in-order machine, we'd like to have single-cycle access
to the L1 cache; otherwise, every load incurs at least one stall cycle.
Out-of-order machines can relax this requirement a little, but instruction
throughput will suffer for every bit of delay in the cache.
The following factors affect miss rate and delay:
- Size of cache. A larger cache will have a lower miss rate and a
higher delay.
- Associativity. A cache with more associativity will have a lower miss
rate and a higher delay. The higher delay is due to extra multiplexers that
are used to implement associativity within sets. Even with the slightly
higher delay, it is usually worth it to have a set-associative cache.
- Block size. A cache with a larger block size may have lower delay,
but the miss rate will vary with the kind of workload. Programs with high
spatial locality will do well, but programs with poor spatial locality will
not. Note that block size also affects the number of transistors in the
cache. Larger blocks mean fewer sets, so less decoder circuitry, and fewer
blocks to tag, so less total tag storage (the sketch after this list works
through the arithmetic).
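To make the block-size arithmetic concrete, here is a minimal C sketch (the helper names and the particular 8KB configurations are illustrative, not from the lecture) that computes the offset, index, and tag widths for a few cache geometries. Note how larger blocks mean fewer sets, so a narrower index decoder, and fewer blocks to tag, so less total tag storage.

```c
#include <stdio.h>

/* Number of bits needed to index n items, where n is a power of two. */
static int log2u(unsigned n) {
    int bits = 0;
    while (n > 1) { n >>= 1; bits++; }
    return bits;
}

/* Print the address breakdown for one configuration, assuming 32-bit
 * addresses. Capacity and block size are given in bytes. */
static void breakdown(unsigned capacity, unsigned block, unsigned ways) {
    unsigned sets   = capacity / (block * ways);
    unsigned blocks = capacity / block;
    int offset_bits = log2u(block);
    int index_bits  = log2u(sets);
    int tag_bits    = 32 - index_bits - offset_bits;
    printf("%u-way, %3uB blocks: %4u sets, index=%2d bits, tag=%2d bits, "
           "total tag storage=%6u bits\n",
           ways, block, sets, index_bits, tag_bits,
           blocks * (unsigned)tag_bits);
}

int main(void) {
    breakdown(8 * 1024, 16, 1);   /* small blocks: wide decoder, many tags    */
    breakdown(8 * 1024, 64, 1);   /* larger blocks: narrower decoder, fewer tags */
    breakdown(8 * 1024, 64, 2);
    breakdown(8 * 1024, 16, 4);
    return 0;
}
```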
Tricks for Building Better Caches
Designers have come up with several tricks for decreasing or mitigating
the delay of caches. Here are a few:
- Pipelined caches. Some parts of a cache can be pipelined. Thus,
a cache with a multi-cycle delay may be able to deliver a cache block on
every cycle. The delay of any one cache access stays the same, but the
throughput increases because a new access may begin on every cycle; we
don't have to wait until one access is through before issuing a new access.
A pipelined cache has two kinds of delays:
- Access time. This is the amount of time that passes from the
time the address lines are written to the time the data is available.
- Cycle time. This is the minimum amount of time between the
initiation of two successive accesses.
- Banked organizations. A cache can be divided into several
banks, each resembling a smaller cache. This has two advantages:
- Multiple cache accesses can be issued in the same cycle as
long as they reference distinct banks.
- Each bank has a lower access delay, so access to the cache can
be a little faster. (A sketch of how addresses select banks appears
after this list.)
- Multi-ported caches. A cache can be augmented with extra bit lines to
provide multiple ports. A cache may have, say, two read ports and one write
port. This way two reads can be initiated in the same cycle without
the bank-conflict restriction of banked caches. Multi-ported caches are
more expensive in terms of area and delay, so there is a trade-off.
- Virtually addressed caches. One thing we haven't talked about so
far is address translation, but in a system with virtual memory (i.e., any
modern high-performance system), virtual addresses must be translated to
physical memory addresses. This address translation is on the critical path
to accessing the cache if the tags in the cache are derived from physical
addresses. Another option is to use virtual addresses to derive the tags.
This way, we can begin the cache access earlier by using the virtual address.
The downside is that two different processes may map the same virtual address
to two different physical locations, so the processor must flush the cache
on a context switch or keep process-specific information in the tags.
- Trace caches. A trace cache is a special kind of instruction cache.
We know that caches work best when there is a lot of locality. Basically,
a trace cache takes an instruction stream and makes it have more locality.
Instructions are stored as traces in the trace cache. That is,
they are stored in the order they were fetched, rather than in the order
in which they appear in memory. Along with a good branch predictor, a
trace cache can greatly increase instruction fetch bandwidth. The Pentium
4 uses a trace cache of micro-ops to help support high fetch rates at high
clock speeds.
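As a concrete illustration of the banked organization described above, here is a minimal C sketch (the block size, bank count, and example addresses are illustrative) of how a cache might steer each access to a bank using low-order block-address bits; two accesses can proceed in the same cycle only when they land in different banks.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 64u   /* bytes per cache block (illustrative) */
#define NUM_BANKS   4u   /* number of banks (illustrative)       */

/* Interleave blocks across banks using the low-order block-address bits. */
static unsigned bank_of(uint32_t addr) {
    return (addr / BLOCK_SIZE) % NUM_BANKS;
}

int main(void) {
    uint32_t a = 0x1000, b = 0x1040, c = 0x1100;
    printf("0x%04x -> bank %u\n", (unsigned)a, bank_of(a)); /* bank 0 */
    printf("0x%04x -> bank %u\n", (unsigned)b, bank_of(b)); /* bank 1: can proceed alongside the first access */
    printf("0x%04x -> bank %u\n", (unsigned)c, bank_of(c)); /* bank 0: conflicts with the first access */
    return 0;
}
```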
An Example
Suppose we are building an in-order processor in 130nm technology for which
we want to have a very high clock rate so we can sell more processors.
We decide to have a small 8KB data cache to minimize delay. We want the
cache to have single-cycle access. We can choose between the following
configurations:
# Sets per bank | Block size | Associativity | Access time | Cycle time | Miss rate | # Banks
----------------|------------|---------------|-------------|------------|-----------|--------
128             | 64B        | Direct Mapped | 690 ps      | 242 ps     | 10.3%     | 1
64              | 64B        | 2-way         | 898 ps      | 300 ps     | 7.5%      | 1
128             | 16B        | 4-way         | 906 ps      | 302 ps     | 6.5%      | 1
32              | 64B        | Direct Mapped | 603 ps      | 201 ps     | 10.3%     | 4
Which cache is the best choice in terms of performance? Unfortunately,
the answer is not clear and depends on other information we don't have yet.
Let's consider two possibilities:
- Clearly the 4-way associative cache has the lowest miss rate, so
it will deliver the highest IPC. However, since we have said that we
want the cache to have single-cycle access, we are limited to a clock
period of no less than 906 ps (actually, it will be a little longer
because of pipeline register delay). So the maximum clock rate is 1.0 /
(906 * 10^-12) = 1.1 GHz. Suppose the perfect cache (i.e. no misses at
all) yields an IPC of 1.0, and the cache miss rate is the only factor
affecting IPC. Suppose every cache miss incurs a 60ns penalty, and 20% of
all instructions are memory instructions. A 60ns miss penalty translates
into 66 cycles with a 1.1GHz clock rate. Stall cycles per instruction due
to cache misses will be 66 * 0.065 * 0.2 = 0.858, so the new CPI is 1.858.
Then the new IPC will be 1 / 1.858 = 0.538. The number of instructions
per second (IPS), which is the most important metric for performance,
is 0.538 IPC * 1.1 billion Hz = 592 million instructions per second.
- The direct-mapped cache with four banks is pretty fast, allowing a
maximum clock rate of 1.0 / (603 * 10^-12) = 1.66 GHz. But the miss rate
is higher, so IPC will be lower. A 60ns miss penalty translates to about
100 cycles at a 1.66GHz clock rate. The CPI is 1 + 100 * 0.103 * 0.2 = 3.06,
so the IPC is 0.327. Thus, the IPS is 0.327 * 1.66 billion = 543 million
instructions per second.
Hmm. So the CPU with the higher clock rate actually has slightly worse
performance. But still, which one do you think will sell better?
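The arithmetic in the two cases above can be checked with a short C program. The function below is just a back-of-the-envelope model under the example's assumptions (base IPC of 1.0, a 60 ns miss penalty, 20% memory instructions, clock period equal to the cache access time), not a simulator; because it keeps full precision, its results (about 593 and 544 million) differ by roughly a million from the rounded hand calculations.

```c
#include <stdio.h>

/* Instructions per second for a single-cycle-access cache, assuming the
 * clock period equals the cache access time and cache misses are the only
 * source of stalls. */
static double ips(double access_ps, double miss_rate,
                  double miss_penalty_ns, double mem_frac) {
    double clock_hz    = 1.0 / (access_ps * 1e-12);
    double penalty_cyc = miss_penalty_ns * 1e-9 * clock_hz;
    double cpi         = 1.0 + penalty_cyc * miss_rate * mem_frac;
    return clock_hz / cpi;   /* IPC * clock rate */
}

int main(void) {
    printf("4-way, 906 ps:     %.0f million IPS\n",
           ips(906.0, 0.065, 60.0, 0.2) / 1e6);   /* ~593 million */
    printf("banked DM, 603 ps: %.0f million IPS\n",
           ips(603.0, 0.103, 60.0, 0.2) / 1e6);   /* ~544 million */
    return 0;
}
```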
Categorizing Memory Hierarchy Misses
Before continuing, let's take a look at the reasons why accesses miss
in caches. These reasons are true at any level of the memory hierarchy,
from registers to disk. They are referred to as the "3 Cs" of cache
misses:
- Compulsory. Compulsory misses are misses that could not
possibly be avoided, e.g., the first access to an item. Cold-start
misses are compulsory misses that happen when a program first starts up.
Data has to come all the way through the memory hierarchy before it can
be placed in a cache and used by the processor.
- Capacity. Capacity misses occur when the cache is smaller than
the working set of blocks or pages in the program. The cache cannot contain
all of the blocks, so some are evicted only to be brought back in later.
- Conflict. Conflict misses are caused by the block
placement policy. Direct mapped caches are most prone to conflict misses.
Even though the working set of blocks may be smaller than the cache,
two blocks that map to the same set will repeatedly evict each other and
cause misses. Fully associative caches are immune to conflict misses.
Misses that occur in a set-associative cache that wouldn't have occurred
in the equivalent fully-associative cache are conflict misses. (A concrete
example follows this list.)
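To make the conflict-miss case concrete (the example promised in the last item), consider the fragment below with the 8KB direct-mapped, 64B-block cache used elsewhere in this lecture. The comments describe the behavior one would expect under the assumption that the two arrays end up adjacent and block-aligned in memory, which is typical but not guaranteed; this is a thought experiment, not a measurement.

```c
#include <stdio.h>

#define CACHE_SIZE (8 * 1024)   /* capacity of the direct-mapped cache */

/* a[i] and b[i] are CACHE_SIZE bytes apart (assuming the arrays are laid
 * out back to back), so they index the same set of a direct-mapped cache
 * and evict each other on every iteration: conflict misses, even though
 * only two blocks are live at a time. A 2-way set-associative cache of
 * the same size would keep both blocks and suffer only compulsory misses. */
static char a[CACHE_SIZE];
static char b[CACHE_SIZE];

int main(void) {
    long s = 0;
    for (int i = 0; i < CACHE_SIZE; i++)
        s += a[i] + b[i];        /* alternating accesses to conflicting blocks */
    printf("%ld\n", s);
    return 0;
}
```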
Note that these 3 Cs extend beyond memory systems to other areas of computer
science. Conflict misses are analogous to collisions in hash tables.
A database system that tries to keep most of its structures in memory will
have compulsory misses when it first starts up. The 3 Cs directly apply
to branch predictors, as well.
Enhancing Cache Performance
Here we discuss three techniques for improving the performance of caches:
- Nonblocking Caches
- Hardware Prefetching
- Software Prefetching
Nonblocking Caches
We have assumed that every cache miss will cause stall cycles. However,
this need not be the case with a processor that executes out-of-order.
Suppose there is a miss in the L1 cache, but the data is in the L2
cache.
A nonblocking L1 cache can handle the miss while simultaneously
allowing subsequent data accesses to proceed out-of-order. When the data
is finally ready, the corresponding load can execute. If access to the
offending item did not form a bottleneck in the flow of data, no stall
cycles may have been needed. In any event, stall cycles due to cache misses
will be reduced. This optimization increases the complexity of the cache
controller, since it may have to handle several memory accesses at once.
Nonblocking caches are generally useful only for L1 caches, since the
latency between L2 caches and memory is so large that stall cycles will
be unavoidable with limited ILP.
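A nonblocking cache controller typically tracks its in-flight misses in a small table, often called miss status holding registers (MSHRs). The C sketch below is a simplified software model of that bookkeeping; the entry count and structure names are illustrative, and a real design would also record which waiting loads to wake when each fill returns.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_MSHRS 4   /* outstanding misses tracked (illustrative) */

/* One miss-status entry: which block is currently being fetched from L2. */
struct mshr {
    bool     valid;
    uint64_t block_addr;
};

static struct mshr mshrs[NUM_MSHRS];

/* Handle an L1 miss. If the block is already being fetched, merge with the
 * in-flight request; otherwise allocate a new entry and start a fill.
 * Returns false if every entry is busy, in which case the access must stall. */
static bool handle_miss(uint64_t block_addr) {
    int free_slot = -1;
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr)
            return true;                       /* merge with an in-flight miss */
        if (!mshrs[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return false;                          /* table full: stall */
    mshrs[free_slot].valid = true;             /* begin a new fill from L2 */
    mshrs[free_slot].block_addr = block_addr;
    return true;
}

/* Called when the fill for a block returns; frees the entry. */
static void fill_complete(uint64_t block_addr) {
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr)
            mshrs[i].valid = false;
}

int main(void) {
    printf("%d\n", handle_miss(0x40));   /* 1: new miss, entry allocated */
    printf("%d\n", handle_miss(0x40));   /* 1: merged, no new entry needed */
    fill_complete(0x40);                 /* data arrived, entry freed */
    return 0;
}
```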
Hardware-Based Prefetching
The idea of prefetching is to get an item out of memory and store
it in a buffer well before the processor requests the item. The item may
be a data cache block or a fetch block. Here are some examples:
- Instruction stream buffers. An instruction stream typically exhibits
a large amount of predictable locality. Some processors fetch two blocks
from the instruction cache, consuming one and storing the other in a
special buffer. If the next fetch block is in the buffer, it can be
accessed immediately on the next cycle, bypassing the instruction cache.
Adding more capacity to the buffer can increase the utility of the
instruction stream buffer.
- Data stream buffers. Many programs read data items from multiple data
streams. For instance, recall the matrix multiplication code we examined
in an earlier class. There are very regular accesses to three streams:
two for reading and one for writing. By prefetching the data streams,
the latency of accesses to main memory can be mitigated. Even accesses
with low spatial locality can be prefetched if their strides are regular;
a sketch of a simple stride detector appears after this list.
Palacharla and Kessler found that eight stream buffers could avoid 50%
to 70% of all cache misses from a split I+D 128KB cache.
- Prefetching for linked data structures. Linked data structures such
as linked lists and binary trees can be prefetched even though they
are laid out irregularly in memory. The main idea is to use the pointers
in the data structure to initiate prefetching. This idea is limited and
has its fullest utility with compiler or programmer assistance.
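The regular-stride prefetching described in the data-stream-buffer item can be modeled in a few lines of C. This is a software sketch of the hardware idea (a small table, indexed by the load's PC, that remembers the last address and stride seen); the table size and names are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 64   /* entries in the stride table (illustrative) */

/* One entry per static load: last address seen and the last stride. */
struct stride_entry {
    uint64_t last_addr;
    int64_t  stride;
};

static struct stride_entry table[TABLE_SIZE];

/* Called on each executed load. If the current stride matches the stride
 * seen last time, predict the next address so a prefetch can be issued.
 * Returns the predicted address, or 0 if no prediction is made. */
static uint64_t observe_load(uint64_t pc, uint64_t addr) {
    struct stride_entry *e = &table[(pc >> 2) % TABLE_SIZE];
    int64_t stride = (int64_t)(addr - e->last_addr);
    uint64_t predicted = 0;
    if (stride != 0 && stride == e->stride)
        predicted = addr + (uint64_t)stride;   /* confident: prefetch ahead */
    e->stride = stride;
    e->last_addr = addr;
    return predicted;
}

int main(void) {
    /* A load walking an array of 8-byte elements settles into stride 8. */
    for (uint64_t a = 0x1000; a < 0x1040; a += 8) {
        uint64_t p = observe_load(0x400, a);
        if (p)
            printf("load of 0x%llx -> prefetch 0x%llx\n",
                   (unsigned long long)a, (unsigned long long)p);
    }
    return 0;
}
```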
Software-Based Prefetching
Some ISAs provide prefetch hint instructions that direct the
processor to begin a data access from a certain address well before the
data is needed. There are two types of prefetch hints:
- Register prefetch hints. The prefetch loads the data into a register.
- Cache prefetch hints. The prefetch loads the data into a cache block.
Either kind of hint can be either faulting or nonfaulting,
meaning that the access either can or cannot produce a page fault.
Nonfaulting prefetches allow prefetching off the end of arrays and make
the job of the compiler easier.
A common idiom is to specify prefetch hints as loads to register 0, where
register 0 is understood to always contain the value zero. This way, the
ISA isn't changed and the prefetch instruction has no effect on previous
versions of the architecture.
Compilers often use loop unrolling to exploit prefetch instructions.
This way, long-latency prefetches can be pipelined across multiple iterations
of a loop.
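As a concrete sketch of this idiom, the loop below uses GCC/Clang's __builtin_prefetch (a cache prefetch hint that does not fault on invalid addresses) and is unrolled by eight so that one hint covers a whole 64-byte block of doubles. The prefetch distance and unroll factor are illustrative and would normally be tuned to the machine's block size and memory latency.

```c
#include <stddef.h>
#include <stdio.h>

#define PREFETCH_AHEAD 64   /* elements ahead to prefetch (illustrative) */

/* Sum an array, unrolled by 8 (one 64-byte block of doubles per iteration)
 * so that a single prefetch hint covers the whole block. Because the hint
 * is nonfaulting, running past the end of the array is harmless. */
static double sum_array(const double *a, size_t n) {
    double s = 0.0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 0); /* read, low temporal locality */
        s += a[i]     + a[i + 1] + a[i + 2] + a[i + 3]
           + a[i + 4] + a[i + 5] + a[i + 6] + a[i + 7];
    }
    for (; i < n; i++)   /* leftover elements */
        s += a[i];
    return s;
}

int main(void) {
    static double data[1000];
    for (int i = 0; i < 1000; i++)
        data[i] = 1.0;
    printf("%.0f\n", sum_array(data, 1000));   /* prints 1000 */
    return 0;
}
```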
Roth and Sohi studied jump-pointer prefetching for linked data structures.
The idea here is to insert extra fields into each node of a linked data
structure that point to other nodes that should be prefetched when that
node is accessed. There has been considerable work into algorithms to
effectively place prefetch instructions.
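A minimal sketch of the jump-pointer idea in C: each list node carries an extra pointer to a node several hops ahead, and the traversal issues a prefetch hint through it. The node layout, the jump distance, and the use of __builtin_prefetch are illustrative, not a description of Roth and Sohi's exact scheme.

```c
#include <stddef.h>
#include <stdio.h>

struct node {
    int          value;
    struct node *next;
    struct node *jump;   /* extra field: points a few nodes ahead */
};

/* Traverse the list; prefetch the node the jump pointer names so it is
 * (hopefully) in the cache by the time the walk reaches it. */
static long sum_list(struct node *head) {
    long s = 0;
    for (struct node *n = head; n != NULL; n = n->next) {
        if (n->jump)
            __builtin_prefetch(n->jump, 0, 1);
        s += n->value;
    }
    return s;
}

/* Link every node's jump pointer to the node 'dist' hops ahead. */
static void install_jump_pointers(struct node *head, int dist) {
    struct node *lead = head;
    for (int i = 0; i < dist && lead; i++)
        lead = lead->next;
    for (struct node *n = head; n && lead; n = n->next, lead = lead->next)
        n->jump = lead;
}

int main(void) {
    static struct node nodes[8];
    for (int i = 0; i < 8; i++) {
        nodes[i].value = i;
        nodes[i].next  = (i + 1 < 8) ? &nodes[i + 1] : NULL;
        nodes[i].jump  = NULL;
    }
    install_jump_pointers(&nodes[0], 4);
    printf("%ld\n", sum_list(&nodes[0]));   /* prints 28 */
    return 0;
}
```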
Virtual Memory
Let's move up a couple of levels in the memory hierarchy. Virtual memory
is the idea of using main memory as a cache for a huge address space stored
on secondary storage like hard disks. Since DRAMs are expensive relative
to hard disks, this makes the same kind of sense as using SRAM caches for
DRAM main memories. However, there are important differences between
these two schemes:
- The microarchitecture still needs very fast access to the cache, so
consulting translation information stored in DRAM on every access won't do.
We need to do virtual address translation as quickly as we do L1 cache
accesses.
- While an L2 cache miss is maybe 100 times more expensive than an L1
cache hit, a virtual memory page fault (i.e., a miss) is more like 50,000
times more expensive than an L2 cache miss. Thus, we need to be a lot
smarter about avoiding misses. Fortunately, we have a lot more time to
think about how to avoid misses. This job is usually delegated to the
operating system. Miss rates in virtual memory systems are around 0.0001%,
as opposed to 1-10% for caches. Pretty good, huh? No. I'd rather have
1000 L1 cache misses than 1 page fault.
- The size of virtual memory is determined by the width of addresses
the processor can generate, but the size of a cache is limited by more
immediate technology constraints such as chip area and delay.
- Virtual memory shares the disks with file systems. In fact, virtual
memory can be implemented on top of a file system, or alongside a file
system.
The Main Idea
The main idea is that each process is provided with its own virtual
address space, separate from other processes, and potentially
larger than available main memory. Some concepts:
- Virtual pages are mapped to physical pages that are kept either in
memory or on disk. For our purposes, we will assume that pages have a
fixed size, e.g. 512 bytes. For instance, the data at virtual address
0x89ab might be kept in memory at physical address 0x13ab (only the page
number is translated; the 9-bit page offset is unchanged), or somewhere
else in physical memory, or on the disk. (A short sketch after this list
works through this translation.)
- When a process issues a memory reference, it does so with a virtual
address. This address is translated to a physical address by the processor,
or causes a page fault because the corresponding page isn't located
in physical memory. In this case, some page in memory is replaced with
the right page from the disk and execution continues.
- Now it's much harder for a process to run out of memory, and it's
also hard for one process to trample another process's address space.
Capacity and protection are both provided by virtual memory.
- The CPU can still run in real mode with no address translation.
This is necessary so that operating systems can be implemented. Some weird
applications may also run in real mode.
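As a worked version of the translation step in the list above (using the 0x89ab example from the first item), here is a minimal C sketch that splits a virtual address into a 9-bit offset and a virtual page number and looks the page number up in a toy table; the table contents and sizes are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   512u   /* 2^9 bytes, as assumed in the list above */
#define OFFSET_BITS 9
#define NUM_VPAGES  256    /* tiny virtual address space (illustrative) */

/* Toy page table: virtual page number -> physical page number. A zero
 * entry stands for "not in physical memory" (a real table has explicit
 * valid bits and disk addresses). */
static uint32_t page_table[NUM_VPAGES] = {
    [0x44] = 0x9,   /* the virtual page holding 0x89ab lives in frame 0x9 */
};

/* Translate a virtual address; returns 1 on success, 0 on a page fault. */
static int translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> OFFSET_BITS;
    uint32_t offset = vaddr & (PAGE_SIZE - 1);
    if (vpn >= NUM_VPAGES || page_table[vpn] == 0)
        return 0;                         /* page fault: OS brings the page in */
    *paddr = (page_table[vpn] << OFFSET_BITS) | offset;
    return 1;
}

int main(void) {
    uint32_t pa;
    if (translate(0x89ab, &pa))
        printf("0x89ab -> 0x%x\n", (unsigned)pa);   /* prints 0x89ab -> 0x13ab */
    return 0;
}
```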
Virtual Memory Organization
Virtual memory can be characterized at a high-level in a similar way
to cache organization with the following four questions:
- Block placement. Where can a block be placed in main memory?
In caches, we had direct mapped, set associative, and fully associative.
Fully associative had the lowest miss rates but was the most expensive,
so it was rarely used. However, with virtual memory, the cost of a miss
is so high that it doesn't matter that fully associative placement takes
longer, even though the search can't be done in parallel in hardware;
the operating system does it in software using tables.
Thus, fully associative placement is the only option used for virtual memory.
- Block identification. How is a block found if it is in
main memory? This is the problem of address translation. The following
techniques can be used:
- A page table contains an entry for every virtual
page giving the physical address of that page, if it is mapped.
This requires a lot of storage that must be kept in main memory.
- An inverted page table is a hash table that has
only as many entries as there are main memory pages. It is
smaller than a page table.
- A translation lookaside buffer is a structure on
the processor similar to a cache that holds recently performed
address translations so that most translations can be done at the
speed of an L1 cache, as opposed to the speed of main memory.
A translation lookaside buffer caches recent virtual-to-physical
translations, along with information such as whether the virtual
page is dirty. (A sketch of a TLB lookup appears after this list.)
- Block replacement. Which block should be replaced on a page
fault? Almost all operating systems replace the least-recently used page,
or a close approximation of it. Again, even though true LRU is expensive
to maintain, a near-LRU policy is far cheaper than allowing even a small
increase in the miss rate.
- Write strategy. What happens on a write? The strategy is
always write-back with a dirty bit, because the cost of writing-through
is too high.
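Here is a minimal C model of the TLB lookup described under block identification above. The direct-mapped organization, the 64-entry size, and the entry format are illustrative (real TLBs are small and highly associative); the 512-byte pages and the 0x89ab mapping match the earlier example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 9       /* 512-byte pages, as in the earlier example */
#define TLB_ENTRIES 64

/* One TLB entry: a cached virtual-to-physical translation plus a dirty bit. */
struct tlb_entry {
    bool     valid;
    bool     dirty;
    uint32_t vpn;   /* virtual page number (acts as the tag) */
    uint32_t ppn;   /* physical page number                  */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Look up a virtual address. On a hit the translation is done at cache
 * speed; on a miss the page table in main memory must be walked and the
 * TLB refilled. Direct-mapped here for simplicity. */
static bool tlb_lookup(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr >> OFFSET_BITS;
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn) {
        *paddr = (e->ppn << OFFSET_BITS) | (vaddr & ((1u << OFFSET_BITS) - 1));
        return true;
    }
    return false;
}

/* Install a translation after a page-table walk. */
static void tlb_fill(uint32_t vpn, uint32_t ppn) {
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    e->valid = true;
    e->dirty = false;
    e->vpn = vpn;
    e->ppn = ppn;
}

int main(void) {
    uint32_t pa = 0;
    bool hit = tlb_lookup(0x89ab, &pa);
    printf("hit=%d\n", hit);                 /* 0: first access misses */
    tlb_fill(0x89ab >> OFFSET_BITS, 0x9);    /* page-table walk found frame 0x9 */
    hit = tlb_lookup(0x89ab, &pa);
    printf("hit=%d pa=0x%x\n", hit, (unsigned)pa);   /* 1, pa=0x13ab */
    return 0;
}
```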
For next time, study for the exam.