Lecture 8: More Caches, Virtual Memory
Cache Design
Caches are used to help achieve good performance with slow main memories.
If we're interested in good performance, then why not build a huge,
fully-associative cache with single-cycle access? Unfortunately, it's
not so simple.
Cache Design Considerations
When designing a cache, there are three main things to consider: miss rate, delay, and area.
The first two have a direct effect on performance. The last has an impact
on the cost of the processor or the availability of chip area for other
purposes. For L1 caches, the most important components are miss rate
and delay. For an in-order machine, we'd like to have single-cycle access
to the L1 cache; otherwise, every load incurs at least one stall cycle.
Out-of-order machines can relax this requirement a little, but instruction
throughput will suffer for every bit of delay in the cache.
The following factors affect miss rate and delay:
- Size of cache. A larger cache will have a lower miss rate and a
higher delay.
- Associativity. A cache with more associativity will have a lower miss
rate and a higher delay. The higher delay is due to extra multiplexers that
are used to implement associativity within sets. Even with the slightly
higher delay, it is usually worth it to have a set-associative cache.
- Block size. A cache with a larger block size may have lower delay,
but the miss rate will vary with the kind of workload. Programs with high
spatial locality will do well, but programs with poor spatial locality will
not. Note that block size also affects the number of transistors in the
cache. Larger blocks mean fewer sets, so less decoder circuitry, and fewer
blocks to tag, so less total tag storage (the sketch after this list works
through the arithmetic).
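To make the block-size arithmetic concrete, here is a minimal C sketch (the helper names and the particular 8KB configurations are illustrative, not from the lecture) that computes the offset, index, and tag widths for a few cache geometries. Note how larger blocks mean fewer sets, so a narrower index decoder, and fewer blocks to tag, so less total tag storage.

```c
#include <stdio.h>

/* Number of bits needed to index n items, where n is a power of two. */
static int log2u(unsigned n) {
    int bits = 0;
    while (n > 1) { n >>= 1; bits++; }
    return bits;
}

/* Print the address breakdown for one configuration, assuming 32-bit
 * addresses. Capacity and block size are given in bytes. */
static void breakdown(unsigned capacity, unsigned block, unsigned ways) {
    unsigned sets   = capacity / (block * ways);
    unsigned blocks = capacity / block;
    int offset_bits = log2u(block);
    int index_bits  = log2u(sets);
    int tag_bits    = 32 - index_bits - offset_bits;
    printf("%u-way, %3uB blocks: %4u sets, index=%2d bits, tag=%2d bits, "
           "total tag storage=%6u bits\n",
           ways, block, sets, index_bits, tag_bits,
           blocks * (unsigned)tag_bits);
}

int main(void) {
    breakdown(8 * 1024, 16, 1);   /* small blocks: wide decoder, many tags    */
    breakdown(8 * 1024, 64, 1);   /* larger blocks: narrower decoder, fewer tags */
    breakdown(8 * 1024, 64, 2);
    breakdown(8 * 1024, 16, 4);
    return 0;
}
```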
Tricks for Building Better Caches
Designers have come up with several tricks for decreasing or mitigating
the delay of caches. Here are a few:
- Pipelined caches. Some parts of a cache can be pipelined. Thus,
a cache with a multi-cycle delay may be able to deliver a cache block on
every cycle. The delay of any one cache access stays the same, but the
throughput increases because a new access may begin on every cycle; we
don't have to wait until one access is through before issuing a new access.
A pipelined cache has two kinds of delays:
- Access time. This is the amount of time that passes from the
time the address lines are written to the time the data is available.
- Cycle time. This is the minimum amount of time between the
initiation of two successive accesses.
- Banked organizations. A cache can be divided into several
banks, each resembling a smaller cache. This has two advantages:
- Multiple cache accesses can be issued in the same cycle as
long as they reference distinct banks.
- Each bank has a lower access delay, so access to the cache can
be a little faster. (A sketch of how addresses select banks appears
after this list.)
- Multi-ported caches. A cache can be augmented with extra bit lines to
provide multiple ports. A cache may have, say, two read ports and one write
port. This way two reads can be initiated in the same cycle without
the bank-conflict restriction of banked caches. Multi-ported caches are
more expensive in terms of area and delay, so there is a trade-off.
- Virtually addressed caches. One thing we haven't talked about so
far is address translation, but in a system with virtual memory (i.e., any
modern high-performance system), virtual addresses must be translated to
physical memory addresses. This address translation is on the critical path
to accessing the cache if the tags in the cache are derived from physical
addresses. Another option is to use virtual addresses to derive the tags.
This way, we can begin the cache access earlier by using the virtual address.
The downside is that two different processes may map the same virtual address
to two different physical locations, so the processor must flush the cache
on a context switch or keep process-specific information in the tags.
- Trace caches. A trace cache is a special kind of instruction cache.
We know that caches work best when there is a lot of locality. Basically,
a trace cache takes an instruction stream and makes it have more locality.
Instructions are stored as traces in the trace cache. That is,
they are stored in the order they were fetched, rather than in the order
in which they appear in memory. Along with a good branch predictor, a
trace cache can greatly increase instruction fetch bandwidth. The Pentium
4 uses a trace cache of micro-ops to help support high fetch rates at high
clock speeds.
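As a concrete illustration of the banked organization described above, here is a minimal C sketch (the block size, bank count, and example addresses are illustrative) of how a cache might steer each access to a bank using low-order block-address bits; two accesses can proceed in the same cycle only when they land in different banks.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 64u   /* bytes per cache block (illustrative) */
#define NUM_BANKS   4u   /* number of banks (illustrative)       */

/* Interleave blocks across banks using the low-order block-address bits. */
static unsigned bank_of(uint32_t addr) {
    return (addr / BLOCK_SIZE) % NUM_BANKS;
}

int main(void) {
    uint32_t a = 0x1000, b = 0x1040, c = 0x1100;
    printf("0x%04x -> bank %u\n", (unsigned)a, bank_of(a)); /* bank 0 */
    printf("0x%04x -> bank %u\n", (unsigned)b, bank_of(b)); /* bank 1: can proceed alongside the first access */
    printf("0x%04x -> bank %u\n", (unsigned)c, bank_of(c)); /* bank 0: conflicts with the first access */
    return 0;
}
```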
An Example
Suppose we are building an in-order processor in 130nm technology for which
we want to have a very high clock rate so we can sell more processors.
We decide to have a small 8KB data cache to minimize delay. We want the
cache to have single-cycle access. We can choose between the following
configurations:
# Sets per bank | Block size | Associativity | Access time | Cycle time | Miss rate | # Banks
----------------|------------|---------------|-------------|------------|-----------|--------
128             | 64B        | Direct Mapped | 690 ps      | 242 ps     | 10.3%     | 1
64              | 64B        | 2-way         | 898 ps      | 300 ps     | 7.5%      | 1
128             | 16B        | 4-way         | 906 ps      | 302 ps     | 6.5%      | 1
32              | 64B        | Direct Mapped | 603 ps      | 201 ps     | 10.3%     | 4
Which cache is the best choice in terms of performance? Unfortunately,
the answer is not clear and depends on other information we don't have yet.
Let's consider two possibilities:
- Clearly the 4-way associative cache has the lowest miss rate, so
it will deliver the highest IPC. However, since we have said that we
want the cache to have single-cycle access, we are limited to a clock
period of no less than 906 ps (actually, it will be a little longer
because of pipeline register delay). So the maximum clock rate is 1.0 /
(906 * 10^-12) = 1.1 GHz. Suppose the perfect cache (i.e. no misses at
all) yields an IPC of 1.0, and the cache miss rate is the only factor
affecting IPC. Suppose every cache miss incurs a 60ns penalty, and 20% of
all instructions are memory instructions. A 60ns miss penalty translates
into 66 cycles with a 1.1GHz clock rate. Stall cycles per instruction due
to cache misses will be 66 * 0.065 * 0.2 = 0.858, so the new CPI is 1.858.
Then the new IPC will be 1 / 1.858 = 0.538. The number of instructions
per second (IPS), which is the most important metric for performance,
is 0.538 IPC * 1.1 billion Hz = 592 million instructions per second.
- The direct-mapped cache with four banks is pretty fast, allowing a
maximum clock rate of 1.0 / (603 * 10^-12) = 1.66 GHz. But the miss rate
is higher, so IPC will be lower. A 60ns miss penalty translates to about
100 cycles at a 1.66GHz clock rate. The CPI is 1 + 100 * 0.103 * 0.2 = 3.06,
so the IPC is 0.327. Thus, the IPS is 0.327 * 1.66 billion = 543 million
instructions per second.
Hmm. So the CPU with the higher clock rate actually has slightly worse
performance. But still, which one do you think will sell better?
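The arithmetic in the two cases above can be checked with a short C program. The function below is just a back-of-the-envelope model under the example's assumptions (base IPC of 1.0, a 60 ns miss penalty, 20% memory instructions, clock period equal to the cache access time), not a simulator; because it keeps full precision, its results (about 593 and 544 million) differ by roughly a million from the rounded hand calculations.

```c
#include <stdio.h>

/* Instructions per second for a single-cycle-access cache, assuming the
 * clock period equals the cache access time and cache misses are the only
 * source of stalls. */
static double ips(double access_ps, double miss_rate,
                  double miss_penalty_ns, double mem_frac) {
    double clock_hz    = 1.0 / (access_ps * 1e-12);
    double penalty_cyc = miss_penalty_ns * 1e-9 * clock_hz;
    double cpi         = 1.0 + penalty_cyc * miss_rate * mem_frac;
    return clock_hz / cpi;   /* IPC * clock rate */
}

int main(void) {
    printf("4-way, 906 ps:     %.0f million IPS\n",
           ips(906.0, 0.065, 60.0, 0.2) / 1e6);   /* ~593 million */
    printf("banked DM, 603 ps: %.0f million IPS\n",
           ips(603.0, 0.103, 60.0, 0.2) / 1e6);   /* ~544 million */
    return 0;
}
```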
Categorizing Memory Hierarchy Misses
Before continuing, let's take a look at the reasons why accesses miss
in caches. These reasons are true at any level of the memory hierarchy,
from registers to disk. They are referred to as the "3 Cs" of cache
misses:
- Compulsory. Compulsory misses are misses that could not
possibly be avoided, e.g., the first access to an item. Cold-start
misses are compulsory misses that happen when a program first starts up.
Data has to come all the way through the memory hierarchy before it can
be placed in a cache and used by the processor.
- Capacity. Capacity misses occur when the cache is smaller than
the working set of blocks or pages in the program. The cache cannot contain
all of the blocks, so some are evicted only to be brought back in later.
- Conflict. Conflict misses are caused by the block
placement policy. Direct mapped caches are most prone to conflict misses.
Even though the working set of blocks may be smaller than the cache,
two blocks that map to the same set will repeatedly evict each other and
cause misses. Fully associative caches are immune to conflict misses.
Misses that occur in a set-associative cache that wouldn't have occurred
in the equivalent fully-associative cache are conflict misses. (A concrete
example follows this list.)
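To make the conflict-miss case concrete (the example promised in the last item), consider the fragment below with the 8KB direct-mapped, 64B-block cache used elsewhere in this lecture. The comments describe the behavior one would expect under the assumption that the two arrays end up adjacent and block-aligned in memory, which is typical but not guaranteed; this is a thought experiment, not a measurement.

```c
#include <stdio.h>

#define CACHE_SIZE (8 * 1024)   /* capacity of the direct-mapped cache */

/* a[i] and b[i] are CACHE_SIZE bytes apart (assuming the arrays are laid
 * out back to back), so they index the same set of a direct-mapped cache
 * and evict each other on every iteration: conflict misses, even though
 * only two blocks are live at a time. A 2-way set-associative cache of
 * the same size would keep both blocks and suffer only compulsory misses. */
static char a[CACHE_SIZE];
static char b[CACHE_SIZE];

int main(void) {
    long s = 0;
    for (int i = 0; i < CACHE_SIZE; i++)
        s += a[i] + b[i];        /* alternating accesses to conflicting blocks */
    printf("%ld\n", s);
    return 0;
}
```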
Note that these 3 Cs extend beyond memory systems to other areas of computer
science. Conflict misses are analogous to collisions in hash tables.
A database system that tries to keep most of its structures in memory will
have compulsory misses when it first starts up. The 3 Cs directly apply
to branch predictors, as well.
Enhancing Cache Performance
Here we discuss three techniques for improving the performance of caches:
- Nonblocking Caches
- Hardware Prefetching
- Software Prefetching
Nonblocking Caches
We have assumed that every cache miss will cause stall cycles. However,
this need not be the case with a processor that executes out-of-order.
Suppose there is a miss in the L1 cache, but the data is in the L2
cache.
A nonblocking L1 cache can handle the miss while simultaneously
allowing subsequent data accesses to proceed out-of-order. When the data
is finally ready, the corresponding load can execute. If access to the
offending item did not form a bottleneck in the flow of data, no stall
cycles may have been needed. In any event, stall cycles due to cache misses
will be reduced. This optimization increases the complexity of the cache
controller, since it may have to handle several memory accesses at once.
Nonblocking caches are generally useful only for L1 caches, since the
latency between L2 caches and memory is so large that stall cycles will
be unavoidable with limited ILP.
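A nonblocking cache controller typically tracks its in-flight misses in a small table, often called miss status holding registers (MSHRs). The C sketch below is a simplified software model of that bookkeeping; the entry count and structure names are illustrative, and a real design would also record which waiting loads to wake when each fill returns.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_MSHRS 4   /* outstanding misses tracked (illustrative) */

/* One miss-status entry: which block is currently being fetched from L2. */
struct mshr {
    bool     valid;
    uint64_t block_addr;
};

static struct mshr mshrs[NUM_MSHRS];

/* Handle an L1 miss. If the block is already being fetched, merge with the
 * in-flight request; otherwise allocate a new entry and start a fill.
 * Returns false if every entry is busy, in which case the access must stall. */
static bool handle_miss(uint64_t block_addr) {
    int free_slot = -1;
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr)
            return true;                       /* merge with an in-flight miss */
        if (!mshrs[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return false;                          /* table full: stall */
    mshrs[free_slot].valid = true;             /* begin a new fill from L2 */
    mshrs[free_slot].block_addr = block_addr;
    return true;
}

/* Called when the fill for a block returns; frees the entry. */
static void fill_complete(uint64_t block_addr) {
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr)
            mshrs[i].valid = false;
}

int main(void) {
    printf("%d\n", handle_miss(0x40));   /* 1: new miss, entry allocated */
    printf("%d\n", handle_miss(0x40));   /* 1: merged, no new entry needed */
    fill_complete(0x40);                 /* data arrived, entry freed */
    return 0;
}
```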
Hardware-Based Prefetching
The idea of prefetching is to get an item out of memory and store
it in a buffer well before the processor requests the item. The item may
be a data cache block or a fetch block. Here are some examples:
- Instruction stream buffers. An instruction stream typically exhibits
a large amount of predictable locality. Some processors fetch two blocks
from the instruction cache, consuming one and storing the other in a
special buffer. If the next fetch block is in the buffer, it can be
accessed immediately on the next cycle, bypassing the instruction cache.
Adding more capacity to the buffer can increase the utility of the
instruction stream buffer.
- Data stream buffers. Many programs read data items from multiple data
streams. For instance, recall the matrix multiplication code we examined
in an earlier class. There are very regular accesses to three streams:
two for reading and one for writing. By prefetching the data streams,
the latency of accesses to main memory can be mitigated. Even accesses
with low spatial locality can be prefetched if their strides are regular;
a sketch of a simple stride detector appears after this list.
Palacharla and Kessler found that eight stream buffers could avoid 50%
to 70% of all cache misses from a split I+D 128KB cache.
- Prefetching for linked data structures. Linked data structures such
as linked lists and binary trees can be prefetched even though they
are laid out irregularly in memory. The main idea is to use the pointers
in the data structure to initiate prefetching. This idea is limited and
has its fullest utility with compiler or programmer assistance.
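The regular-stride prefetching described in the data-stream-buffer item can be modeled in a few lines of C. This is a software sketch of the hardware idea (a small table, indexed by the load's PC, that remembers the last address and stride seen); the table size and names are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 64   /* entries in the stride table (illustrative) */

/* One entry per static load: last address seen and the last stride. */
struct stride_entry {
    uint64_t last_addr;
    int64_t  stride;
};

static struct stride_entry table[TABLE_SIZE];

/* Called on each executed load. If the current stride matches the stride
 * seen last time, predict the next address so a prefetch can be issued.
 * Returns the predicted address, or 0 if no prediction is made. */
static uint64_t observe_load(uint64_t pc, uint64_t addr) {
    struct stride_entry *e = &table[(pc >> 2) % TABLE_SIZE];
    int64_t stride = (int64_t)(addr - e->last_addr);
    uint64_t predicted = 0;
    if (stride != 0 && stride == e->stride)
        predicted = addr + (uint64_t)stride;   /* confident: prefetch ahead */
    e->stride = stride;
    e->last_addr = addr;
    return predicted;
}

int main(void) {
    /* A load walking an array of 8-byte elements settles into stride 8. */
    for (uint64_t a = 0x1000; a < 0x1040; a += 8) {
        uint64_t p = observe_load(0x400, a);
        if (p)
            printf("load of 0x%llx -> prefetch 0x%llx\n",
                   (unsigned long long)a, (unsigned long long)p);
    }
    return 0;
}
```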
Software-Based Prefetching
Some ISAs provide prefetch hint instructions that direct the
processor to begin a data access from a certain address well before the
data is needed. There are two types of prefetch hints:
- Register prefetch hints. The prefetch loads the data into a register.
- Cache prefetch hints. The prefetch loads the data into a cache block.
Either kind of hint can be either faulting or nonfaulting,
meaning that the access either can or cannot produce a page fault.
Nonfaulting prefetches allow prefetching off the end of arrays and make
the job of the compiler easier.
A common idiom is to specify prefetch hints as loads to register 0, where
register 0 is understood to always contain the value zero. This way, the
ISA isn't changed and the prefetch instruction has no effect on previous
versions of the architecture.
Compilers often use loop unrolling to exploit prefetch instructions.
This way, long-latency prefetches can be pipelined across multiple iterations
of a loop.
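As a concrete sketch of this idiom, the loop below uses GCC/Clang's __builtin_prefetch (a cache prefetch hint that does not fault on invalid addresses) and is unrolled by eight so that one hint covers a whole 64-byte block of doubles. The prefetch distance and unroll factor are illustrative and would normally be tuned to the machine's block size and memory latency.

```c
#include <stddef.h>
#include <stdio.h>

#define PREFETCH_AHEAD 64   /* elements ahead to prefetch (illustrative) */

/* Sum an array, unrolled by 8 (one 64-byte block of doubles per iteration)
 * so that a single prefetch hint covers the whole block. Because the hint
 * is nonfaulting, running past the end of the array is harmless. */
static double sum_array(const double *a, size_t n) {
    double s = 0.0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 0); /* read, low temporal locality */
        s += a[i]     + a[i + 1] + a[i + 2] + a[i + 3]
           + a[i + 4] + a[i + 5] + a[i + 6] + a[i + 7];
    }
    for (; i < n; i++)   /* leftover elements */
        s += a[i];
    return s;
}

int main(void) {
    static double data[1000];
    for (int i = 0; i < 1000; i++)
        data[i] = 1.0;
    printf("%.0f\n", sum_array(data, 1000));   /* prints 1000 */
    return 0;
}
```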
Roth and Sohi studied jump-pointer prefetching for linked data structures.
The idea here is to insert extra fields into each node of a linked data
structure that point to other nodes that should be prefetched when that
node is accessed. There has been considerable work into algorithms to
effectively place prefetch instructions.
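A minimal sketch of the jump-pointer idea in C: each list node carries an extra pointer to a node several hops ahead, and the traversal issues a prefetch hint through it. The node layout, the jump distance, and the use of __builtin_prefetch are illustrative, not a description of Roth and Sohi's exact scheme.

```c
#include <stddef.h>
#include <stdio.h>

struct node {
    int          value;
    struct node *next;
    struct node *jump;   /* extra field: points a few nodes ahead */
};

/* Traverse the list; prefetch the node the jump pointer names so it is
 * (hopefully) in the cache by the time the walk reaches it. */
static long sum_list(struct node *head) {
    long s = 0;
    for (struct node *n = head; n != NULL; n = n->next) {
        if (n->jump)
            __builtin_prefetch(n->jump, 0, 1);
        s += n->value;
    }
    return s;
}

/* Link every node's jump pointer to the node 'dist' hops ahead. */
static void install_jump_pointers(struct node *head, int dist) {
    struct node *lead = head;
    for (int i = 0; i < dist && lead; i++)
        lead = lead->next;
    for (struct node *n = head; n && lead; n = n->next, lead = lead->next)
        n->jump = lead;
}

int main(void) {
    static struct node nodes[8];
    for (int i = 0; i < 8; i++) {
        nodes[i].value = i;
        nodes[i].next  = (i + 1 < 8) ? &nodes[i + 1] : NULL;
        nodes[i].jump  = NULL;
    }
    install_jump_pointers(&nodes[0], 4);
    printf("%ld\n", sum_list(&nodes[0]));   /* prints 28 */
    return 0;
}
```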
Virtual Memory
Let's move up a couple of levels in the memory hierarchy. Virtual memory
is the idea of using main memory as a cache for a huge address space stored
on secondary storage like hard disks. Since DRAMs are expensive relative
to hard disks, this makes the same kind of sense as using SRAM caches for
DRAM main memories. However, there are important differences between
these two schemes:
- The microarchitecture still needs very fast access to the cache, so
consulting translation information stored in DRAM on every access won't do.
We need to do virtual address translation as quickly as we do L1 cache
accesses.
- While an L2 cache miss is maybe 100 times more expensive than an L1
cache hit, a virtual memory page fault (i.e., a miss) is more like 50,000
times more expensive than an L2 cache miss. Thus, we need to be a lot
smarter about avoiding misses. Fortunately, we have a lot more time to
think about how to avoid misses. This job is usually delegated to the
operating system. Miss rates in virtual memory systems are around 0.0001%,
as opposed to 1-10% for caches. Pretty good, huh? No. I'd rather have
1000 L1 cache misses than 1 page fault.
- The size of virtual memory is determined by the width of addresses
the processor can generate, but the size of a cache is limited by more
immediate technology constraints such as chip area and delay.
- Virtual memory shares the disks with file systems. In fact, virtual
memory can be implemented on top of a file system, or alongside a file
system.
The Main Idea
The main idea is that each process is provided with its own virtual
address space, separate from other processes, and potentially
larger than available main memory. Some concepts:
- Virtual pages are mapped to physical pages that are kept either in
memory or on disk. For our purposes, we will assume that pages have a
fixed size, e.g. 512 bytes. For instance, the data at virtual address
0x89ab might be kept in memory at physical address 0x13ab (only the page
number is translated; the 9-bit page offset is unchanged), or somewhere
else in physical memory, or on the disk. (A short sketch after this list
works through this translation.)
- When a process issues a memory reference, it does so with a virtual
address. This address is translated to a physical address by the processor,
or causes a page fault because the corresponding page isn't located
in physical memory. In this case, some page in memory is replaced with
the right page from the disk and execution continues.
- Now it's much harder for a process to run out of memory, and it's
also hard for one process to trample another process's address space.
Capacity and protection are both provided by virtual memory.
- The CPU can still run in real mode with no address translation.
This is necessary so that operating systems can be implemented. Some weird
applications may also run in real mode.
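As a worked version of the translation step in the list above (using the 0x89ab example from the first item), here is a minimal C sketch that splits a virtual address into a 9-bit offset and a virtual page number and looks the page number up in a toy table; the table contents and sizes are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   512u   /* 2^9 bytes, as assumed in the list above */
#define OFFSET_BITS 9
#define NUM_VPAGES  256    /* tiny virtual address space (illustrative) */

/* Toy page table: virtual page number -> physical page number. A zero
 * entry stands for "not in physical memory" (a real table has explicit
 * valid bits and disk addresses). */
static uint32_t page_table[NUM_VPAGES] = {
    [0x44] = 0x9,   /* the virtual page holding 0x89ab lives in frame 0x9 */
};

/* Translate a virtual address; returns 1 on success, 0 on a page fault. */
static int translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> OFFSET_BITS;
    uint32_t offset = vaddr & (PAGE_SIZE - 1);
    if (vpn >= NUM_VPAGES || page_table[vpn] == 0)
        return 0;                         /* page fault: OS brings the page in */
    *paddr = (page_table[vpn] << OFFSET_BITS) | offset;
    return 1;
}

int main(void) {
    uint32_t pa;
    if (translate(0x89ab, &pa))
        printf("0x89ab -> 0x%x\n", (unsigned)pa);   /* prints 0x89ab -> 0x13ab */
    return 0;
}
```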
Virtual Memory Organization
Virtual memory can be characterized at a high-level in a similar way
to cache organization with the following four questions:
- Block placement. Where can a block be placed in main memory?
In caches, we had direct mapped, set associative, and fully associative.
Fully associative had the lowest miss rates but was the most expensive,
so it was rarely used. However, with virtual memory, the cost of a miss
is so high that it doesn't matter that fully associative placement takes
longer, even though the search can't be done in parallel in hardware;
the operating system does it in software using tables.
Thus, fully associative placement is the only option used for virtual memory.
- Block identification. How is a block found if it is in
main memory? This is the problem of address translation. The following
techniques can be used:
- A page table contains an entry for every virtual
page giving the physical address of that page, if it is mapped.
This requires a lot of storage that must be kept in main memory.
- An inverted page table is a hash table that has
only as many entries as there are main memory pages. It is
smaller than a page table.
- A translation lookaside buffer is a structure on
the processor similar to a cache that holds recently performed
address translations so that most translations can be done at the
speed of an L1 cache, as opposed to the speed of main memory.
A translation lookaside buffer caches recent virtual-to-physical
translations, along with information such as whether the virtual
page is dirty. (A sketch of a TLB lookup appears after this list.)
- Block replacement. Which block should be replaced on a page
fault? Almost all operating systems replace the least-recently used page,
or a close approximation of it. Again, even though true LRU is expensive
to maintain, a near-LRU policy is far cheaper than allowing even a small
increase in the miss rate.
- Write strategy. What happens on a write? The strategy is
always write-back with a dirty bit, because the cost of writing-through
is too high.
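Here is a minimal C model of the TLB lookup described under block identification above. The direct-mapped organization, the 64-entry size, and the entry format are illustrative (real TLBs are small and highly associative); the 512-byte pages and the 0x89ab mapping match the earlier example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 9       /* 512-byte pages, as in the earlier example */
#define TLB_ENTRIES 64

/* One TLB entry: a cached virtual-to-physical translation plus a dirty bit. */
struct tlb_entry {
    bool     valid;
    bool     dirty;
    uint32_t vpn;   /* virtual page number (acts as the tag) */
    uint32_t ppn;   /* physical page number                  */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Look up a virtual address. On a hit the translation is done at cache
 * speed; on a miss the page table in main memory must be walked and the
 * TLB refilled. Direct-mapped here for simplicity. */
static bool tlb_lookup(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr >> OFFSET_BITS;
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn) {
        *paddr = (e->ppn << OFFSET_BITS) | (vaddr & ((1u << OFFSET_BITS) - 1));
        return true;
    }
    return false;
}

/* Install a translation after a page-table walk. */
static void tlb_fill(uint32_t vpn, uint32_t ppn) {
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    e->valid = true;
    e->dirty = false;
    e->vpn = vpn;
    e->ppn = ppn;
}

int main(void) {
    uint32_t pa = 0;
    bool hit = tlb_lookup(0x89ab, &pa);
    printf("hit=%d\n", hit);                 /* 0: first access misses */
    tlb_fill(0x89ab >> OFFSET_BITS, 0x9);    /* page-table walk found frame 0x9 */
    hit = tlb_lookup(0x89ab, &pa);
    printf("hit=%d pa=0x%x\n", hit, (unsigned)pa);   /* 1, pa=0x13ab */
    return 0;
}
```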
For next time, study for the exam.