Lecture 7: Caches
Memory Hierarchy
Computer systems often need a large amount of storage to accomplish their
goals. There are many different kinds of storage devices. As a rule,
faster storage devices cost more per byte of storage. Thus, multiple
levels of storage are used. Ideally, the resulting memory hierarchy
costs only a little more than using only the least expensive memory, but
delivers performance almost equal to using only the most expensive memory.
Memory Hierarchy

Type of Memory        | Typical Size    | Speed             | Cost             | Typical Organization                 | Technology
----------------------|-----------------|-------------------|------------------|--------------------------------------|-------------------
CPU Registers         | 128 bytes       | 1 cycle           | Very high        | Explicitly managed                   | High-speed latches
Level 1 D-Cache       | 8KB to 64KB     | 1 to 3 cycles     | High             | 2-way set associative, 64-byte lines | 6T or 8T SRAM
Level 1 I-Cache       | 8KB to 64KB     | 1 cycle           | High             | 2-way set associative, 64-byte lines | 6T or 8T SRAM
Level 2 Unified Cache | 128KB to 2MB    | 6 to 20 cycles    | Moderate         | 2-way set associative, 64-byte lines | 6T SRAM
Main memory           | 4MB to a few GB | ~100 cycles       | Low, ~$0.10/MB   | Banks and buffers                    | SDRAM
Disk                  | Hundreds of GB  | ~5,000,000 cycles | Very low, ~$1/GB | Swap partition                       | Rotating platters
Caches
We will focus on the L1 and L2 caches in the memory hierarchy, although
much of the discussion can apply to any level. For now, let us imagine
that there is only one cache between the CPU and main memory. The basic
idea is to have a small, expensive cache memory hold frequently used
values from the larger, cheap main memory. This way, access to data
and instructions is significantly faster than it would be with just the
main memory. When a memory access can be satisfied from the cache, we say
that a cache hit has occurred; when it cannot, a cache miss
occurs. For a given cache access latency,
we want to minimize the miss rate, i.e., the fraction of accesses
to memory that miss in the cache.
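A standard way to quantify this trade-off is the average memory access time
(AMAT): the time for a hit plus the miss rate times the miss penalty. Here is
a minimal sketch in C; the latency numbers are made up for illustration:

#include <stdio.h>

/* Average memory access time: the hit time plus the fraction of
   accesses that miss times the extra cost of going to the next level. */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* Hypothetical numbers: 2-cycle hit, 5% miss rate, 100-cycle miss penalty. */
    printf("AMAT = %.1f cycles\n", amat(2.0, 0.05, 100.0));   /* prints 7.0 */
    return 0;
}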
Why do caches work?
The answer to this question is locality of reference.
The distribution of data and instruction accesses to memory is highly
non-uniform. Although a program may touch many memory locations, most
accesses tend to focus on just a few memory locations. If we cache these
locations, we can make the average access a lot faster without having to
have a large cache. There are two kinds of locality that are important
for caches:
- Temporal locality. This is the observation that if memory
location i is accessed now, then memory location i is
likely to be accessed again in the near future.
- Spatial locality. This is the observation that if memory
location i is accessed now, then memory locations near i
are likely to be touched in the near future.
Programs that don't exhibit a lot of locality tend not to perform well
on modern processors because they can't take advantage of the cache.
It's important for programmers interested in performance to be aware of
locality and, if possible, write their programs to take advantage of the
cache. For example, consider the following C function to multiply two
matrices:
void matrix_multiply (float A[N][K], float B[K][M], float C[N][M]) {
    int i, j, l;
    for (i=0; i<N; i++) for (j=0; j<M; j++) {
        C[i][j] = 0.0;
        for (l=0; l<K; l++)
            C[i][j] += A[i][l] * B[l][j];
    }
}
Two-dimensional arrays in C are laid out in row-major order. This means
that an array is laid out row by row, so that, e.g., element C[0][0]
is adjacent to C[0][1] in memory. To do matrix multiplication,
we have to access each element of B N times.
It doesn't matter in what order we access the elements, as long as we pair
up the right dot product with the right element of C. As written
above, the inner loop accesses only one element of each row of
B before moving on to the next row, so consecutive references to
B are M floats apart in memory. Thus, there is no spatial
locality in the references to B.
How can we change the program to have more spatial locality? One way is
to rewrite our function to use the transpose of B instead:
void matrix_multiply2 (float A[N][K], float B[K][M], float C[N][M]) {
    float Bt[M][K];
    int i, j, l;
    // Bt is the transpose of B
    for (i=0; i<K; i++) for (j=0; j<M; j++) Bt[j][i] = B[i][j];
    // now we work with Bt instead of B
    for (i=0; i<N; i++) for (j=0; j<M; j++) {
        C[i][j] = 0.0;
        for (l=0; l<K; l++)
            C[i][j] += A[i][l] * Bt[j][l];
    }
}
Now the function accesses Bt with a very high degree of spatial
locality, since it accesses every element of one row of Bt before
moving on to the next row. Interestingly, the new function executes more
instructions and consumes more memory, but will take much less time on a
real processor (I have measured a factor of 2 speedup with this function
for large values of K).
This was a very simple example of how locality affects performance.
Compilers are able to transform more complex codes using blocking, tiling,
loop fusion/fission, and other optimizations to get more locality.
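As a rough illustration of blocking, here is a hand-blocked version of the
matrix multiply above; the tile size BS is a hypothetical tuning parameter you
would pick so that the tiles of A, B, and C fit in the cache together. This is
a sketch, not a tuned implementation:

#define BS 32   /* tile size; tune so three BS x BS tiles fit in the cache */

void matrix_multiply_blocked (float A[N][K], float B[K][M], float C[N][M]) {
    int i, j, l, ii, jj, ll;
    for (i=0; i<N; i++) for (j=0; j<M; j++) C[i][j] = 0.0;
    /* Walk the iteration space tile by tile so the pieces of A, B, and C
       that a tile touches stay resident in the cache while it runs. */
    for (ii=0; ii<N; ii+=BS)
        for (ll=0; ll<K; ll+=BS)
            for (jj=0; jj<M; jj+=BS)
                for (i=ii; i<ii+BS && i<N; i++)
                    for (l=ll; l<ll+BS && l<K; l++)
                        for (j=jj; j<jj+BS && j<M; j++)
                            C[i][j] += A[i][l] * B[l][j];
}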
Let's quickly think about another example of how we can exploit locality.
Suppose you need to execute a certain number of independent transactions,
e.g., queries against a database or a search engine, or time steps of a
simulation run for different sets of inputs.
The transactions have the following properties:
- Transactions are buffered in a queue, and arrive as fast as we can
process them.
- Each transaction is composed of several phases that each exercise
different sets of data.
- Transactions don't depend on one another.
- We are interested in maximum throughput; response time for a given
transaction is not as important.
If we evaluate one transaction after another, so that we execute each
phase one after another, we are walking all over memory and likely
giving the cache a hard time. However, if we rearrange the program,
we could do the following:
1. Read N transactions from the queue into a buffer, where N
is a number we fine-tune empirically.
2. Run the first phase on each of the N transactions.
3. Run the second phase on each transaction, and so forth.
4. Once the final phase has been done on each transaction, go back to
Step 1.
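A sketch of this batched structure in C; the transaction type, the phase
functions, and the dequeue helper are hypothetical placeholders:

#define BATCH 64   /* the N from Step 1, tuned empirically */

/* Hypothetical transaction type and phase functions. */
struct transaction { int id; /* ... per-transaction state ... */ };
extern int dequeue_transactions(struct transaction *buf, int max);  /* returns count read */
extern void phase1(struct transaction *t);
extern void phase2(struct transaction *t);
extern void phase3(struct transaction *t);

void process_forever(void) {
    struct transaction buf[BATCH];
    int i, n;
    for (;;) {
        n = dequeue_transactions(buf, BATCH);   /* Step 1: read up to N transactions */
        /* Run each phase across the whole batch, so the data that a phase
           exercises stays hot in the cache for all n transactions. */
        for (i = 0; i < n; i++) phase1(&buf[i]);
        for (i = 0; i < n; i++) phase2(&buf[i]);
        for (i = 0; i < n; i++) phase3(&buf[i]);
    }
}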
Cache Organization
Caches are organized as blocks of bytes. When data is transferred
from main memory to the cache, or from the cache to main memory, it is
transferred as a block. Blocks are typically larger than a single memory word,
both to exploit spatial locality and because, once main memory has serviced
the first access, subsequent accesses to consecutive locations are faster.
The following questions give a high-level overview of the issues related
to caches:
- Block placement. Where can a block be placed in the cache?
- Block identification. How is a block found in the cache?
- Block replacement. Which block should be replaced on a miss?
- Write strategy. What happens on a write?
Each question has more than one possible answer.
Block Placement
There are three types of policies regarding block placement:
- Direct mapped. In a direct mapped cache, there is only one place
a particular block can go. Let N be the number of blocks in
the cache. A block at memory address x is typically mapped to
the block given by x modulo N. Think of this as a
hash table where collisions are resolved by evicting the element already in
the table.
- Fully associative. In a fully associative cache, the block can go
anywhere in the cache.
- Set associative. A set associative cache is composed of many
sets. A set contains a certain number of blocks. For instance,
a 4-way set associative cache would consist of many sets of four blocks
each. Suppose there are N sets. A block from memory address
x is mapped onto a set with a hash function, like x modulo
N. The block can be placed in any one of the blocks in the set.
This organization is halfway between direct mapped and fully associative.
Suppose a cache has N blocks. Then a direct mapped cache can be
thought of as a 1-way set associative cache with N sets, and a fully
associative cache can be thought of as an N-way set associative
cache with one set.
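To make the mapping concrete, here is a small sketch in C of how a block
address selects its candidate location under each organization; the geometry
constants are hypothetical:

/* Hypothetical geometry: 1024 blocks, organized 4 ways. */
#define NUM_BLOCKS 1024
#define ASSOC      4
#define NUM_SETS   (NUM_BLOCKS / ASSOC)   /* 256 sets */

/* Direct mapped: block address x has exactly one candidate slot. */
unsigned direct_mapped_slot(unsigned block_addr) {
    return block_addr % NUM_BLOCKS;
}

/* Set associative: x picks a set; the block may occupy any of the
   ASSOC ways within that set. */
unsigned set_associative_set(unsigned block_addr) {
    return block_addr % NUM_SETS;
}

/* Fully associative: the whole cache is a single set, so no index is
   computed at all; every block in the cache is a candidate. */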
Why do you think we have these different kinds of organizations? What
are the advantages and disadvantages of each?
Block Identification
Suppose we are looking for the data at memory address x in the
cache. Once we arrive at the right set or block in the cache, how do we
know whether it contains that data, or some other data?
Besides data, each block has a tag associated with it that comes
from the address for which it holds the data. A block also has a valid
bit associated with it that tells whether or not the data in that
block is valid. An address from the CPU has the following fields used to
refer to data in the cache:
- Block address. This is the portion of the address used to identify the
block to the cache. It is divided into two portions:
- Set index. This portion indicates the set in the cache.
- Tag. This portion is stored with the block, and compared
against when the cache is accessed.
- Block offset. This is the portion used to identify individual words
or bytes within a block.
For example, let's say we have a 4-way set-associative cache with 256 sets,
and each block is 64 bytes. Thus, it can contain 256 x 4 x 64 bytes = 64KB of data.
A 32-bit address from the CPU is divided into the following fields:
- Bits 0 through 5: Block offset
- Bits 6 through 13: Set index
- Bits 14 through 31: Tag
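As a sketch, these fields can be extracted from a 32-bit address with shifts
and masks; the widths follow the example above:

#include <stdint.h>

#define OFFSET_BITS 6   /* 64-byte blocks -> bits 0 through 5  */
#define INDEX_BITS  8   /* 256 sets       -> bits 6 through 13 */

/* Split a 32-bit address into block offset, set index, and tag. */
void split_address(uint32_t addr,
                   uint32_t *offset, uint32_t *index, uint32_t *tag) {
    *offset = addr & ((1u << OFFSET_BITS) - 1);
    *index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    *tag    = addr >> (OFFSET_BITS + INDEX_BITS);   /* bits 14 through 31 */
}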
How many bits of SRAM will be needed for this cache? Each block has 64
bytes = 512 bits of SRAM, plus one valid bit, plus 18 bits for the tag =
531 bits. There are 256 sets of 4 blocks each, giving 543744 total bits,
or about 66.4KB. So about 4% of the bits in the cache are for overhead.
Block replacement
When a cache miss occurs, we bring a block from main memory into the cache.
We have to have a policy for replacing some block in the cache with the
new block.
What is the policy for a direct mapped cache?
For caches with some amount of associativity, the following policies have
been used:
- Random. A pseudorandom number generator gives the identity of the
block to be replaced within a set. This spreads replacements uniformly
through the set, and is cheap to implement.
- Least-recently used (LRU). The block that was accessed least recently
within the set is replaced. This assumes that data that haven't been used
recently won't be used again in the near future.
- First in, first out (FIFO). This approximates LRU by replacing the
block that was fetched longest ago, rather than the one accessed least recently.
This is cheaper to implement than LRU.
What do you think the effect of these policies is? Why not just always
use random, since it is cheap?
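To make LRU concrete, here is one simple way to choose the victim within a
set in software, keeping a per-block record of the last access time; this is
a sketch, and real hardware often uses cheaper approximations such as
pseudo-LRU:

#define ASSOC 4

struct block {
    int      valid;
    unsigned tag;
    unsigned last_used;   /* value of a global access counter at the last access */
};

/* Choose which block in a set to replace: prefer an invalid block,
   otherwise evict the one whose last access is oldest (LRU). */
int choose_victim(struct block set[ASSOC]) {
    int i, victim = 0;
    for (i = 0; i < ASSOC; i++) {
        if (!set[i].valid)
            return i;                                   /* empty slot: no eviction needed */
        if (set[i].last_used < set[victim].last_used)
            victim = i;                                 /* older than the current choice */
    }
    return victim;
}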
Write Strategy
When a write to memory comes from the CPU, there are two basic policies
the cache can use:
- Write through. The data is written to the cache as well as to main
memory.
- Write back. The data is stored in the cache only. The data is only
written back to memory when the block is replaced.
To implement write-back, each block has another bit of metadata associated
with it called a dirty bit. When the block is first fetched from
main memory, the dirty bit is cleared to 0. When the block is modified,
the dirty bit is set to 1. When the block is replaced, it is written out
to main memory only if the dirty bit is set to 1.
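Here is a sketch of this write-back bookkeeping, with hypothetical helper
functions standing in for the actual memory traffic:

struct line {
    int           valid;
    int           dirty;
    unsigned      tag;
    unsigned char data[64];
};

/* Hypothetical helpers that move a 64-byte block to or from main memory. */
extern void mem_write_block(unsigned tag, unsigned index, const unsigned char *data);
extern void mem_read_block(unsigned tag, unsigned index, unsigned char *data);

/* Replace the contents of a cache line with the block for new_tag
   under a write-back policy. */
void fill_line(struct line *l, unsigned index, unsigned new_tag) {
    if (l->valid && l->dirty)
        mem_write_block(l->tag, index, l->data);   /* flush modified data first */
    mem_read_block(new_tag, index, l->data);       /* fetch the new block */
    l->tag   = new_tag;
    l->valid = 1;
    l->dirty = 0;                                  /* clean until the next write */
}

/* On a write hit, the cache updates the data and marks the line dirty;
   main memory is not touched until the line is eventually replaced. */
void write_hit(struct line *l, unsigned offset, unsigned char byte) {
    l->data[offset] = byte;
    l->dirty = 1;
}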
What are the advantages and disadvantages of these two policies? How do
they affect bus traffic and multiprocessor systems?
Cache Technology
How do caches store data? Caches are made from matrices of SRAM cells.
Let's consider, for simplicity, a direct mapped, tagless cache of 16
kilobits. It is arranged as a 128 x 128 matrix of SRAM cells.
- An index into the cache is 14 bits wide.
- This index is divided into 7 row address lines and 7 column address lines.
- The 7 column address lines are input to a column selector
that precharges one of the 128 column select lines to 1.
- The 7 row address lines are input to a row decoder that
sends Vdd to one of the 128 row select lines.
- The SRAM cell that has both its column select and row select lines
set to Vdd is selected for reading or writing.
- The transistors for each SRAM cell are very small so they can be
packed tightly. This causes them to be somewhat slow to read, because it
takes a long time for them to drive the output lines; special tricks are
used to mitigate this latency.
Let's look at what happens at the transistor level when a bit is read.
Here is a diagram of a six transistor (6T) SRAM cell:
The two inverters (i.e., the four middle transistors) provide storage for
a single bit. Think of the bit as continuously going around and around
the two inverters in a counter-clockwise manner. The bit is the output
of the top inverter, and the complement of the bit is the output of the
bottom inverter as well as the input to the top inverter. The following
occurs when we want to read the bit stored in this SRAM cell:
- There are actually two column select lines. Each one is precharged
to 1 by the column selector logic. All other column select lines in the
cache are set to 0.
- The row select line for this cell is set to 1 by the row decoder logic.
All other row select lines are set to 0.
- At this point, the current from the row select line flows to two
NMOS transistors which begin to allow the bit and its inverse to flow out
of the SRAM cell to the column select lines. One of the column select
lines is slowly pulled to 0. The process is slow because the tiny SRAM
transistors have to drive the substantial column select lines that have
been precharged to 1. A senseamp at the bottom of the column select lines
quickly detects any difference in current and decides that the bit is a 0
if the left column select line is going to 0, or a 1 if the right column
select line is going to 0.
The following occurs if we want to write a bit x to the SRAM
cell:
- The row select line is set to 1, allowing current to flow between
the column select lines and the inverters.
- The left column select line is driven to x and the right
line is driven to the complement of x. The column drivers are much
stronger than the cell's tiny inverters, so the value currently going around the
inverters is replaced with the new value.
- Other columns in the same row are set to a level of current we can
think of as being somewhere between 0 and 1. The result is that those
inverters not column-selected simply keep the values they had before.
For next time, we will continue in Chapter 5.