CS 5513 Fall 2008, Homework 4
Part I of this homework is due at the beginning of class, October 15, 2008.
Part II of this homework is due at the beginning of class, October 29, 2008.
Cache Simulators
This assignment consists of two programming problems. The first problem
is to modify the xocolatl emulator to include a cache simulator.
The second is to write a C program that can determine the configuration
of the L1 and L2 caches for some x86 computers to which you will be
given access.
- For this problem, modify the xocolatl emulator to include
a cache simulator:
- Download the
xocolatl-cache
emulator for the MC6809 processor. Unpack it into your home directory
on one of the UTCS machine with tar -xvzf xocolatl.tar.gz .
Note that this is a new version of the emulator that has been augmented
with hooks for a cache simulator.
- Modify the file xocolatl/src/projects/cache.java to
simulate a cache as described below.
- Compile xocolatl with your modifications and run it on the benchmarks
in the main xocolatl directory. Use the cache configurations specified below.
- Turn in your code as well as a writeup of observations and experiences including two bar graphs:
- One that shows the miss rate for each benchmark for each cache configuration, as well as the arithmetic mean over all benchmarks for each configuration.
- One that shows the CPI for each benchmark for each cache configuration, as well as the geometric mean over all benchmarks for each configuration.
In your writeup, write about your observations of the effect of different
cache parameters on miss rate and performance.
Your code should produce no output; it should only simulate the
cache.
around in memory and do computation. If you like to have debug output,
please disable it in the code you use to turn in your assignment.
Programming Interface and What to Simulate
You should modify xocolatl/projects/cache.java to simulate a cache
and accurately report the number of misses.
The file xocolatl/projects/cache.java has the following methods
defined:
- cache (int, int, int, int); This constructor accepts four
integer arguments: the size of one cache block in bytes, the number of
blocks per set, the number of sets in the cache, and the replacement
policy where 0 means none, 1 means random, 2 means least-recently-used
(LRU), and 3 means first-in-first-out (FIFO).
- void load (int); This method is called by the simulator
every time a byte is loaded from memory. The single parameter is the
address of the load in memory. Addresses are 19 bits.
- void store(int); This method is called every time there is
a store to memory. Again, the single parameter is a 19-bit address.
- long report_misses(); This method returns the number of
cache misses that have occurred.
The command-line arguments of the xoco program have been augmented
to accept a -cache option followed by three integers and one string
specifying the simulated cache's block size, associativity (i.e. number
blocks per set), number of sets, and replacement policy (one of "lru",
"fifo", "random", or "none" for direct-mapped). The emulator automatically
calls the constructor with these parameters and translates the name of a
replacement policy to a number from 0 to 3 so all you have to do is write
the simulator code in cache.java.
Hints for writing a precise simulator
Pay attention to these hints so your simulator will compute the exact
number of misses for each benchmark and configuration:
- Initially (i.e. implement this in the constructor), all cache blocks
are invalid. A cache block becomes valid the first time something is placed
into it from memory. Each block should have a "valid" bit associated with
it that is true if the block is valid, false otherwise.
- For every replacement policy, an invalid block should be replaced
before any valid block. That is, before you begin using the replacement
policy to decide which block to replace, you should see if there is an
invalid block that can be used first.
- Notice that the load and store instructions don't
indicate what the actual value being loaded or stored is. It doesn't matter;
all the simulator is concerned with is statistics about the number of misses,
so the actual values stored in the cache don't need to be represented by
the program.
- "Least recently used" means the one block out of all the blocks in
a set that was loaded from or stored to the longest time ago. How you
implement this policy is up to you; there is more than one way to do it
but they all result in the same number of misses.
- "First-in-first-out" means the one block out of all the blocks in
a set that was replaced the longest time ago. Again, how you implement
this is up to you.
- You will need to keep a count of the number of misses encountered
by your simulator. Make sure this count is a long, since it is
possible that the number of memory accesses could exceed the capacity of
an int.
- You will find that your simulations will be time-consuming. Be patient
and try to code efficiently. You might want to parallel-process, i.e., log
onto more than one machine and run different benchmarks or configurations
simultaneously.
Compiling and running xocolatl
As in Homework 2,
to compile xocolatl together with the code you have written
in xocolatl/src/projects/cache.java, change to the directory
xocolatl/src and type make.
There are 5 benchmarks:
- compiler - A BASIC compiler compiling itself.
- interpreter - A BASIC interpreter running a BASIC compiler on a sorting program.
- linker - An assembler/linker program assembling and linking the output of the BASIC compiler with the BASIC runtime.
- lisp - A LISP interpreter computing the 12th Fibonacci number with an inefficient recursive function.
- sort - A sorting program doing a bubble sort on pseudo-random numbers.
You can run the benchmarks with the simulator using the run-benchmark
script located in the xocolatl directory. The first argument to run-benchmark should be the name of the benchmark, and the second argument
should be, in quotes, a command-line argument to the simulator to pass
the cache configuration. For example, to run the sort benchmark
with a 2-way set associative cache that has 8-byte blocks, 16 sets,
and an LRU replacement policy, you would type the following from
the xocolatl directory:
./run-benchmark sort "-cache 8 2 16 lru"
(The quotes are necessary so that command-line option to the emulator
won't be parsed by the run-benchmark shell script. Note also that the
replacement policy e.g. "lru" should be in lowercase.)
Each benchmark will run for a while, periodically printing output related
to the running simulated program, and then the simulator will finish with
statistic related to the cache simulation. In the case of running sort with the above configuration, the last few lines of output look like this:
executed 27569591 instructions
9915309 misses
116242901 accesses
8.530% miss rate
1019100491 estimated cycles
36.965 estimated CPI
As you can see, the emulator reports the number of instructions executed,
the number of simulated cache misses, the number of memory accesses,
the simulated cache miss rate, an estimated number of cycles, and an
estimated CPI. The estimated number of cycles and CPI are based on a
machine model where every instruction takes one cycle and every cache miss
has a penalty of 100 cycles (this is computed automatically by the emulator
so your code doesn't need to do anything special about cycles or CPI).
Configurations
In your writeup, report the results of running all 5 benchmarks with
the following configurations:
- 1KB cache, 8 byte blocks, 8-way set associativity, LRU replacement.
- 768 byte cache, 8 byte blocks, 6-way set associativity, LRU replacement.
- 1KB cache, 8 byte blocks, 4-way set associativity, LRU replacement.
- 1KB cache, 8 byte blocks, 2-way set associativity, LRU replacement.
- 1KB cache, 8 byte blocks, 4-way set associativity, random replacement.
- 1KB cache, 8 byte blocks, 2-way set associativity, random replacement.
- 1KB cache, 8 byte blocks, 4-way set associativity, fifo replacement.
- 1KB cache, 8 byte blocks, 2-way set associativity, fifo replacement.
- 1KB cache, 4 byte blocks, 4-way set associativity, LRU replacement.
- 1KB cache, 1 byte blocks, 4-way set associativity, LRU replacement
- 1KB cache, 8 byte blocks, direct-mapped
- 2KB cache, 8 byte blocks, direct-mapped
- 1KB cache, 4 byte blocks, direct-mapped
- 1KB cache, 8 byte blocks, fully associative, random replacement
Clearly, you will have to do some arithmetic to translate the configurations
as specified above into the parameters expected by the simulation.
Examples
These examples will help you develop your simulator. These are only
examples; you should turn in results for all of the benchmarks for all of
the configurations above.
- 2KB cache, 8 byte blocks, 2-way set associativity, LRU replacement.
2KB is 2048 bytes, 2048 / 8 = 256 blocks, 256 / 2 = 128 sets, so to
run the compiler benchmark we would type
./run-benchmark compiler "-cache 8 2 128 lru"
When the simulation finished, the last few lines should look like this:
executed 62815127 instructions
4702985 misses
222067984 accesses
2.118% miss rate
533113627 estimated cycles
8.487 estimated CPI
- 1KB cache, 8 byte blocks, direct-mapped. There are 1024 / 8 = 128 blocks,
so for compiler we run:
./run-benchmark compiler "-cache 8 1 128 none"
and the last few lines of output are:
executed 62815127 instructions
11121057 misses
222067984 accesses
5.008% miss rate
1174920827 estimated cycles
18.704 estimated CPI
Note that the second parameter for the "-cache" option is 1, i.e., 1 block per set, since it is a direct-mapped cache.
- 4KB cache, 16-byte blocks, 8-way set-associative, random replacement.
There are 4096 / 16 = 256 blocks, 256 / 8 = 32 sets, so for sort we
run
./run-benchmark sort "-cache 16 8 32 random"
and the last few lines of output are:
executed 27569591 instructions
3755 misses
116242901 accesses
0.003% miss rate
27945091 estimated cycles
1.014 estimated CPI
- The same example as above, but FIFO replacement on the lisp
benchmark:
./run-benchmark lisp "-cache 16 8 32 fifo"
...
executed 34373897 instructions
273094 misses
112687694 accesses
0.242% miss rate
61683297 estimated cycles
1.794 estimated CPI
- 768 byte cache, 16-byte blocks, 6-way set associative, fifo replacement:
./run-benchmark lisp "-cache 16 6 8 fifo"
...
executed 34373897 instructions
3387548 misses
112687694 accesses
3.006% miss rate
373128697 estimated cycles
10.855 estimated CPI
- 2KB cache, 16-byte blocks, fully associative, LRU replacement.
A 2KB cache with 16 byte blocks has 128 blocks. We can simulate a
fully-associative cache as having one big set with associativity equal to
the number of blocks. So:
./run-benchmark linker "-cache 16 128 1 lru"
...
executed 113214151 instructions
5497880 misses
426017473 accesses
1.291% miss rate
663002151 estimated cycles
5.856 estimated CPI
Note that, for these examples with lru, direct-mapped, and fifo replacement,
your simulator should match exactly the number of misses.
With random replacement, your result should be close, but depending on
your implementation of randomness might be slightly different.
Note: If you are running the emulator after having logged in
remotely, you might need to set your DISPLAY environment variable
to ":0.0" to fool the JVM into believing it has a screen before it can run.
See the README file in xocolatl/ for information about
using the video display option.
- Write a C program to determine the block size, associativity, and
number of sets for the L1 and L2 caches of x86 machines to which you will
be given access:
- Download the following files:
- You can develop your program on your CS department Linux account or
some other i386 Linux installation to which you have access. Move all
the files into one directory and compile it by typing make.
- Use testcache.c as a starting point. This program runs the
Sieve of Eratosthenes for at least half a billion instructions for arrays
ranging in sizes from 1KB to 8MB. For each array size, it computes the
CPI for the sieve function in the file sieve.S. The CPI increases
with the array size because of increasingly poor spatial locality.
- Modify testcache.c such that after it is finished testing
each array size, it records each CPI in an array. Then, it attempts to
determine the cache capacity, set size, block size, and associativity by
testing a variety of such configurations in a memory hierarchy simulator.
- The function sieve2 is functionally equivalent to the optimized
code in sieve.S. Modify sieve2 to make a call to your
memory hierarchy simulator for each array reference. The memory hierarchy
simulator is implemented as a function that returns the latency of a memory
access to a given address, taking into account a hit and/or miss in the
L1 and/or L2 caches, as well as the latencies of these simulated caches.
For instance, for the statement v[i] = 1;, you would add something
like total_latency += memory_access (&v[i]);.
- Each configuration will result in a sequence of CPI values.
The configuration with the highest correlation to the CPI from the
real machine is likely to be the configuration of the actual machine.
Once you have tested all configurations, report the most highly correlated
configuration for the L1 and L2 caches.
- You have been given login accounts on each of the following
machines: valdepenas.cs.utsa.edu, malaga.cs.utsa.edu,
and catalunya.cs.utsa.edu. Your password on each machine is
your banner ID without the leading '@' sign; you will need to change your
password on each machine when you first log in.
- Using your program, determine the L1 and L2 cache configurations for
each of these machines. Your program may not run longer than 20 minutes
on any of these machines; if your program runs longer you will suffer
a great loss of points and your processes on those machines are subject
to being deleted.
Hints
- Develop a cache abstract data type so that you can use the same code to
manipulate both the L1 and L2 cache simulators. Your memory_access
function then calls the L1 simulator to see if there is a hit; if so,
the latency is the L1 hit latency. If not, check the L2 cache; if there
is a hit, the latency is the L2 hit latency. Otherwise, the latency is
that of memory.
- Your memory hierarchy and cache access ADT functions could look something
like this:
void init_cache (cache *c, int nsets, int assoc, int blocksize, int hitlat, int mislat);
int cache_access (cache *c, unsigned int address);
int memory_access (cache *l1, cache *l2, unsigned int address);
How do you make a pointer like &v[i] into an unsigned
integer? Use a typecast. It is important to use the same data
addresses used in the real CPI computation (to the extent possible) so
that the same conflicts will manifest themselves in the correct cache
configuration.
- The following code will find the correlation coefficient between two double arrays of size n:
double corr (double x[], double y[], int n) {
double sumxy = 0, sumx = 0, sumy = 0, sumx2 = 0, sumy2 = 0;
double num, dterm1, dterm2;
int i;
for (i=0; i<n; i++) {
sumx += x[i];
sumy += y[i];
sumxy += x[i] * y[i];
sumx2 += x[i] * x[i];
sumy2 += y[i] * y[i];
}
num = sumxy - ((sumx * sumy) / n);
dterm1 = (sumx2 - ((sumx * sumx) / n));
dterm2 = (sumy2 - ((sumy * sumy) / n));
return num / sqrt (dterm1 * dterm2);
}
Correlation coefficients range from -1 to 1, with 1 being highest
positive correlation. Thus, the cache configuration that yields the
highest correlation coefficient between the simulated CPI array and the
real CPI array is likely to be the correct configuration.
- What should the various latencies be? It's up to you; look at some
documentation for recent x86 processors and make an educated guess about
the right values. Your program must work unmodified on each computer,
so you will have to use values that apply more-or-less in general.
- There are thousands of possible configurations; how can your program
run in less than 20 minutes? Three things:
- Assume the L1 cache is at most 128KB with set associativity
of at most 8 and a block size of at most 32.
- Assume the L2 cache is at most 8192KB with set associativity
of at most 32 and a block size of at most 128.
- Determine the L1 cache size first by assuming a huge L2 cache
and finding the correlation only up to 256KB, then following that
by determining the L2 size using your fixed L1 cache.
Constraints
The output of your program should be brief and make it clear what the
capacity, blocks per set, block size, and number of sets is for both
L1 and L2. Your program must run in at most 20 minutes on even the
slowest of the machines. Your program may not probe the computer
hardware to determine the cache sizes; it must use the behavior
of the unmodified sieve function as its only way of probing
the machine. All of your code must be contained in a single source file
named testcache.c. Your program must be in C and must compile
and run on the computers mentioned above.
Coordinating Work
We must share all the machines among the 32 people in the class. Do your
development on a different machine, e.g. the computers in the Main Lab.
Use the class newsgroup, utsa.cs.5513.d, to coordinate usage of the
machine. Use the program 'tin -r' on the machines in the Main Lab to
access the newsgroup. Read messages others have posted, then post
a message saying e.g., "I plan to use valdepenas from 4:00pm to 5:00pm
on October 4." You can see if someone else is using the same machine by
typing w. If you see someone else is already working, log off.
Multiple users might perturb each others' results. If you run out of time
on the machines then you can't test your program; this is not an excuse for
a poor program as you have had three weeks to work on this assignment.
You may not work together on this assignment with other classmates or
receive assistance from any person other than your professor or teaching
assistant. Do not use code that isn't your own work for this assignment.
Do not allow other classmates access to your code. Do not allow anyone
access to a computer where your assignment is stored.
Your assignment should be nicely typed or word-processed. You must
turn in your writeup on paper in class. You must turn in your Java
and C code electronically by emailing your cache.java file
and testcache.c file to the teaching assistant before
the beginning of class. The files you email should be sufficient to run
your programs, so do not email any other files. Your Java and C code
will be graded not only for whether or not it works, but also for clarity
and documentation.
Part I of this homework, including the Java code and writeup, is due at the beginning of class, October 15, 2008.
Part II of this homework, including the C code and writeup, is due at the beginning of class, October 29, 2008.
Late assignments will not be accepted.