Read this question paper carefully. Write your answers in the bluebooks provided. Make sure to write your name on your bluebook. Do not write anything you want graded on the question paper. If you need another bluebook or if you have a question, please come to the front of the classroom. You may keep this question paper when you leave.
#define N 1000

void makex (int x[]) {
    int i, j, t;
    for (i=0; i<N; i++) x[i] = i;
    for (i=0; i<N; i++) {
        j = rand () % N;
        t = x[i]; x[i] = x[j]; x[j] = t;
    }
}

void foo (double A[], double B[]) {
    int i;
    int x[N];
    makex (x);
    for (i=0; i<N; i++) A[x[i]] += B[x[i]];
}

Assume that the compiler can prove that the A and B arrays do not overlap in memory in any calling context for foo. Recall that rand() is a C function that returns a pseudo-random integer between 0 and 2^31 - 1.
Solution: Yes. Rationale: The makex procedure randomly permutes an array of distinct integers. Thus, no two iterations of the for loop in foo refer to the same index, so each iteration could be carried out in parallel.
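As a concrete illustration (not part of the required answer, and assuming an OpenMP-capable compiler, e.g. one invoked with -fopenmp), the loop in foo could be annotated for parallel execution, since no two iterations touch the same element:

    void foo (double A[], double B[]) {
        int i;
        int x[N];
        makex (x);
        /* safe only because x is a permutation of 0..N-1,
           so the iterations are independent */
        #pragma omp parallel for
        for (i=0; i<N; i++) A[x[i]] += B[x[i]];
    }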
Solution: Yes. For example:

    void makex (int v[]) {
        int i;
        for (i=0; i<N; i++) v[i] = 0;
    }

Now every iteration of the for loop in foo refers to the same element of A and the same element of B (A[0] and B[0]). This introduces a loop-carried dependence and could cause a race condition if different iterations of the loop were carried out on different processors.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

#define LOTS   1000000   /* a large number of items */
#define N      10        /* a number of bits */
#define DOMAIN (1<<N)    /* 2 to the Nth power possible integers in N bits */

char A[DOMAIN];   /* A[i] is initially -1, then it is the N-bit parity of i */
int  stuff[LOTS]; /* lots of stuff goes into this array */

/* compute the N-bit parity of x */
int f2 (int x) {
    int i, count = 0;
    for (i=0; i<N; i++)
        if (x & (1<<i)) count++;
    return count & 1;
}

/* return the N-bit parity of x either by computing or remembering it */
int f (int x) {
    char *p = &A[x];
    if (*p == -1) *p = f2 (x);
    return *p;
}

int main () {
    int i, j, t0, t1, sum;

    /* initialize each element of A to -1 */
    for (i=0; i<DOMAIN; i++) A[i] = -1;

    /* fill the stuff array with LOTS of random N-bit numbers */
    for (i=0; i<LOTS; i++) stuff[i] = rand () % DOMAIN;

    sum = 0;
    for (j=0; j<10; j++)
        for (i=0; i<LOTS; i++)
            sum += f (stuff[i]);

    printf ("sum = %d\n", sum);
    exit (0);
}

This program fills the stuff array with many random values between 0 and 2^N - 1, then goes through the array 10 times computing the N-bit parity of each of these integers using the f function. Each time it finds the parity of a number, the program adds it to a running total.
The f function uses the A array to remember previously computed parity values; the first time the parity of an integer x is requested, it is computed with the f2 function and placed into the A array at the xth position. Each subsequent time the parity of x is requested, the value is simply read from the array rather than being computed again. (This is a common technique called memoization.) Let's call this program f.
An alternative way to get the same output is to skip the memoization entirely and change the body of the inner loop to sum += f2 (stuff[i]);. This causes the parity to be recomputed every time, so there are 10 times as many parity computations as in the original program. Let's call this modified program f2.
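Concretely, the modified doubly nested loop in main would be:

    sum = 0;
    for (j=0; j<10; j++)
        for (i=0; i<LOTS; i++)
            sum += f2 (stuff[i]);   /* recompute the parity on every reference */

with the rest of the program unchanged.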
We can recompile the program with different values of N to get different behaviors. The following graph is a plot of the amount of time taken by f and f2 for different values of N on a real computer:
Solution: The working set of elements in the A array begins to exceed the capacity of the cache at N=18. At N=20, the random nature of the accesses to A completely overwhelms the capacity of the cache and recomputing the parity becomes faster since cache misses take so long.
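To see why N=18 is a plausible break point (the cache sizes of the test machine are not given, so the sizes here are assumptions): A is an array of char, so it occupies exactly 2^N bytes, i.e. 256 KB at N=18 and 1 MB at N=20. A 256 KB second-level cache would just barely hold A at N=18, and because the indices into A are uniformly random, once the array no longer fits nearly every access to it is a miss.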
Solution: The branch that continues the for loop in the f2 function is predicted perfectly by a two-level branch predictor until the history length of the branch predictor is exceeded, at which point a single branch misprediction penalty is added to every call to f2. This is what is happening here: the history length for the predictor is exceeded at N=17.
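One way to see why the break would fall at exactly N=17 (assuming a 16-bit branch history, which is what that break point suggests): the loop-continuation branch in f2 is taken N times and then not taken once on every call. With a history of at least N bits, the all-taken history seen just before the final, not-taken execution is unique, so the loop exit can be predicted. With a 16-bit history and N=17 or more, the same all-taken history also precedes at least one of the taken executions, so the corresponding predictor entry is trained toward both outcomes and at least one misprediction per call results.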
Solution: I don't really know, but I wanted to see what you thought :-) It is probably due to the random accesses exceeding the capacity of the TLB, so many address translations after this point must go through the page table. I won't take points off for any answer to this question.
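For a rough sense of the numbers (the page size and TLB size are not given in the question, so both are assumptions here): with 4 KB pages the A array spans 2^(N-12) pages, i.e. 64 pages at N=18 and 256 pages at N=20, and the randomly indexed accesses touch those pages in no particular order. Once the page count exceeds the number of data-TLB entries, most accesses to A miss in the TLB, and each miss adds the cost of a page-table walk.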
Solution: Filling stuff with i % DOMAIN causes almost all of the accesses to the A array to be sequential, introducing spatial locality into the f function where there was no locality before. It is likely that the curve for f will remain under the curve for f2 for the entire length of the graph; if not, the new curve for f will at least rise very slowly relative to the curve for f in the original graph.
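The change being discussed replaces the random fill of stuff with a sequential one; only the fill loop in main changes:

    /* fill the stuff array with LOTS of sequential N-bit numbers */
    for (i=0; i<LOTS; i++) stuff[i] = i % DOMAIN;

so f is then called with the values 0, 1, 2, ... in order and marches through A sequentially.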
This problem is inspired by question 6.30 from the book, which was assigned as a review problem. The answer to this question is basically the same as the answer to that problem, but with branch predictor entries instead of cache lines. Solution: Yes. There could be constructive interference in the branch predictor. For example, suppose there are two kinds of threads, A and B. In the new program there are now two A threads and two B threads. If, say, the two A threads have similar behavior, then one thread might "train" the branch predictor in anticipation of the second thread.
Solution: Yes. There could be destructive interference in the branch predictor, i.e. mispredictions due to conflicts. If we increase the number of threads and each thread has different behavior, then we would expect the number of conflicts in the branch predictor to increase, i.e. multiple threads using the same counter in the prediction tables but with different behaviors.