Read this question paper carefully. Write your answers in the bluebooks provided. Make sure to write your name on your bluebook. Do not write anything you want graded on the question paper. If you need another bluebook or if you have a question, please come to the front of the classroom. You may keep this question paper when you leave.
#define N 1000

void makex (int x[]) {
    int i, j, t;
    for (i=0; i<N; i++) x[i] = i;
    for (i=0; i<N; i++) {
        j = rand () % N;
        t = x[i]; x[i] = x[j]; x[j] = t;
    }
}

void foo (double A[], double B[]) {
    int i;
    int x[N];
    makex (x);
    for (i=0; i<N; i++) A[x[i]] += B[x[i]];
}

Assume that the compiler can prove that the A and B arrays do not overlap in memory in any calling context for foo. Recall that rand() is a C function that returns a pseudo-random integer between 0 and 2^31 - 1.
Solution: Yes. Rationale: The makex procedure randomly permutes an array of distinct integers. Thus, no two iterations of the for loop in foo refer to the same index, so each iteration could be carried out in parallel.
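As a concrete illustration (not part of the required answer, and assuming an OpenMP-capable compiler, e.g. one invoked with -fopenmp), the loop in foo could be annotated for parallel execution, since no two iterations touch the same element:

    void foo (double A[], double B[]) {
        int i;
        int x[N];
        makex (x);
        /* safe only because x is a permutation of 0..N-1,
           so the iterations are independent */
        #pragma omp parallel for
        for (i=0; i<N; i++) A[x[i]] += B[x[i]];
    }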
Solution: Yes. For example:

    void makex (int v[]) {
        int i;
        for (i=0; i<N; i++) v[i] = 0;
    }

Now every iteration of the for loop in foo refers to the same element of A and the same element of B (A[0] and B[0]). This introduces a loop-carried dependence and could cause a race condition if different iterations of the loop were carried out on different processors.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

#define LOTS   1000000   /* a large number of items */
#define N      10        /* a number of bits */
#define DOMAIN (1<<N)    /* 2 to the Nth power possible integers in N bits */

char A[DOMAIN];   /* A[i] is initially -1, then it is the N-bit parity of i */
int  stuff[LOTS]; /* lots of stuff goes into this array */

/* compute the N-bit parity of x */
int f2 (int x) {
    int i, count = 0;
    for (i=0; i<N; i++)
        if (x & (1<<i)) count++;
    return count & 1;
}

/* return the N-bit parity of x either by computing or remembering it */
int f (int x) {
    char *p = &A[x];
    if (*p == -1) *p = f2 (x);
    return *p;
}

int main () {
    int i, j, t0, t1, sum;

    /* initialize each element of A to -1 */
    for (i=0; i<DOMAIN; i++) A[i] = -1;

    /* fill the stuff array with LOTS of random N-bit numbers */
    for (i=0; i<LOTS; i++) stuff[i] = rand () % DOMAIN;

    sum = 0;
    for (j=0; j<10; j++)
        for (i=0; i<LOTS; i++)
            sum += f (stuff[i]);

    printf ("sum = %d\n", sum);
    exit (0);
}

This program fills the stuff array with many random values between 0 and 2^N - 1, then goes through the array 10 times computing the N-bit parity of each of these integers using the f function. Each time it finds the parity of a number, the program adds it to a running total.
The f function uses the A array to remember previously computed parity values; the first time the parity of an integer x is requested, it is computed with the f2 function and placed into the A array at the xth position. Each subsequent time the parity of x is requested, the value is simply read from the array rather than being computed again. (This is a common technique called memoization.) Let's call this program f.
An alternative way to get the same output is to skip the memoization entirely and change the body of the inner loop to sum += f2 (stuff[i]);. This causes the parity to be recomputed every time, so there are 10 times as many parity computations as in the original program. Let's call this modified program f2.
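Concretely, the modified doubly nested loop in main would be:

    sum = 0;
    for (j=0; j<10; j++)
        for (i=0; i<LOTS; i++)
            sum += f2 (stuff[i]);   /* recompute the parity on every reference */

with the rest of the program unchanged.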
We can recompile the program with different values of N to get different behaviors. The following graph is a plot of the amount of time taken by f and f2 for different values of N on a real computer:
Solution: The working set of elements in the A array begins to exceed the capacity of the cache at N=18. At N=20, the random nature of the accesses to A completely overwhelms the capacity of the cache and recomputing the parity becomes faster since cache misses take so long.
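To see why N=18 is a plausible break point (the cache sizes of the test machine are not given, so the sizes here are assumptions): A is an array of char, so it occupies exactly 2^N bytes, i.e. 256 KB at N=18 and 1 MB at N=20. A 256 KB second-level cache would just barely hold A at N=18, and because the indices into A are uniformly random, once the array no longer fits nearly every access to it is a miss.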
Solution: The branch that continues the for loop in the f2 function is predicted perfectly by a two-level branch predictor until the history length of the branch predictor is exceeded, at which point a single branch misprediction penalty is added to every call to f2. This is what is happening here: the history length for the predictor is exceeded at N=17.
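One way to see why the break would fall at exactly N=17 (assuming a 16-bit branch history, which is what that break point suggests): the loop-continuation branch in f2 is taken N times and then not taken once on every call. With a history of at least N bits, the all-taken history seen just before the final, not-taken execution is unique, so the loop exit can be predicted. With a 16-bit history and N=17 or more, the same all-taken history also precedes at least one of the taken executions, so the corresponding predictor entry is trained toward both outcomes and at least one misprediction per call results.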
Solution: I don't really know, but I wanted to see what you thought :-) It is probably due to the random accesses exceeding the capacity of the TLB, so many address translations after this point must go through the page table. I won't take points off for any answer to this question.
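For a rough sense of the numbers (the page size and TLB size are not given in the question, so both are assumptions here): with 4 KB pages the A array spans 2^(N-12) pages, i.e. 64 pages at N=18 and 256 pages at N=20, and the randomly indexed accesses touch those pages in no particular order. Once the page count exceeds the number of data-TLB entries, most accesses to A miss in the TLB, and each miss adds the cost of a page-table walk.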
Solution: Filling stuff with i % DOMAIN causes almost all of the accesses to the A array to be sequential, introducing spatial locality into the f function where there was no locality before. It is likely that the curve for f will remain under the curve for f2 for the entire length of the graph; if not, the new curve for f will at least rise very slowly relative to the curve for f in the original graph.
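The change being discussed replaces the random fill of stuff with a sequential one; only the fill loop in main changes:

    /* fill the stuff array with LOTS of sequential N-bit numbers */
    for (i=0; i<LOTS; i++) stuff[i] = i % DOMAIN;

so f is then called with the values 0, 1, 2, ... in order and marches through A sequentially.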
This problem is inspired by question 6.30 from the book, which was assigned as a review problem. The answer to this question is basically the same as the answer to that problem, but with branch predictor entries instead of cache lines. Solution: Yes. There could be constructive interference in the branch predictor. For example, suppose there are two kinds of threads, A and B. In the new program there are now two A threads and two B threads. If, say, the two A threads have similar behavior, then one thread might "train" the branch predictor in anticipation of the second thread.
Solution: Yes. There could be destructive interference in the branch predictor, i.e. mispredictions due to conflicts. If we increase the number of threads and each thread has different behavior, then we would expect the number of conflicts in the branch predictor to increase, i.e. multiple threads using the same counter in the prediction tables but with different behaviors.