Solution: First we find the cycles-per-instruction (CPI) for each machine: CPI_A = 1 + (.20 * .15 * 99) = 3.97 CPI_B = 1 + (.20 * .05 * 99) = 1.99 This is a measure of performance normalized to clock rate. Now we need to get the absolute performance in terms of time so we can compare the two. Let's compute nsPI, i.e., nanoseconds per instruction. We can do this by divided the CPI by the number of gigahertz (GHz) because on an X GHz system, one clock cycle is 1 / X nanoseconds. nsPI_A = CPI_A / GHz_A = 3.97 / 1.8 = 2.21 nsPI_B = CPI_B / GHz_B = 1.99 / 1.0 = 1.99 Machine B has a lower CPI than Machine A, so Machine B is faster.
The speedup is 2.21 / 1.99 ~= 1.11, or 11%.
Loop: L R2, 0(R1) ; R2 = memory[R1] L R3, 4(R1) ; R3 = memory[R1+4] DADD R4, R2, R3 ; R4 = R2 + R3 S R4, 0(R1) ; memory[R1] = R4 DADDIU R1, R1, #4 ; R1 = R1 + 4 DADDIU R2, R1, #-400 ; compare R1 to 400 BEQZ R2, Loop ; if R2 = 0 then goto LoopAssume that:
Solution: Let's look at a pipeline diagram for this problem: cycle 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 L R2, 0(R1) IF ID EX MM WB L R3, 4(R1) IF ID EX MM WB DADD R4, R2, R3 IF ID s s EX MM WB S R4, 0(R1) IF s s ID s s EX MM WB DADDIU R1, R1, #4 IF s s ID EX MM WB DADDIU R2, R1, #-400 IF ID s s EX MM WB BEQZ R2, Loop IF ID s s s s EX MM WB L R2, 0(R1) IF ID EX MM WB From the first fetch of the first load to the second fetch of the first load, there are 16 cycles. From the code and assumption that R1 starts as 0 it is clear that the loop iterates 100 times, so the total number of cycles is 1600 plus a few extra cycles to drain the pipeline.
Loop: L R2, 0(R1) ; R2 = memory[R1] L R3, 4(R1) ; R3 = memory[R1+4] DADDIU R1, R1, #4 ; R1 = R1 + 4 DADD R4, R2, R3 ; R4 = R2 + R3 DADDIU R2, R1, #-400 ; compare R1 to 404 S R4, -4(R1) ; memory[R1-4] = R4 BEQZ R2, Loop ; if R2 = 0 then goto Loop
Solution: Let's look at the pipeline diagram for the new code: cycle 01 02 03 04 05 06 07 08 09 10 11 12 13 L R2, 0(R1) IF ID EX MM WB L R3, 4(R1) IF ID EX MM WB DADDIU R1, R1, #4 IF ID EX MM WB DADD R4, R2, R3 IF ID EX MM WB DADDIU R2, R1, #-400 IF ID EX MM WB S R4, -4(R1) IF ID EX MM WB BEQZ R2, Loop IF ID EX MM WB L R2, 0(R1) ss IF ID EX MM WB Now there are no stall cycles except for the one waiting for the branch to resolve. The MM stage of the second load can forward to the EX stage of the DADD instruction because of the instruction in between. The branch no longer has to wait for the value of R2 to be computed because of forwarding and the fact that the store is moved between the DADDIU and the branch. From the first fetch of the first load to the second fetch of the first load there are 8 cycles. So this loop takes 100 * 8 = 800 cycles plus a few to drain the pipeline.
Solution: The speedup is 1600 / 800 = 2, or 100%
Solution: The block offset is log2(16) = 4. 32 address bits - 19 tag bits - 4 block offset bits = 9 set index bits. So there are 2^9 = 512 sets. Each set has 4 blocks, so there are 512 * 4 = 2048 blocks. Each block has 16 bytes, so there are 2048 * 16 = 32768 bytes, i.e. 32KB.
0x0567 0x03c6 0x18b7 0x0873 0x087c 0x0872 0x18ec 0x1874 0x18ba 0x17ab 0x0870 0x056c 0x18e1 0x0146 0x02c5 0x02c2How many of these loads will miss in the cache? Hint: the block offset is the least significant hex digit, and the set index is the next two hex digits.
Solution: The way to do this problem is to go through the addresses sequentially. The first time we see a set index, it is a miss. The next time we see it, it is a miss if the tag (i.e. the first digit) doesn't match the last time we saw the same set index. We can safely ignore the block offset because it is never > 12, so a memory access never spans two blocks. The cache is direct mapped so we don't have to worry about replacement or trying to find a match among several blocks in a set. Thus: 0x0567 miss - invalid block 0x03c6 miss - invalid block 0x18b7 miss - invalid block 0x0873 miss - invalid block 0x087c hit - set 87, tag 0 0x0872 hit - set 87, tag 0 0x18ec miss - invalid 0x1874 miss - set 87, tag 1 (!=0) 0x18ba hit - set 8b, tag 1 0x17ab miss - invalid block 0x0870 miss - set 87, tag 0 0x056c hit - set 56, tag 0 0x18e1 hit - set 8e, tag 1 0x0146 miss - invalid 0x02c5 miss - invalid 0x02c2 hit - set 2c, tag 0 So there are 10 misses.
int code (void) { int i, j; int c; i = 1; loop: j = 1; loop2: c = c + i + j; j++; if (j <= 3) goto loop2; i++; if (i <= 1000) goto loop; }The compiler will produce conditional branches for the two if statements.
Solution: For the outer loop branch, which is taken 999 times and not taken once, there will be 2 initial mispredictions as the counter has values 0 and 1 and predicts not taken, and then there will be 1 last misprediction as the loop exists and the counter is 3. So the outer loop branch contributes 3 mispredictions. The first iteration of the inner loop branch will look like this: old counter value prediction outcome new counter value 0 not taken taken 1 1 not taken taken 2 2 taken not taken 1 So the first iteration of the outer loop will see 3 mispredictions from the inner loop. The second iteration of the inner loop branch will look like this: old counter value prediction outcome new counter value 1 not taken taken 2 2 taken taken 3 3 taken not taken 2 So the second iteration of the outer loop branch see 2 mispredictions from the inner loop. Each subsequent iteration will look like this: old counter value prediction outcome new counter value 2 taken taken 3 3 taken taken 3 3 taken not taken 2So each of the 998 subsequent iterations of the outer loop branch will see 1 misprediction from the inner loop. So in total, there will be 3 + 3 + 2 + 1 * 998 = 1006 mispredictions.
Solution: The global history looks like this: code outcome if (j <= 3) goto loop2; taken if (j <= 3) goto loop2; taken if (j <= 3) goto loop2; not taken if (i <= 1000) goto loop2; taken resulting in a pattern like this: 110111011101110111011101..... The following histories are seen only once, at the beginning: history outcome of next branch result due to table value being initially 0 0000 taken misprediction 0001 taken misprediction 0011 not taken correct prediction 0110 taken misprediction Now we get into a steady state with the following histories: history outcome of next branch result due to table value being initially 0 1101 taken two initial mispredictions 1011 taken two initial mispredictions 0111 not taken no mispredictions 1110 taken two initial and one final misprediction Thus, in the steady state, all of the inner and outer loop branches are predicted correctly. The only mispredictions are a few initial ones and the final one that comes as the outer loop exists. Counding from the above tables, there are a total of 10 mispredictions.