CS 3853 Spring 2010 Midterm Exam Solutions

Write your answers on this question paper. Your answers must be clear and legible. You may not use your book, your notes, or any other such material during this exam. You may use a calculator. Use the space provided to show how you arrived at your answer for partial credit. Only the answers you write in the spaces provided will be graded; you may use the back of the pages for your own private calculations. If a question seems ambiguous, state any assumptions you need to make to work the problem.

  1. True/False. For each of the following answer 'T' for True or 'F' for False.
    1. _F_ In the classic five-stage pipeline, if branches are executed in the decode (ID) stage, then there is a three-cycle stall for every branch.
    2. _F_ Larger page sizes are good for virtual memory because they reduce the effect of fragmentation in memory.
    3. _F_ The RISC ISA for the second homework assignment has 16 floating point registers.
    4. _F_ Pipeline registers are usually held in R1 and R2 for MIPS.
    5. _F_ Moore's Law states that the number of devices per unit area will increase quadratically over time.
    6. _T_ A modern microprocessor might execute many instructions in the time it takes to do one main memory access.
    7. _T_ WAR hazards do not occur with the classic five-stage pipeline.
    8. _F_ Write buffers temporarily store executed branches.
    9. _F_ A page table must fit into the cache.
    10. _T_ A typical page size is 4096 bytes.

  2. Consider two machines, machine A and machine B. For both machines, all instructions except for loads take one cycle. Loads take one cycle plus an additional "cache penalty." Machine A has a clock rate of 1.0 GHz and a cache penalty of 5 cycles. Machine B has a clock rate of 2.0 GHz and a cache penalty of 20 cycles. Loads are 33% of all instructions.

      Which machine is faster, machine A or B?

      Let's compute the CPI for the first machine:
      CPI_A = 1 + 0.33 * 5 = 2.65
      Now let's compute the nsPI (nanoseconds per instruction) for the first machine:
      nsPI_A = 1ns/cycle * 2.65 = 2.65ns
      
      Let's compute the CPI for the second machine:
      CPI_B = 1 + 0.33 * 20 = 7.60
      Now let's compute the nsPI for the second machine:
      nsPI_B = 1ns/2cycles * 7.60 = 3.8ns
      
      So machine A is faster, even though it has a lower clock rate.
      
      What is the speedup?
      3.8 / 2.65 = 1.44
      
      So the speedup is 1.44, or 44%
      

  3. Assume the classic 5-stage pipeline, with registers being written in the first half of WB and read in the last half of ID. There are two memory ports, one for instructions and one for data. Consider the following code:
     
            L      R9, 0(R5)
            DADD   R1, R9, R8
            DSUB   R2, R1, R9
            DADD   R3, R1, R2
    
    1. How many cycles does the code take, from first fetch to last write-back, assuming no forwarding?
      f = fetch
      d = instruction decode / read registers
      x = execute / compute effective address
      m = memory
      w = write back registers
      
                                                   1 1 1 1 1
                                 1 2 3 4 5 6 7 8 9 0 1 2 3 4
              L      R9, 0(R5)   f d x m w
              DADD   R1, R9, R8    f d s s x m w
              DSUB   R2, R1, R9      f s s d s s x m w
              DADD   R3, R1, R2        f s s s s d s s x m w
      
      14 cycles
      
    2. How many cycles would the code take, from first fetch to last write-back, if we allow forwarding?
                                 1 2 3 4 5 6 7 8 9
              L      R9, 0(R5)   f d x m w
              DADD   R1, R9, R8    f d x s m w
              DSUB   R2, R1, R9      f d s x m w
              DADD   R3, R1, R2        f s d x m w
      
      9 cycles
      

  4. Consider a 64KB cache with 256 sets. Addresses are 32 bits. Tags are 19 bits.
    1. What is the number of bytes per block for this cache?
      Let's first figure out how many set index bits there are: 
      log2 256 = 8 set index bits
      
      Now we can figure out how many offset bits there are: 
      32 bits total - (19 tag bits + 8 index bits) = 5 offset bits
      
      Now we know how many bytes per block there are:
      25 = 32 bytes per block
      
      
    2. What is the associativity (i.e. blocks per set) for this cache?
      Let's first figure out how many blocks there are:
      
      64KB = 65536 bytes / 32 bytes per block = 2048 blocks
      
      Now we can figure out how many blocks per set there are:
      
      2048 blocks / 256 sets = 8 blocks per set
      
    3. For the address 0x12345678, what is the tag?
      In binary, this number is: 00010010001101000101011001111000
      
      Let's separate this number into the 5-bit offset, 8-bit set index, and 19-bit tag:
      tag=[0001001000110100010], index=[10110011], offset=[11000]
      
      So the tag is 1001000110100010 in binary, or 91a2 in hex, or 37282 in decimal.
      
      Can we do this without converting to binary?  Yes: divide the address by
      the number of sets times the number of bytes per block, i.e. 0x12345678 /
      (32 * 256) = 37282.