Rutgers University CS 505 Spring 2006
Computer Structures
Midterm Examination Solutions

Read this question paper carefully. Write your answers in the bluebooks provided. Make sure to write your name on your bluebook. Do not write anything you want graded on the question paper. If you need another bluebook or if you have a question, please come to the front of the classroom. You may keep this question paper when you leave.

Writing your name: 2 points

  1. Draw the circuit diagram for an AND gate using PMOS and NMOS transistors. Use at most 6 transistors. Hint: like the other logic gates we have seen, some transistors will be wired in series and some will be wired in parallel.
    Solution (12 points):
    
    [Circuit diagram: a CMOS NAND gate followed by a NOT gate]
    
    The leftmost four transistors form a NAND gate.  The two on the right
    form a NOT gate.  NAND is 0 if and only if both a and b are 1.  Thus
    the two n-type transistors at the left of the diagram are wired in
    series: they allow a 0 to flow to the input of the NOT gate only when
    both a and b trigger the gate inputs of their respective n-type
    transistors.  NAND is 1 if either of a or b, or both, are 0.  Thus the
    two p-type transistors in the middle of the diagram are wired in
    parallel: they allow a 1 to flow to the input of the NOT gate whenever
    a or b (or both) is 0, triggering the gate input of one or both p-type
    transistors.  The NOT gate inverts the output of the NAND gate: if it
    is 0, the p-type transistor is triggered and a 1 flows to the output;
    if it is 1, the n-type transistor is triggered and a 0 flows to the
    output.
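    
    As a sanity check, a small C sketch that models the pull-down and
    pull-up networks described above and prints the truth table of the
    resulting AND gate:
    
    #include <stdio.h>
    
    int main(void) {
        for (int a = 0; a <= 1; a++) {
            for (int b = 0; b <= 1; b++) {
                /* NAND stage: the two series n-type transistors pull the
                   intermediate node to 0 only when both inputs are 1; the
                   two parallel p-type transistors pull it to 1 otherwise. */
                int nand = (a && b) ? 0 : 1;
                /* NOT stage: one p-type/n-type pair inverts the node. */
                int out = !nand;
                printf("a=%d b=%d AND=%d\n", a, b, out);
            }
        }
        return 0;
    }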
    

  2. Consider the following x86 assembly language subroutine:
    compute:
            pushl    %ebp
            movl     %esp,%ebp
            subl     $8,%esp
            andl     $0xfffffff0,%esp
            movl     $0,%ecx
            movl     $1,%edx
            movl     $3,%eax
    loop:
            addl     %edx,%ecx
            addl     %ecx,%edx
            decl     %eax
            jns     loop
            pushl   %edx
            pushl   $.L1
            call    printf
            leal    8(%esp),%esp
            xorl     %eax,%eax
            leave
            ret
    .L1:    .string "%d\n"
    
    This subroutine is similar to the one you decoded in your first homework. It is in GNU assembler syntax: the 'l' suffix in, e.g., xorl means "32-bit longword," and the destination of an instruction comes last, e.g., "addl %edx,%ecx" adds the contents of the two registers and places the result in %ecx.
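    
    For reference, a C sketch of the subroutine's behavior, assuming the
    standard cdecl calling convention and that decl/jns keeps looping while
    the decrement leaves %eax non-negative (so the body runs four times):
    
    #include <stdio.h>
    
    int compute(void) {
        int ecx = 0, edx = 1;             /* movl $0,%ecx; movl $1,%edx */
        for (int eax = 3; eax >= 0; eax--) {
            ecx += edx;                   /* addl %edx,%ecx */
            edx += ecx;                   /* addl %ecx,%edx */
        }
        printf("%d\n", edx);              /* pushl %edx; call printf */
        return 0;                         /* xorl %eax,%eax */
    }
    
    The two registers walk up the Fibonacci sequence, so the call prints 34.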
  3. When we are designing a new ISA, having a lot of registers seems like a good idea. However, what are at least two disadvantages of having a large number of registers in the ISA?
    Solution (12 points):
    
    Here are some disadvantages:
    
    1.  A larger register file takes longer to access, and might increase the
    clock period for the microarchitecture.
    
    2.  Increasing the number of registers increases the number of bits needed
    to name a register in an instruction.  This can make it hard to fit
    instructions into a fixed-length word, or might necessitate moving to a
    larger instruction word (see the sketch after this solution).
    
    3.  On a context switch, all registers must be saved.  With more registers,
    the overhead of saving them is higher.
    
    There were some weaker answers that were also accepted.  Chief among these
    were "increasing chip complexity" and "more registers are more expensive (in
    terms of area or energy or whatever)."  You could say this about almost any
    new feature added to an ISA; that's why these answers are weaker.  (However,
    no points were taken off for correct but weak answers.  Points were taken
    off for weak and incorrect answers.)
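    
    To make the second disadvantage concrete, a short C sketch of how
    register-specifier bits grow with the register count in a hypothetical
    three-operand instruction:
    
    #include <stdio.h>
    
    int main(void) {
        /* Naming one of n registers takes ceil(log2(n)) bits. */
        for (int nregs = 8; nregs <= 256; nregs *= 2) {
            int bits = 0;
            while ((1 << bits) < nregs) bits++;
            /* A three-operand instruction (e.g., add rd, rs1, rs2)
               spends 3 * bits just on register specifiers. */
            printf("%3d registers: %d bits each, %2d bits per 3-op instruction\n",
                   nregs, bits, 3 * bits);
        }
        return 0;
    }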
    
    
  4. Consider a 64KB 4-way set associative cache. The cache has blocks of 64 bytes, and there are 1 valid bit and 1 dirty bit per block. A physical address has 46 bits. How many total SRAM bits will be needed to implement this cache? Show your work.
    Solution (12 points):
    
    There are 64K bytes.  64K is 2^16.  64 is 2^6.  4 is 2^2.
    There are 2^16 / 2^6 = 2^10 total blocks.
    There are 2^10 / 2^2 = 2^8 sets.
    
    So a physical address can be broken down into the following fields:
    
    [ 32 bits of tag | 8 bits of set index | 6 bits of block offset ]
    
    32 + 8 + 6 = 46 so all address bits are accounted for.
    
    Each of the 2^10 blocks needs 32 = 2^5 bits of tag.
    In addition, a block needs 1 valid bit + 1 dirty bit = 2 extra bits.
    So the meta-data for the entire cache is:
    2^5 * 2^10 + 2 * 2^10 = 32 * 1024 + 2 * 1024 = 32768 + 2048 = 34816 bits.
    
    Each block consists of 64 bytes, or 64 * 8 = 2^6 * 2^3 = 2^9 bits.  
    
    So the data for the cache is 2^10 * 2^9 = 2^19 = 524288 bits.
    
    So the whole cache needs 524288 + 34816 = 559104 bits of SRAM.
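    
    The same arithmetic, as a small C check:
    
    #include <stdio.h>
    
    int main(void) {
        long cache_bytes = 64 * 1024;   /* 64KB cache            */
        long block_bytes = 64;          /* 64-byte blocks        */
        long ways        = 4;           /* 4-way set associative */
        long addr_bits   = 46;          /* physical address bits */
    
        long blocks = cache_bytes / block_bytes;   /* 2^10 */
        long sets   = blocks / ways;               /* 2^8  */
        long offset_bits = 6;                      /* log2(block_bytes) */
        long index_bits  = 8;                      /* log2(sets)        */
        long tag_bits    = addr_bits - index_bits - offset_bits;  /* 32 */
    
        long meta = blocks * (tag_bits + 2);       /* + valid and dirty */
        long data = blocks * block_bytes * 8;
        printf("%ld sets; meta = %ld, data = %ld, total = %ld bits\n",
               sets, meta, data, meta + data);     /* total = 559104    */
        return 0;
    }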
    
  5. It has been observed that a small direct-mapped instruction cache can have fewer misses than a fully associative instruction cache using an LRU replacement policy, even though both caches have the same number of blocks and the same number of bytes per block. What program behavior can elicit such a counterintuitive result? Why?
    Solution (12 points):
    
    This was one of the questions from the review.  The solution is to notice
    that the pattern of instruction accesses in loops is at odds with the LRU
    replacement policy.  Suppose the fully associative cache has n blocks,
    and there is a loop that contains n+1 blocks.  When the loop reaches the
    branch that brings it back to the beginning, the next access will go to the
    least recently used block in the loop, i.e., the first block.  In a fully
    associative cache, this block will have been replaced to make room for
    the last of the n+1 blocks in the loop.  This will incur a cache miss,
    causing replacement of some other block.  That block will be the least
    recently used block, i.e., the next block in the loop.  However, the next
    access will be to that block, causing another cache miss: we just had that
    block in the cache and we threw it out, only to bring it right back in.
    Clearly this pattern of access will have a cascading effect and every
    single access in the loop will cause a cache miss.  Contrast this with a
    direct-mapped cache: there will be 1 or possibly 2 conflict misses per
    iteration of the loop, but every other access to a block in the loop
    will hit in the cache.
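    
    A small C simulation of this access pattern (the block counts are
    arbitrary), comparing an N-block fully associative LRU cache with an
    N-block direct-mapped cache on a loop spanning N+1 blocks:
    
    #include <stdio.h>
    
    #define N 8              /* blocks in each cache (arbitrary)    */
    #define LOOPLEN (N + 1)  /* the loop touches one block too many */
    #define ITERS 100        /* loop iterations to simulate         */
    
    int main(void) {
        long fa_tag[N], fa_stamp[N], dm_tag[N];
        long now = 0, fa_miss = 0, dm_miss = 0;
        for (int i = 0; i < N; i++) {
            fa_tag[i] = -1; fa_stamp[i] = 0; dm_tag[i] = -1;
        }
        for (int it = 0; it < ITERS; it++) {
            for (long b = 0; b < LOOPLEN; b++) {   /* loop blocks 0..N */
                now++;
                /* Fully associative with LRU replacement. */
                int hit = -1, lru = 0;
                for (int i = 0; i < N; i++) {
                    if (fa_tag[i] == b) hit = i;
                    if (fa_stamp[i] < fa_stamp[lru]) lru = i;
                }
                if (hit >= 0) fa_stamp[hit] = now;
                else { fa_miss++; fa_tag[lru] = b; fa_stamp[lru] = now; }
                /* Direct mapped: block b may only live in set b % N. */
                if (dm_tag[b % N] != b) { dm_miss++; dm_tag[b % N] = b; }
            }
        }
        printf("fully associative LRU misses: %ld / %d\n",
               fa_miss, ITERS * LOOPLEN);
        printf("direct mapped misses:         %ld / %d\n",
               dm_miss, ITERS * LOOPLEN);
        return 0;
    }
    
    With these numbers the fully associative cache misses on all 900
    accesses, while the direct-mapped cache misses only 207 times (9 cold
    misses, then 2 conflict misses per iteration).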
    
  6. The quantum, i.e., the maximum amount of time between context switches, has remained more or less the same for the last couple of decades. However, the number of instructions executed per second has increased steadily. What impact does this have on data cache hit rates? Why?
    Solution (12 points):
    
    Being able to execute more instructions in a quantum generally leads to
    higher data cache hit rates.  After a context switch, a process has a
    large part of its working set flushed from the cache by other processes.
    When the process begins to run again, it spends a certain number of
    instructions taking cache misses and refilling the cache with its working
    set, then spends the rest of the quantum with relatively few misses.
    When more instructions are executed per quantum, this first phase involves
    relatively fewer of them and the second phase relatively more, so more
    time is spent in the phase during which the hit rate is higher.
    
    Looked at the other way: if we held the number of instructions executed
    per second constant but decreased the quantum, each process would be
    interrupted more often, do less useful work, and suffer more cache misses.
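    
    A back-of-the-envelope model in C (the numbers are illustrative): fix
    the miss count needed to rebuild the working set, and watch the overall
    hit rate rise as more instructions fit in a quantum:
    
    #include <stdio.h>
    
    int main(void) {
        double refill_misses     = 10000.0; /* misses to rebuild working set */
        double steady_miss_rate  = 0.01;    /* miss rate once it is resident */
        double accesses_per_insn = 0.3;     /* data accesses per instruction */
    
        long insns_per_quantum[] = { 1000000, 10000000, 100000000 };
        for (int i = 0; i < 3; i++) {
            double accesses = insns_per_quantum[i] * accesses_per_insn;
            double misses   = refill_misses
                            + steady_miss_rate * (accesses - refill_misses);
            printf("%9ld insns/quantum: hit rate %.4f\n",
                   insns_per_quantum[i], 1.0 - misses / accesses);
        }
        return 0;
    }
    
    As the quantum admits more instructions, the fixed refill cost is
    amortized over more accesses and the hit rate approaches the
    steady-state value.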
    
  7. In the Patterson and Ditzel paper, several advantages were listed of RISC over CISC. Describe three of these advantages.
    Solution (12 points):
    
    - Implementation feasibility: being able to fit an entire CPU on a single
    chip.
    
    - Design time.  A chip that takes less time to design can take more advantage
    of advances in VLSI than a chip that takes longer.
    
    - Speed.  Reducing the complexity of the CPU shortens various critical
    paths in its circuits, so the clock period can be shorter.
    
    - Better use of chip area.  A RISC is implemented with fewer transistors
    than a CISC.  Given the same chip area, the RISC could devote more area
    to an on-chip cache, improving performance.
    
    - Better mapping of high-level languages to machine language.  CISC
    instructions that are intended to support some high-level language concept
    often end up unused because the concept is implemented differently for
    different languages and/or the compiler cannot analyze the code well enough
    to apply the concept.  RISC avoids this by providing simpler instructions
    that are easier to compile for.
    
  8. Answer the following questions briefly, i.e., with just a few words or a single sentence: