CS 5513: Computer Architecture
Lecture 1: Discussion, Circuits, Metrics, Goals
Discussion
Let's have a discussion about performance. We'll begin with an example
designed to show you where computer architecture can be relevant to you
as a programmer.
Our starting point is a favorite toy example of mine. Let's write a program
that counts the prime numbers between 0 and 10,000,000. We'd like a
fast program, so we'll work through a few ideas. We'll do this in Java.
If you don't know Java, just pretend it is C or C++. If you don't know
any of these, learn one as soon as possible.
- First idea: test each number for primality using a little function
class primes {
    static final int N = 10000000;
    public static void main (String args[]) {
        int i, j;
        // count the number of integers from 2..N that are prime
        for (j=0,i=2; i<=N; i++)
            if (is_prime (i)) j++;
        System.out.println (j);
    }
    // return true if and only if n is prime
    static boolean is_prime (int n) {
        int i, l;
        if (n == 2) return true;
        // trial division by up to square root of n
        l = (int) Math.sqrt (n) + 1;
        for (i=2; i<=l; i++) if (n % i == 0) return false;
        // must be prime
        return true;
    }
}
This takes 53.6 seconds on my laptop, a Pentium 4M running at 2.0GHz.
- Second idea: Modify the little function to recognize that we
only need to divide by 2 and odd numbers to check for primality.
class primes2 {
    static final int N = 10000000;
    public static void main (String args[]) {
        int i, j;
        // count the number of integers from 2..N that are prime
        for (j=0,i=2; i<=N; i++)
            if (is_prime (i)) j++;
        System.out.println (j);
    }
    // return true if and only if n is prime
    static boolean is_prime (int n) {
        int i, l;
        if (n == 2 || n == 3) return true;
        // even numbers > 2 aren't prime
        if (n % 2 == 0) return false;
        // trial division by odd numbers up to square root of n
        l = (int) Math.sqrt (n) + 1;
        for (i=3; i<=l; i+=2) if (n % i == 0) return false;
        // must be prime
        return true;
    }
}
This takes 26.7 seconds, a speedup of 2.0 over the original.
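(In general, the speedup of one version over another is just the ratio of
their running times; here, for instance,
\[
\text{speedup} = \frac{T_{\text{primes}}}{T_{\text{primes2}}}
             = \frac{53.6\ \text{s}}{26.7\ \text{s}} \approx 2.0.
\]
We'll compute all the later speedups the same way.)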
- Third idea: Modify the little function to divide only by numbers
of the form 6i +/- 1; it's easy to show that only numbers of this form
(and 2 and 3) can be prime, since every other integer is divisible by 2
or 3. Let's also consider only numbers of this form when counting.
class primes3 {
    static final int N = 10000000;
    public static void main (String args[]) {
        int i, j;
        // count the number of integers from 2..N that are prime
        j = 0;
        // count the first two primes, 2 and 3, separately
        if (N >= 2) j++;
        if (N >= 3) j++;
        // count the remaining primes, considering only numbers of the
        // form 6 * i +/- 1 (the bound and the guard keep us from
        // counting past N)
        for (i=6; i-1<=N; i+=6) {
            if (i+1 <= N && is_prime (i + 1)) j++;
            if (is_prime (i - 1)) j++;
        }
        System.out.println (j);
    }
    // return true if and only if n is prime
    static boolean is_prime (int n) {
        int i, l;
        // 2 and 3 are prime
        if (n == 2 || n == 3) return true;
        // divisible by 2 or 3? not prime.
        if (n % 2 == 0 || n % 3 == 0) return false;
        // trial division up to the square root of n,
        // by only numbers of the form 6 * i +/- 1
        l = (int) Math.sqrt (n) + 1;
        for (i=6; i<=l; i+=6) {
            if (n % (i + 1) == 0) return false;
            if (n % (i - 1) == 0) return false;
        }
        // must be prime
        return true;
    }
}
This takes 18.0 seconds, a speedup of 3.0 over the original and 1.5 over
the second one. Wow, that's a lot better, but we're about out of ideas at
this point. We could extend the idea to test only numbers of the form
30i +/- 1, 7, 11, and 13, but that's getting silly. What are the
time complexities of the first three algorithms? O(N * sqrt(N)): each of
the roughly N candidates costs up to sqrt(N) trial divisions. We've just
been improving the constant factor. What we
need is an improvement in the algorithm. Let's use Eratosthenes' sieve.
class sieve1 {
    static final int N = 10000000;
    static boolean A[];
    public static void main (String args[]) {
        int i, j, l;
        A = new boolean[N+1];
        // do a sieve of Eratosthenes
        for (i=0; i<=N; i++) A[i] = true;
        l = (int) Math.sqrt (N);
        // for each number i from 2 to square root of N...
        for (i=2; i<=l; i++)
            // ...mark off all the multiples of i
            for (j=i*i; j<=N; j+=i) A[j] = false;
        // count whatever is left; these are all the primes
        for (i=2,j=0; i<=N; i++) if (A[i]) j++;
        System.out.println (j);
    }
}
This program is very simple. It doesn't take advantage of odd numbers or any
weird properties of integers (although it could if we tweaked it). However,
it takes only 4.79 seconds: a speedup of 11.2 over the original and 3.75
over the previous program. Why is it so much faster? Because we improved the
algorithm, not just the constant factor. This algorithm has a running time
of O(N log(sqrt(N))). Why? First see that the running time is
O(sum_{i=2}^{sqrt(N)} N/i), since each i from 2 to sqrt(N) crosses off
about N/i multiples. Now factor out the N; what's left is a harmonic series
going up to sqrt(N) and missing the unit term, so it is bounded from above
by log(sqrt(N)).
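Restating that bound in symbols (nothing new here, just the same argument
in one line):
\[
\sum_{i=2}^{\sqrt{N}} \frac{N}{i}
  \;=\; N \sum_{i=2}^{\sqrt{N}} \frac{1}{i}
  \;\le\; N \ln\sqrt{N}
  \;=\; O(N \log \sqrt{N}).
\]
By now you're thinking, to heck with all this math, let's see some computer
architecture! OK, here you go: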
class sieve2 {
    static final int N = 10000000;
    static final int BLOCKSIZE = 65536;
    static boolean A[];
    public static void main (String args[]) {
        int i, j, k, l, count, base, range;
        A = new boolean[N+1];
        // do a blocked sieve of Eratosthenes
        count = 0;
        // go through the numbers from 2..N by BLOCKSIZE
        for (base=2; base<=N; base+=BLOCKSIZE) {
            for (i=0; i<BLOCKSIZE; i++) A[i] = true;
            // find all the primes in base..range-1
            // (range is an exclusive bound, so clip it to N+1)
            range = base + BLOCKSIZE;
            if (range > N+1) range = N+1;
            l = 1 + (int) Math.sqrt (range);
            for (i=2; i<=l; i++) {
                // let k be the first multiple of i at or
                // above base that isn't i itself
                k = base + (i-base%i)%i;
                if (k == i) k += i;
                // mark off all multiples of i starting at k
                for (j=k; j<range; j+=i) A[j-base] = false;
            }
            // count up what we have so far
            for (i=base; i<range; i++) if (A[i-base]) count++;
        }
        System.out.println (count);
    }
}
This takes 0.64 seconds, a speedup of 83.8 over the original and 7.5 over
the previous. This program is adapted from the original sieve, but instead
of sieving over all the numbers from 2..N at once, it does it in
blocks of 65,536. It does some extra computations, actually executing quite
a few more instructions than the original sieve. So why is it so much
faster, delivering a much better speedup than all the others? Can you think
of ways to make it even faster? Let's discuss.
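Here is one direction (a sketch of my own, under the hypothetical name
sieve3; I have not timed it on the lecture machine): find the primes up to
sqrt(N) once with a small sieve, then stride through each block only by
those primes rather than by every integer i. This also lets the array
shrink to exactly one block.
class sieve3 {
    static final int N = 10000000;
    static final int BLOCKSIZE = 65536;
    public static void main (String args[]) {
        int limit = (int) Math.sqrt (N) + 1;
        // small sieve to find the primes we will stride by
        boolean composite[] = new boolean[limit+1];
        int primes[] = new int[limit];
        int nprimes = 0;
        for (int i = 2; i <= limit; i++)
            if (!composite[i]) {
                primes[nprimes++] = i;
                for (int j = i*i; j <= limit; j += i) composite[j] = true;
            }
        // blocked sieve as before, but striding only by primes,
        // and using an array of just one block
        boolean A[] = new boolean[BLOCKSIZE];
        int count = 0;
        for (int base = 2; base <= N; base += BLOCKSIZE) {
            for (int i = 0; i < BLOCKSIZE; i++) A[i] = true;
            int range = base + BLOCKSIZE;       // exclusive bound
            if (range > N+1) range = N+1;
            for (int p = 0; p < nprimes && primes[p]*primes[p] < range; p++) {
                int i = primes[p];
                // first multiple of i at or above base that isn't i itself
                int k = base + (i - base % i) % i;
                if (k == i) k += i;
                for (int j = k; j < range; j += i) A[j-base] = false;
            }
            for (int i = base; i < range; i++) if (A[i-base]) count++;
        }
        System.out.println (count);
    }
}
This removes the redundant passes sieve2 makes for composite i values like
4, 6, 8, 9, and so on, whose multiples are already marked, while keeping
the cache-friendly block structure.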
The point is, having some knowledge of the microarchitecture that is
implementing your program can help you make it execute more efficiently.
However, you must be careful: make sure your algorithm is good before
trying to tweak it to use the microarchitecture more efficiently.
Premature optimization is the root of all evil. We could have tried
optimizing the heck out of the first prime counting function, but we would
never have been able to get it to the speed of the blocked sieve because
of the innate inefficiency of the algorithm.
CS 5513
CS 5513, Computer Architecture, is a graduate level class on computer
architecture. Let's talk about computer architecture. Here are some
questions to think about:
- What does "computer architecture" mean?
- What are some examples of computer architecture you are familiar with?
- What are some of the issues and changes in computer architecture?
- Describe the computer you first learned to program or use. How does
it compare with the computer(s) you use today?
- Is it important to make computers go faster? Why or why not?
- How can a system composed of lots of transistors do things like
arithmetic?
Administrivia
Class web page is
http://www.cs.utsa.edu/~dj/cs5513/index.html.
A syllabus is available there.
Circuits
Speaking of transistors: our focus will be on higher-level issues, but we
must always keep in mind that these ideas have some sort of physical basis.
There are three main types of components we are interested in:
- Transistors (computing).
- Wires (transporting).
- Memories (storing).
Voltage is used to represent logical values 0 and 1. For us, it doesn't
matter what the voltages are, just that there are two distinguishable
voltages that we can simply name 0 and 1. Sometimes it helps to have 0
be equal to 0 volts.
Transistors
Transistors are tiny switches. They have three terminals:
- Gate (G)
- Drain (D)
- Source (S)
When the gate terminal is triggered, current is allowed to flow from
the source to the drain. Otherwise, no current flows from the source to
the drain.
There are two types of MOS transistors used in the CMOS technology that
dominates current computer architecture:
- NMOS - triggered when 1 is applied to G.
D
_|
G --||_
|
S
NMOS transistors are good at passing 0's, but not good at passing 1's.
- PMOS - triggered when 0 is applied to G.
D
_|
G -o||_
|
S
PMOS transistors are good at passing 1's, but not 0's.
How do we compute with these devices? We'd like to be able to work
with AND, OR, NOT, etc., not with "turn on" and "bad at passing 0."
For example, consider the NOT function:
in out
-- ---
0 1
1 0
Here is an example of a NOT gate (or inverter) implemented with
transistors:
1
_|
|-o||_
___in___| |__out___
| _|
|--||_
|
0
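To make the switch behavior concrete, here is a toy switch-level model in
Java (my own sketch, not part of the lecture): the PMOS on top conducts
when in is 0, connecting out to 1, and the NMOS on the bottom conducts
when in is 1, connecting out to 0.
// Toy switch-level model of the CMOS inverter above (illustration only).
// pmosOn/nmosOn encode the trigger rules: PMOS conducts on gate=0,
// NMOS conducts on gate=1.
class inverter {
    static boolean pmosOn (boolean gate) { return !gate; }
    static boolean nmosOn (boolean gate) { return gate; }
    static boolean not (boolean in) {
        boolean pullUp = pmosOn (in);    // connects out to 1
        boolean pullDown = nmosOn (in);  // connects out to 0
        // exactly one of the two conducts, so out is well defined
        if (pullUp == pullDown)
            throw new IllegalStateException ("networks must be complementary");
        return pullUp;
    }
    public static void main (String args[]) {
        System.out.println ("0 -> " + (not (false) ? 1 : 0));
        System.out.println ("1 -> " + (not (true) ? 1 : 0));
    }
}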
How about a more complex function? The NAND function is important in
computer architecture for a number of reasons. Why? The truth table is:
a b x
- - -
0 0 1
0 1 1
1 0 1
1 1 0
How would we implement this in transistors?
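One standard answer, continuing the toy model above (again a sketch, not
the lecture's official solution): put two PMOS in parallel from 1 to the
output, and two NMOS in series from the output to 0. The parallel pull-up
conducts when either input is 0; the series pull-down conducts only when
both inputs are 1.
// Toy switch-level model of a CMOS NAND gate (illustration only).
// Two PMOS in parallel pull the output up; two NMOS in series pull it down.
class nand {
    static boolean pmosOn (boolean g) { return !g; }
    static boolean nmosOn (boolean g) { return g; }
    static boolean nand (boolean a, boolean b) {
        boolean pullUp = pmosOn (a) || pmosOn (b);    // parallel network
        boolean pullDown = nmosOn (a) && nmosOn (b);  // series network
        if (pullUp == pullDown)
            throw new IllegalStateException ("networks must be complementary");
        return pullUp;
    }
    public static void main (String args[]) {
        boolean v[] = { false, true };
        for (boolean a : v)
            for (boolean b : v)
                System.out.println ((a?1:0) + " " + (b?1:0) + " -> "
                                    + (nand (a, b) ? 1 : 0));
    }
}
Running it prints the truth table above. Notice that for every input
combination exactly one of the two networks conducts, which is what makes
CMOS gates consume (almost) no static power.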
Goals for the class
- Understand the how and why of computer architectures.
- Instruction Set Architecture
- System organization (processor, memory, I/O)
- Microarchitecture
- Learn methods of measuring and improving performance
- Metrics
- Benchmarks
- Performance methods (pipelining, ILP, prediction)
- Discover the state of the art in computer architecture
What is Computer Architecture?
- Interface design.
- Implementation. Issues related to the physical realization of the
defined ISA.
- Organization. High-level aspects of computer design, such
as the high-level design of the memory system, I/O, and CPU.
- Hardware. Low-level aspects of the implementation. Gate-level
and transistor-level design. Packaging.
Interface Design
- An interface may last through several implementations (e.g. x86 ISA).
- Instruction Set Architecture is an important interface: it defines the
boundary between the software that provides instructions and the hardware
that implements them.
- Interfaces are visible. Implementations aren't directly
visible.
- Three types of interfaces:
- Between layers, e.g. API, ISA.
- Between modules, e.g. network protocol, I/O channel or bus
- Standard representation, e.g. IEEE floating point, ASCII
Implementation
Our focus is on microarchitecture, the idea of designing the CPU
given a particular architecture.
- Implement the instruction set
- Provide the functionality necessary to carry out program
instructions.
- Exploit capabilities of technology
- Iterative process
- Generate proposed architecture
- Estimate cost
- Measure performance (through simulation)
- Current emphasis is on overcoming the sequential nature of programs.
- Deep pipelining
- Multiple issue
- Dynamic scheduling (out-of-order)
- Speculation
Application Constraints
Applications drive machine "balance."
- Numerical simulations
- Floating point performance
- Memory bandwidth
- Transaction processing
- I/O per second.
- Integer CPU performance.
- Decision support
- Embedded control
- I/O timing, deterministic behavior
- Media processing
- Low-precision "pixel" processing
- Floating point performance
Trends in Computer Architecture
- Moore's Law. The number of transistors per integrated circuit doubles
every 18 months or so. This has held since the mid-1960s. (As a sanity
check: from the 4004's 2,300 transistors in 1971 to 291 million in 2006,
the timeline below averages about one doubling every two years.)
- A corollary is that clock rates have doubled at roughly the same rate;
however, these increases are due to more than just device scaling.
- Let's look at a timeline of Intel processors for an example of the
trends in architecture and process technology:
- 1971. Intel 4004.
2300 transistors.
10 micron process.
108KHz clock rate.
4-bit processor.
- 1972. Intel 8008.
3500 transistors.
10 microns.
200KHz.
8-bit processor.
- 1974. Intel 8080.
6000 transistors.
6 microns.
2MHz.
Popular 8-bit processor, copied and refined by the Z80 and 8085.
- 1978. Intel 8086/8088.
29,000 transistors.
3 microns.
4.77-10MHz.
Appeared in original IBM PCs.
- 1982. Intel 80286.
134,000 transistors.
1.5 microns.
12.5-20MHz.
- 1985. Intel 80386DX.
275,000 transistors.
1 micron process.
33MHz.
- 1989. Intel 80486DX.
1,200,000 transistors.
0.8 microns (DX4 at 0.6 microns).
25-100MHz.
Built-in floating point, L1 cache, five-stage pipeline.
- 1993. Intel Pentium.
3,100,000 transistors.
0.5 microns.
60-133MHz.
Superscalar, dual-issue pipeline.
- 1995. Intel Pentium Pro.
5,500,000 transistors.
0.35 micron.
200MHz.
- 1997. Intel Pentium II.
7,500,000 transistors.
0.35 micron (Celeron at 0.25).
233-333MHz.
- 2000. Intel Pentium III.
28,000,000 transistors.
0.18 micron (Coppermine).
733MHz-1.2GHz.
- 2001. Intel Pentium 4.
55,000,000 transistors.
0.18, now 0.13 micron process.
up to 3.2 GHz.
- 2003. Intel Pentium M ("Banias" core).
77,000,000 transistors.
0.13 micron.
1.3GHz. (What happened?)
- 2006. Intel Core Duo.
291,000,000 transistors.
0.065 micron (65nm).
Up to 2.33GHz.
- 2007. Intel Core 2 Quad....
- What are current concerns in computer architecture?
- Biggest trend: how to effectively use multiple cores.
- Higher performance (as always).
- Memory wall. The growing gap between CPU speeds and memory speeds
is the motivation for a lot of recent research.
- Complexity. How do we manage the complexity of a device that
contains millions and soon billions of transistors?
- Verification. How can we verify that our increasingly complex
designs are correct?
- Wire delay. As gates shrink, gate delay decreases faster
than wire delay. Aggressive clocking amplifies this effect.
How do we build CPUs in which it may take several cycles for a
signal to cross the chip?
- Power. Many aspects to the power problem. How do we build
high-performance systems with reasonable energy consumption? How do
we build small (e.g. handheld) devices with constraints on energy?
- Hardware/software interface. What is the right way for the
software and hardware to communicate? As microarchitectures become
more complex, how can the compiler effectively optimize code?
- Reliability, fault-tolerance.
Metrics
How and what do we measure in computer architecture?
- Benchmarking
- Macrobenchmarks & suites. Realistic programs from the "real world" that we use to measure program behavior.
- Measure execution time.
- Characterize execution time in terms of number of events
- Number of instructions executed, cache misses, page faults, branch mispredictions, etc.
- Number of loads, stores, arithmetic, FP, etc. instructions
- Number of instructions executed per cycle on average (IPC)
- Cache hit rate, e.g., 95%, 99%
- Well-known SPEC (Standard Performance Evaluation Corporation) CPU benchmarks
- Microbenchmarks. Measure a single aspect of performance, designed
to elicit and quantify a certain behavior (see the sketch after this list).
- Traces. Replay recorded accesses, e.g. cache, branch, register.
- Simulation at many levels.
- ISA. Can measure instruction counts and relative frequencies,
basic block counts, hot-path information. Can get good estimate
of memory system behavior. A related area is emulation.
- Cycle accurate. Simulate the high-level organization
of a microarchitecture. Can measure IPC, speculation events,
more accurate memory system performance, resource utilization.
Cycle-accurate simulators provide flexibility with reasonable
degree of accuracy.
- RTL. Register Transfer Level. RTL is like a program that
specifies exactly what the microarchitecture must do. Things like
SRAM technology are still not specified, but it is much more accurate
than cycle-accurate simulation. Less flexible, more work.
- Gate-level. ANDs, ORs, etc. Very slow simulation, but very
accurate. Good estimate of fan-in, fan-out. Rough approximation
of circuit timing. Layout can be specified.
- Circuit level. Very very slow. Simulate the physics
involved in switching. Transistors are individually simulated.
Sizes and sometimes positions of transistors can be specified.
Process technology is taken into account at this level. Can be
used to test components of the microarchitecture, e.g., an array
of SRAM cells, a functional unit, etc. Useful for getting exact
timing and power measurements.
- Area and delay estimation. This gets into VLSI, but the architect must
be aware of area, delay, and power constraints in the underlying technology.
- Analysis, e.g. queueing theory. Sometimes we can analytically evaluate
a computer structure.
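As a tiny example of a microbenchmark (my own sketch, not a tool used in
the course), the following Java program estimates the time per array access
at increasing strides. Once the stride spans more than a cache line and the
working set exceeds the cache, the time per access jumps; that jump is the
single behavior this microbenchmark is designed to elicit, and it is the
same locality effect the blocked sieve exploited. The numbers it prints are
illustrative only; a real harness would warm up the JIT and repeat runs.
// Microbenchmark sketch: time per array access as a function of stride.
class stride {
    static final int SIZE = 1 << 24;      // 16M ints = 64MB, bigger than cache
    static final int ACCESSES = 1 << 22;  // same number of accesses per stride
    public static void main (String args[]) {
        int A[] = new int[SIZE];
        int sum = 0;                      // consumed below so the JIT keeps the loop
        for (int s = 1; s <= 4096; s *= 2) {
            long start = System.nanoTime ();
            // walk the array with stride s, wrapping around at the end
            for (int n = 0, i = 0; n < ACCESSES; n++, i = (i + s) & (SIZE - 1))
                sum += A[i];
            long stop = System.nanoTime ();
            System.out.println ("stride " + s + ": "
                + (stop - start) / (double) ACCESSES + " ns/access");
        }
        System.out.println (sum);
    }
}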
If I reach this point in the lecture and still have lots of time left
over, I will launch into an impromptu and very painful lecture on digital
logic design.
For next time, read Chapters 1 and 2 of the book and do Homework 1.