CS 5513: Computer Architecture
Lecture 1: Discussion, Circuits, Metrics, Goals
Discussion
Let's have a discussion about performance. We'll begin with an example
designed to show you where computer architecture can be relevant to you
as a programmer.
Our starting point is a favorite toy example of mine. Let's write a program
that counts the prime numbers between 0 and 10,000,000. We'd like a
fast program, so we'll work through a few ideas. We'll do this in Java.
If you don't know Java, just pretend it is C or C++. If you don't know
any of these, learn one as soon as possible.
- First idea: test each number for primality using a little function
class primes {
    static final int N = 10000000;
    public static void main (String args[]) {
        int i, j;
        // count the number of integers from 2..N that are prime
        for (j=0,i=2; i<=N; i++)
            if (is_prime (i)) j++;
        System.out.println (j);
    }
    // return true if and only if n is prime
    static boolean is_prime (int n) {
        int i, l;
        if (n == 2) return true;
        // trial division by up to square root of n
        l = (int) Math.sqrt (n) + 1;
        for (i=2; i<=l; i++) if (n % i == 0) return false;
        // must be prime
        return true;
    }
}
This takes 53.6 seconds on my laptop, a Pentium 4M running at 2.0GHz.
- Second idea: Modify the little function to recognize that we
only need to divide by 2 and odd numbers to check for primality.
class primes2 {
    static final int N = 10000000;
    public static void main (String args[]) {
        int i, j;
        // count the number of integers from 2..N that are prime
        for (j=0,i=2; i<=N; i++)
            if (is_prime (i)) j++;
        System.out.println (j);
    }
    // return true if and only if n is prime
    static boolean is_prime (int n) {
        int i, l;
        if (n == 2 || n == 3) return true;
        // even numbers > 2 aren't prime
        if (n % 2 == 0) return false;
        // trial division by odd numbers up to square root of n
        l = (int) Math.sqrt (n) + 1;
        for (i=3; i<=l; i+=2) if (n % i == 0) return false;
        // must be prime
        return true;
    }
}
This takes 26.7 seconds, a speedup of 2.0 over the original.
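(In general, the speedup of one version over another is just the ratio of
their running times; here, for instance,
\[
\text{speedup} = \frac{T_{\text{primes}}}{T_{\text{primes2}}}
             = \frac{53.6\ \text{s}}{26.7\ \text{s}} \approx 2.0.
\]
We'll compute all the later speedups the same way.)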
- Third idea: Modify the little function to divide only by numbers
of the form 6i +/- 1; it's easy to show that only numbers of this form
(and 2 and 3) can be prime, since every other integer is divisible by 2
or 3. Let's also consider only numbers of this form when counting.
class primes3 {
    static final int N = 10000000;
    public static void main (String args[]) {
        int i, j;
        // count the number of integers from 2..N that are prime
        j = 0;
        // count the first two primes, 2 and 3, separately
        if (N >= 2) j++;
        if (N >= 3) j++;
        // count the remaining primes, considering only numbers of the
        // form 6 * i +/- 1 (the bound and the guard keep us from
        // counting past N)
        for (i=6; i-1<=N; i+=6) {
            if (i+1 <= N && is_prime (i + 1)) j++;
            if (is_prime (i - 1)) j++;
        }
        System.out.println (j);
    }
    // return true if and only if n is prime
    static boolean is_prime (int n) {
        int i, l;
        // 2 and 3 are prime
        if (n == 2 || n == 3) return true;
        // divisible by 2 or 3? not prime.
        if (n % 2 == 0 || n % 3 == 0) return false;
        // trial division up to the square root of n,
        // by only numbers of the form 6 * i +/- 1
        l = (int) Math.sqrt (n) + 1;
        for (i=6; i<=l; i+=6) {
            if (n % (i + 1) == 0) return false;
            if (n % (i - 1) == 0) return false;
        }
        // must be prime
        return true;
    }
}
This takes 18.0 seconds, a speedup of 3.0 over the original and 1.5 over
the second one. Wow, that's a lot better, but we're about out of ideas at
this point. We could extend the idea to test only numbers of the form
30i +/- 1, 7, 11, and 13, but that's getting silly. What are the
time complexities of the first three algorithms? O(N * sqrt(N)): each of
the roughly N candidates costs up to sqrt(N) trial divisions. We've just
been improving the constant factor. What we
need is an improvement in the algorithm. Let's use Eratosthenes' sieve.
class sieve1 {
    static final int N = 10000000;
    static boolean A[];
    public static void main (String args[]) {
        int i, j, l;
        A = new boolean[N+1];
        // do a sieve of Eratosthenes
        for (i=0; i<=N; i++) A[i] = true;
        l = (int) Math.sqrt (N);
        // for each number i from 2 to square root of N...
        for (i=2; i<=l; i++)
            // ...mark off all the multiples of i
            for (j=i*i; j<=N; j+=i) A[j] = false;
        // count whatever is left; these are all the primes
        for (i=2,j=0; i<=N; i++) if (A[i]) j++;
        System.out.println (j);
    }
}
This program is very simple. It doesn't take advantage of odd numbers or any
weird properties of integers (although it could if we tweaked it). However,
it takes only 4.79 seconds: a speedup of 11.2 over the original and 3.75
over the previous program. Why is it so much faster? Because we improved the
algorithm, not just the constant factor. This algorithm has a running time
of O(N log(sqrt(N))). Why? First see that the running time is
O(sum_{i=2}^{sqrt(N)} N/i), since each i from 2 to sqrt(N) crosses off
about N/i multiples. Now factor out the N; what's left is a harmonic series
going up to sqrt(N) and missing the unit term, so it is bounded from above
by log(sqrt(N)).
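Restating that bound in symbols (nothing new here, just the same argument
in one line):
\[
\sum_{i=2}^{\sqrt{N}} \frac{N}{i}
  \;=\; N \sum_{i=2}^{\sqrt{N}} \frac{1}{i}
  \;\le\; N \ln\sqrt{N}
  \;=\; O(N \log \sqrt{N}).
\]
By now you're thinking, to heck with all this math, let's see some computer
architecture! OK, here you go: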
class sieve2 {
    static final int N = 10000000;
    static final int BLOCKSIZE = 65536;
    static boolean A[];
    public static void main (String args[]) {
        int i, j, k, l, count, base, range;
        A = new boolean[N+1];
        // do a blocked sieve of Eratosthenes
        count = 0;
        // go through the numbers from 2..N by BLOCKSIZE
        for (base=2; base<=N; base+=BLOCKSIZE) {
            for (i=0; i<BLOCKSIZE; i++) A[i] = true;
            // find all the primes in base..range-1
            // (range is an exclusive bound, so clip it to N+1)
            range = base + BLOCKSIZE;
            if (range > N+1) range = N+1;
            l = 1 + (int) Math.sqrt (range);
            for (i=2; i<=l; i++) {
                // let k be the first multiple of i at or
                // above base that isn't i itself
                k = base + (i-base%i)%i;
                if (k == i) k += i;
                // mark off all multiples of i starting at k
                for (j=k; j<range; j+=i) A[j-base] = false;
            }
            // count up what we have so far
            for (i=base; i<range; i++) if (A[i-base]) count++;
        }
        System.out.println (count);
    }
}
This takes 0.64 seconds, a speedup of 83.8 over the original and 7.5 over
the previous. This program is adapted from the original sieve, but instead
of sieving over all the numbers from 2..N at once, it does it in
blocks of 65,536. It does some extra computations, actually executing quite
a few more instructions than the original sieve. So why is it so much
faster, delivering a much better speedup than all the others? Can you think
of ways to make it even faster? Let's discuss.
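Here is one direction (a sketch of my own, under the hypothetical name
sieve3; I have not timed it on the lecture machine): find the primes up to
sqrt(N) once with a small sieve, then stride through each block only by
those primes rather than by every integer i. This also lets the array
shrink to exactly one block.
class sieve3 {
    static final int N = 10000000;
    static final int BLOCKSIZE = 65536;
    public static void main (String args[]) {
        int limit = (int) Math.sqrt (N) + 1;
        // small sieve to find the primes we will stride by
        boolean composite[] = new boolean[limit+1];
        int primes[] = new int[limit];
        int nprimes = 0;
        for (int i = 2; i <= limit; i++)
            if (!composite[i]) {
                primes[nprimes++] = i;
                for (int j = i*i; j <= limit; j += i) composite[j] = true;
            }
        // blocked sieve as before, but striding only by primes,
        // and using an array of just one block
        boolean A[] = new boolean[BLOCKSIZE];
        int count = 0;
        for (int base = 2; base <= N; base += BLOCKSIZE) {
            for (int i = 0; i < BLOCKSIZE; i++) A[i] = true;
            int range = base + BLOCKSIZE;       // exclusive bound
            if (range > N+1) range = N+1;
            for (int p = 0; p < nprimes && primes[p]*primes[p] < range; p++) {
                int i = primes[p];
                // first multiple of i at or above base that isn't i itself
                int k = base + (i - base % i) % i;
                if (k == i) k += i;
                for (int j = k; j < range; j += i) A[j-base] = false;
            }
            for (int i = base; i < range; i++) if (A[i-base]) count++;
        }
        System.out.println (count);
    }
}
This removes the redundant passes sieve2 makes for composite i values like
4, 6, 8, 9, and so on, whose multiples are already marked, while keeping
the cache-friendly block structure.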
The point is, having some knowledge of the microarchitecture that is
implementing your program can help you make it execute more efficiently.
However, you must be careful: make sure your algorithm is good before
trying to tweak it to use the microarchitecture more efficiently.
Premature optimization is the root of all evil. We could have tried
optimizing the heck out of the first prime counting function, but we would
never have been able to get it to the speed of the blocked sieve because
of the innate inefficiency of the algorithm.
CS 5513
CS 5513, Computer Architecture, is a graduate level class on computer
architecture. Let's talk about computer architecture. Here are some
questions to think about:
- What does "computer architecture" mean?
- What are some examples of computer architecture you are familiar with?
- What are some of the issues and changes in computer architecture?
- Describe the computer you first learned to program or use. How does
it compare with the computer(s) you use today?
- Is it important to make computers go faster? Why or why not?
- How can a system composed of lots of transistors do things like
arithmetic?
Administrivia
Class web page is
http://www.cs.utsa.edu/~dj/cs5513/index.html.
A syllabus is available there.
Circuits
Speaking of transistors: our focus will be on higher-level issues, but we
must always keep in mind that these ideas have some sort of physical basis.
There are three main types of components we are interested in:
- Transistors (computing).
- Wires (transporting).
- Memories (storing).
Voltage is used to represent logical values 0 and 1. For us, it doesn't
matter what the voltages are, just that there are two distinguishable
voltages that we can simply name 0 and 1. Sometimes it helps to have 0
be equal to 0 volts.
Transistors
Transistors are tiny switches. They have three terminals:
- Gate (G)
- Drain (D)
- Source (S)
When the gate terminal is triggered, current is allowed to flow from
the source to the drain. Otherwise, no current flows from the source to
the drain.
There are two types of MOS transistors used in the CMOS technology that
dominates current computer architecture:
- NMOS - triggered when 1 is applied to G.
D
_|
G --||_
|
S
NMOS transistors are good at passing 0's, but not good at passing 1's.
- PMOS - triggered when 0 is applied to G.
D
_|
G -o||_
|
S
PMOS transistors are good at passing 1's, but not 0's.
How do we compute with these devices? We'd like to be able to work
with AND, OR, NOT, etc., not with "turn on" and "bad at passing 0."
For example, consider the NOT function:
in out
-- ---
0 1
1 0
Here is an example of a NOT gate (or inverter) implemented with
transistors:
1
_|
|-o||_
___in___| |__out___
| _|
|--||_
|
0
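To make the switch behavior concrete, here is a toy switch-level model in
Java (my own sketch, not part of the lecture): the PMOS on top conducts
when in is 0, connecting out to 1, and the NMOS on the bottom conducts
when in is 1, connecting out to 0.
// Toy switch-level model of the CMOS inverter above (illustration only).
// pmosOn/nmosOn encode the trigger rules: PMOS conducts on gate=0,
// NMOS conducts on gate=1.
class inverter {
    static boolean pmosOn (boolean gate) { return !gate; }
    static boolean nmosOn (boolean gate) { return gate; }
    static boolean not (boolean in) {
        boolean pullUp = pmosOn (in);    // connects out to 1
        boolean pullDown = nmosOn (in);  // connects out to 0
        // exactly one of the two conducts, so out is well defined
        if (pullUp == pullDown)
            throw new IllegalStateException ("networks must be complementary");
        return pullUp;
    }
    public static void main (String args[]) {
        System.out.println ("0 -> " + (not (false) ? 1 : 0));
        System.out.println ("1 -> " + (not (true) ? 1 : 0));
    }
}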
How about a more complex function? The NAND function is important in
computer architecture for a number of reasons. Why? The truth table is:
a b x
- - -
0 0 1
0 1 1
1 0 1
1 1 0
How would we implement this in transistors?
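One standard answer, continuing the toy model above (again a sketch, not
the lecture's official solution): put two PMOS in parallel from 1 to the
output, and two NMOS in series from the output to 0. The parallel pull-up
conducts when either input is 0; the series pull-down conducts only when
both inputs are 1.
// Toy switch-level model of a CMOS NAND gate (illustration only).
// Two PMOS in parallel pull the output up; two NMOS in series pull it down.
class nand {
    static boolean pmosOn (boolean g) { return !g; }
    static boolean nmosOn (boolean g) { return g; }
    static boolean nand (boolean a, boolean b) {
        boolean pullUp = pmosOn (a) || pmosOn (b);    // parallel network
        boolean pullDown = nmosOn (a) && nmosOn (b);  // series network
        if (pullUp == pullDown)
            throw new IllegalStateException ("networks must be complementary");
        return pullUp;
    }
    public static void main (String args[]) {
        boolean v[] = { false, true };
        for (boolean a : v)
            for (boolean b : v)
                System.out.println ((a?1:0) + " " + (b?1:0) + " -> "
                                    + (nand (a, b) ? 1 : 0));
    }
}
Running it prints the truth table above. Notice that for every input
combination exactly one of the two networks conducts, which is what makes
CMOS gates consume (almost) no static power.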
Goals for the class
- Understand the how and why of computer architectures.
- Instruction Set Architecture
- System organization (processor, memory, I/O)
- Microarchitecture
- Learn methods of measuring and improving performance
- Metrics
- Benchmarks
- Performance methods (pipelining, ILP, prediction)
- Discover the state of the art in computer architecture
What is Computer Architecture?
- Interface design.
- Implementation. Issues related to the physical realization of the
defined ISA.
- Organization. High-level aspects of computer design, such
as the high-level design of the memory system, I/O, and CPU.
- Hardware. Low-level aspects of the implementation. Gate-level
and transistor-level design. Packaging.
Interface Design
- An interface may last through several implementations (e.g. x86 ISA).
- Instruction Set Architecture is an important interface: it defines the
boundary between the software that provides instructions and the hardware
that implements them.
- Interfaces are visible. Implementations aren't directly
visible.
- Three types of interfaces:
- Between layers, e.g. API, ISA.
- Between modules, e.g. network protocol, I/O channel or bus
- Standard representation, e.g. IEEE floating point, ASCII
Implementation
Our focus is on microarchitecture, the idea of designing the CPU
given a particular architecture.
- Implement the instruction set
- Provide the functionality necessary to carry out program
instructions.
- Exploit capabilities of technology
- Iterative process
- Generate proposed architecture
- Estimate cost
- Measure performance (through simulation)
- Current emphasis is on overcoming the sequential nature of programs.
- Deep pipelining
- Multiple issue
- Dynamic scheduling (out-of-order)
- Speculation
Application Constraints
Applications drive machine "balance."
- Numerical simulations
- Floating point performance
- Memory bandwidth
- Transaction processing
- I/O per second.
- Integer CPU performance.
- Decision support
- Embedded control
- I/O timing, deterministic behavior
- Media processing
- Low-precision "pixel" processing
- Floating point performance
Trends in Computer Architecture
- Moore's Law. The number of transistors per integrated circuit doubles
every 18 months or so. This has held since the mid-1960s. (As a sanity
check: from the 4004's 2,300 transistors in 1971 to 291 million in 2006,
the timeline below averages about one doubling every two years.)
- A corollary is that clock rates have doubled at roughly the same rate;
however, these increases are due to more than just device scaling.
- Let's look at a timeline of Intel processors for an example of the
trends in architecture and process technology:
- 1971. Intel 4004.
2300 transistors.
10 micron process.
108KHz clock rate.
4-bit processor.
- 1972. Intel 8008.
3500 transistors.
10 microns.
200KHz.
8-bit processor.
- 1974. Intel 8080.
6000 transistors.
6 microns.
2MHz.
Popular 8-bit processor, copied and refined by the Z80 and 8085.
- 1978. Intel 8086/8088.
29,000 transistors.
3 microns.
4.77-10MHz.
Appeared in original IBM PCs.
- 1982. Intel 80286.
134,000 transistors.
1.5 microns.
12.5-20MHz.
- 1985. Intel 80386DX.
275,000 transistors.
1 micron process.
33MHz.
- 1989. Intel 80486DX.
1,200,000 transistors.
0.8 microns (DX4 at 0.6 microns).
25-100MHz.
Built-in floating point, L1 cache, five-stage pipeline.
- 1993. Intel Pentium.
3,100,000 transistors.
0.5 microns.
60-133MHz.
Superscalar, dual-issue pipeline.
- 1995. Intel Pentium Pro.
5,500,000 transistors.
0.35 micron.
200MHz.
- 1997. Intel Pentium II.
7,500,000 transistors.
0.35 micron (Celeron at 0.25).
233-333MHz.
- 2000. Intel Pentium III.
28,000,000 transistors.
0.18 micron (Coppermine).
733MHz-1.2GHz.
- 2001. Intel Pentium 4.
55,000,000 transistors.
0.18, now 0.13 micron process.
up to 3.2 GHz.
- 2003. Intel Pentium M ("Banias" core).
77,000,000 transistors.
0.13 micron.
1.3GHz. (What happened?)
- 2006. Intel Core Duo.
291,000,000 transistors.
0.065 micron (65nm).
Up to 2.33GHz.
- 2007. Intel Core 2 Quad....
- What are current concerns in computer architecture?
- Biggest trend: how to effectively use multiple cores.
- Higher performance (as always).
- Memory wall. The growing gap between CPU speeds and memory speeds
is the motivation for a lot of recent research.
- Complexity. How do we manage the complexity of a device that
contains millions and soon billions of transistors?
- Verification. How can we verify that our increasingly complex
designs are correct?
- Wire delay. As gates shrink, gate delay decreases faster
than wire delay. Aggressive clocking amplifies this effect.
How do we build CPUs in which it may take several cycles for a
signal to cross the chip?
- Power. Many aspects to the power problem. How do we build
high-performance systems with reasonable energy consumption? How do
we build small (e.g. handheld) devices with constraints on energy?
- Hardware/software interface. What is the right way for the
software and hardware to communicate? As microarchitectures become
more complex, how can the compiler effectively optimize code?
- Reliability, fault-tolerance.
Metrics
How and what do we measure in computer architecture?
- Benchmarking
- Macrobenchmarks & suites. Realistic programs from the "real world" that we use to measure program behavior.
- Measure execution time.
- Characterize execution time in terms of number of events
- Number of instructions executed, cache misses, page faults, branch mispredictions, etc.
- Number of loads, stores, arithmetic, FP, etc. instructions
- Number of instructions executed per cycle on average (IPC)
- Cache hit rate, e.g., 95%, 99%
- Well-known SPEC (Standard Performance Evaluation Corporation) CPU benchmarks
- Microbenchmarks. Measure a single aspect of performance, designed
to elicit and quantify a certain behavior (see the sketch after this list).
- Traces. Replay recorded accesses, e.g. cache, branch, register.
- Simulation at many levels.
- ISA. Can measure instruction counts and relative frequencies,
basic block counts, hot-path information. Can get good estimate
of memory system behavior. A related area is emulation.
- Cycle accurate. Simulate the high-level organization
of a microarchitecture. Can measure IPC, speculation events,
more accurate memory system performance, resource utilization.
Cycle-accurate simulators provide flexibility with reasonable
degree of accuracy.
- RTL. Register Transfer Level. RTL is like a program that
specifies exactly what the microarchitecture must do. Things like
SRAM technology are still not specified, but it is much more accurate
than cycle-accurate simulation. Less flexible, more work.
- Gate-level. ANDs, ORs, etc. Very slow simulation, but very
accurate. Good estimate of fan-in, fan-out. Rough approximation
of circuit timing. Layout can be specified.
- Circuit level. Very very slow. Simulate the physics
involved in switching. Transistors are individually simulated.
Sizes and sometimes positions of transistors can be specified.
Process technology is taken into account at this level. Can be
used to test components of the microarchitecture, e.g., an array
of SRAM cells, a functional unit, etc. Useful for getting exact
timing and power measurements.
- Area and delay estimation. This gets into VLSI, but the architect must
be aware of area, delay, and power constraints in the underlying technology.
- Analysis, e.g. queueing theory. Sometimes we can analytically evaluate
a computer structure.
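As a tiny example of a microbenchmark (my own sketch, not a tool used in
the course), the following Java program estimates the time per array access
at increasing strides. Once the stride spans more than a cache line and the
working set exceeds the cache, the time per access jumps; that jump is the
single behavior this microbenchmark is designed to elicit, and it is the
same locality effect the blocked sieve exploited. The numbers it prints are
illustrative only; a real harness would warm up the JIT and repeat runs.
// Microbenchmark sketch: time per array access as a function of stride.
class stride {
    static final int SIZE = 1 << 24;      // 16M ints = 64MB, bigger than cache
    static final int ACCESSES = 1 << 22;  // same number of accesses per stride
    public static void main (String args[]) {
        int A[] = new int[SIZE];
        int sum = 0;                      // consumed below so the JIT keeps the loop
        for (int s = 1; s <= 4096; s *= 2) {
            long start = System.nanoTime ();
            // walk the array with stride s, wrapping around at the end
            for (int n = 0, i = 0; n < ACCESSES; n++, i = (i + s) & (SIZE - 1))
                sum += A[i];
            long stop = System.nanoTime ();
            System.out.println ("stride " + s + ": "
                + (stop - start) / (double) ACCESSES + " ns/access");
        }
        System.out.println (sum);
    }
}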
If I reach this point in the lecture and still have lots of time left
over, I will launch into an impromptu and very painful lecture on digital
logic design.
For next time, read Chapters 1 and 2 of the book and do Homework 1.