Lecture 14
Containers
Let's talk about containers, i.e., data structures where we want to
support the following operations:
- Insert(S, k) - inserts data with key k into S
- Search(S, k) - tests whether data with key k is in S, i.e. returns true if k is in S, false
otherwise.
- Remove(S, k) - removes data with key k from S
Let's restrict the discussion to containers that store sets, i.e.,
we can assume that all of the keys are distinct.
There are many implementations of containers. For instance, we could use
linked lists, or various kinds of trees. Today we'll see a special kind
of container called a hash table. First, let's recall a simple
data structure that acts as a container: binary search trees.
Binary Search Trees
A binary search tree is a binary tree where, for each node, every key in the
right subtree is greater than the key of the node, and every key in the left
subtree is less than the key of the node (we can ignore the case of equal
keys because we are discussing storing sets, which
have distinct items). Let's take a look at some C code for implementing
binary search trees with integer keys:
/* needs #include <stdlib.h> for malloc */
struct tree_node {
    int k;
    struct tree_node *left, *right;
};

void tree_insert (struct tree_node **r, int k) {
    if (*r == NULL) {
        *r = malloc (sizeof (struct tree_node));
        (*r)->left = NULL;
        (*r)->right = NULL;
        (*r)->k = k;
    } else if ((*r)->k < k) tree_insert (&((*r)->right), k);
    else if ((*r)->k > k) tree_insert (&((*r)->left), k);
    else ; /* duplicate key: do nothing */
}

int tree_search (struct tree_node *r, int k) {
    if (!r) return 0;
    if (r->k < k) return tree_search (r->right, k);
    if (r->k > k) return tree_search (r->left, k);
    return 1;
}
(Remove is left as an exercise for the reader.)
Analysis
What are the running times of Insert and Search? Let us
suppose, in the best case, that the sequence of insertions results in an
almost complete binary tree. Then, if there are n elements in
the tree, adding another leaf node takes an amount of time proportional
to the number of recursive calls to Insert. This is the depth
of the tree, or O(log n). In practice,
the distribution of the data will greatly affect the running time, but we
should see logarithmic behavior with uniformly randomly distributed data.
(We can look at more sophisticated trees that maintain balance, but this
is beyond the scope of the current lecture.)
The analysis of Search is similar. Search traces a path
from the root to the sought node, or to some leaf if the node isn't there.
In the worst case, this is a path from root to leaf, or O(log
n) with an almost complete binary tree.
For completeness, let us briefly mention the possible pathological behavior
of a binary search tree. Suppose a long run of mostly sorted
data is inserted into the tree; then most of the insertions will go to
the same side, and the tree will have linear instead of logarithmic height.
Further insertions and searches will then also take linear time.
However, as mentioned before, there are solutions to this problem that
are beyond the scope of the current lecture.
Logarithmic behavior is very good, and we can solve the pathological cases,
so what else could we ask from a container class?
Hash Tables
How about constant time access instead of logarithmic time? Hash tables
are a data structure that gives us just that. The idea is very simple:
compute a function of the key and stick the item in an array with that index.
The function is called a hash function because it is intended to
scramble the data so that it looks uniformly randomly distributed in the
space of possible array indices.
The Basic Idea
For a hash table, we need the following things:
- An array A[0..m-1]. m is a large integer,
much larger than the number of keys n we expect to store (e.g.
m could be 10 times n).
- A function hash(k) that returns values from 0 to m-1.
If i=hash(k), then we say that k
hashes to the value i.
To insert a key k into the hash table, we let i =
hash(k), then place k into the ith
element of A.
To search for a key k in the hash table, we again let i =
hash(k) and then look for k in the ith
element of A.
If the hash function takes constant time to compute, then clearly both
Insert and Search should take O(1) time since
all they are doing is computing a constant function and accessing a single
array element.
There's just one problem: what if two keys we insert both hash to the same
array index? Then we have a collision. There are two main methods
for dealing with collisions:
- Hashing with chaining. Each array element is actually a container
itself, e.g. a linked list, that can accommodate multiple keys.
- Hashing with open addressing. Each array element can either store
a key or be considered empty. When an insertion attempts to insert into
an array element that already contains a key, another element is chosen
according to some algorithm. A common method is linear probing,
where the index returned by the hash function is incremented repeatedly
(modulo m) until an empty element is found.
Collisions can be costly, but with a good hash function that distributes
keys uniformly across the array, their impact can be minimized.
Hashing with Chaining
The chains here refer to the containers at each array element.
Imagine sorting mail into a number of chains labeled with different
ZIP codes. Let's look at some C code for implementing a hash table of
integers with chains:
/* here is code for implementing linked lists */
/* needs #include <stdlib.h> for malloc */
struct list_node {
    int k;
    struct list_node *next;
};

void list_insert (struct list_node **l, int k) {
    struct list_node *p = malloc (sizeof (struct list_node));
    p->k = k;
    p->next = *l;
    *l = p;
}

int list_search (struct list_node *l, int k) {
    if (!l) return 0;
    if (l->k == k) return 1;
    return list_search (l->next, k);
}

/* here is code for the hash table with chaining */
struct hash_table {
    int nlists;
    struct list_node **table;
};

void hash_table_init (struct hash_table *t, int nlists) {
    int i;
    t->nlists = nlists;
    t->table = malloc (nlists * sizeof (struct list_node *));
    for (i = 0; i < nlists; i++) t->table[i] = NULL;
}

unsigned int hash (int k, int n) {
    /* work in unsigned arithmetic to avoid signed overflow */
    unsigned int u = (unsigned int) k;
    return ((u * 233u) ^ u) % (unsigned int) n;
}

void hash_table_insert (struct hash_table *t, int k) {
    unsigned int h = hash (k, t->nlists);
    list_insert (&(t->table[h]), k);
}

int hash_table_search (struct hash_table *t, int k) {
    unsigned int h = hash (k, t->nlists);
    return list_search (t->table[h], k);
}
Analysis
If we assume that m = O(n) and that the hash
function distributes keys uniformly across the space of array indices,
then each chain will have n keys divided among
Θ(n) chains, i.e. O(1) elements on average. So Insert
and Search operate on O(1) expected elements and thus take
constant expected time. What about the worst case? If every key happens
to hash to the same index, all n keys end up in a single chain, and
Search degrades to O(n), just as in a plain linked list.
Let's look at a graph of the average access time per element of a program
that inserts n numbers into a container and then searches
for each of the n inserted numbers. We'll look at binary search
trees, red-black trees, and hash tables:
Read Chapters 16 and 18.