Lecture 14

Containers

Let's talk about containers, i.e., data structures that support the following operations:

Insert(k): add the key k to the container.
Search(k): report whether the key k is in the container.
Remove(k): delete the key k from the container.

Let's restrict the discussion to containers that store sets, i.e., we can assume that all of the keys are distinct.

There are many implementations of containers. For instance, we could use linked lists, or various kinds of trees. Today we'll see a special kind of container called a hash table. First, let's recall a simple data structure that acts as a container: binary search trees.

Binary Search Trees

A binary search tree is a binary tree where, for each node, every key in the right subtree is greater than the node's key, and every key in the left subtree is less than the node's key (we can ignore equal keys because we are storing sets, whose elements are distinct). Let's take a look at some C code for implementing binary search trees with integer keys:
#include <stdlib.h>	/* for malloc */

struct tree_node {
	int	k;
	struct tree_node *left, *right;
};

void tree_insert (struct tree_node **r, int k) {
	if (*r == NULL) {
		*r = malloc (sizeof (struct tree_node));
		(*r)->left = NULL;
		(*r)->right = NULL;
		(*r)->k = k;
	} else if ((*r)->k < k) tree_insert (&((*r)->right), k);
	else if ((*r)->k > k) tree_insert (&((*r)->left), k);
	else ; /* duplicate key: nothing to do, we store sets */
}

int tree_search (struct tree_node *r, int k) {
	if (!r) return 0;
	if (r->k < k) return tree_search (r->right, k);
	if (r->k > k) return tree_search (r->left, k);
	return 1;
}
(Remove is left as an exercise for the reader.)
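The recursion in tree_insert and tree_search is tail recursion, so both can also be written as plain loops. Here is one possible iterative sketch (the _iter names are mine, not part of the lecture's code):

```c
#include <stdlib.h>

struct tree_node {
	int	k;
	struct tree_node *left, *right;
};

void tree_insert_iter (struct tree_node **r, int k) {
	/* walk down, keeping a pointer to the child link we may rewrite */
	while (*r != NULL) {
		if ((*r)->k < k) r = &((*r)->right);
		else if ((*r)->k > k) r = &((*r)->left);
		else return;	/* duplicate key: nothing to do */
	}
	*r = malloc (sizeof (struct tree_node));
	(*r)->left = NULL;
	(*r)->right = NULL;
	(*r)->k = k;
}

int tree_search_iter (struct tree_node *r, int k) {
	while (r != NULL) {
		if (r->k < k) r = r->right;
		else if (r->k > k) r = r->left;
		else return 1;
	}
	return 0;
}
```

The logic mirrors the recursive version, but uses a constant amount of stack space regardless of the tree's height.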

Analysis

What are the running times of Insert and Search? Let us suppose, in the best case, that the sequence of insertions results in an almost complete binary tree. Then, if there are n elements in the tree, adding another leaf node takes an amount of time proportional to the number of recursive calls to Insert. This is the length of a root-to-leaf path in the tree, or O(log n). In practice, the distribution of the data will greatly affect the running time, but we should see logarithmic behavior with uniformly randomly distributed data. (We can look at more sophisticated trees that maintain balance, but this is beyond the scope of the current lecture.)

The analysis of Search is similar. Search traces a path from the root to the sought node, or to some leaf if the node isn't there. In the worst case, this is a path from root to leaf, or O(log n) with an almost complete binary tree.

For completeness, let us briefly mention the possible pathological behaviors of a binary search tree. Suppose a long run of mostly sorted data is inserted into the tree; then most of the insertions will descend along one side, and the tree will have linear instead of logarithmic height. Further insertions and searches will then also exhibit linear behavior. However, as mentioned before, there are solutions to this problem that are beyond the scope of the current lecture.
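To make the degenerate case concrete, here is a small sketch (the helper names insert, height, and sorted_run_height are mine, not from the lecture) that inserts a sorted run of keys and measures the resulting tree height:

```c
#include <stdlib.h>

struct tree_node {
	int	k;
	struct tree_node *left, *right;
};

static void insert (struct tree_node **r, int k) {
	/* simplified insert; assumes all keys are distinct */
	while (*r != NULL)
		r = (k > (*r)->k) ? &((*r)->right) : &((*r)->left);
	*r = calloc (1, sizeof (struct tree_node));
	(*r)->k = k;
}

static int height (struct tree_node *r) {
	int	hl, hr;

	if (r == NULL) return 0;
	hl = height (r->left);
	hr = height (r->right);
	return 1 + (hl > hr ? hl : hr);
}

/* insert 0, 1, ..., n-1 in sorted order and return the tree's height */
int sorted_run_height (int n) {
	struct tree_node *root = NULL;
	int	i;

	for (i=0; i<n; i++) insert (&root, i);
	return height (root);
}
```

Inserting 0 through n-1 in order produces a tree of height exactly n, a single path of right children; a balanced tree on the same keys would have height about log2 n.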

Logarithmic behavior is very good, and we can solve the pathological cases, so what else could we ask from a container class?

Hash Tables

How about constant time access instead of logarithmic time? Hash tables are a data structure that gives us just that. The idea is very simple: compute a function of the key and stick the item in an array with that index. The function is called a hash function because it is intended to scramble the data so that it looks uniformly randomly distributed in the space of possible array indices.

The Basic Idea

For a hash table, we need the following things: an array A with some number m of slots, and a hash function hash that maps keys into the index range {0, 1, ..., m-1}. To insert a key k into the hash table, we let i = hash(k), then place k into the ith element of A.

To search for a key k in the hash table, we again let i = hash(k) and then look for k in the ith element of A.

If the hash function takes constant time to compute, then clearly both Insert and Search should take O(1) time, since all they are doing is evaluating a constant-time function and accessing a single array element.

There's just one problem: what if two keys we insert both hash to the same array index? Then we have a collision. There are two main methods for dealing with collisions: chaining, where each array element holds a small container (such as a linked list) of all the keys that hash to that index; and open addressing, where a colliding key is stored in some other, systematically chosen slot of the array itself.

Collisions can be costly, but with a good hash function that distributes keys uniformly across the array, their impact can be minimized.
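Chaining is developed in the next section. For contrast, here is a minimal sketch of the other approach, open addressing with linear probing: on a collision we simply scan forward to the next free slot. The names here are mine, deletion and resizing are ignored, and INT_MIN is assumed never to occur as a key, so it can serve as the "empty slot" sentinel.

```c
#include <limits.h>
#include <stdlib.h>

#define EMPTY INT_MIN	/* sentinel: we assume INT_MIN never occurs as a key */

struct probe_table {
	int	nslots;
	int	*slots;
};

void probe_init (struct probe_table *t, int nslots) {
	int	i;

	t->nslots = nslots;
	t->slots = malloc (nslots * sizeof (int));
	for (i=0; i<nslots; i++) t->slots[i] = EMPTY;
}

unsigned int probe_hash (int k, int n) {
	unsigned int u = (unsigned int) k;
	return ((u * 233u) ^ u) % (unsigned int) n;
}

/* scan forward from the hash slot until we find k or an empty slot;
   assumes the table never becomes completely full */
void probe_insert (struct probe_table *t, int k) {
	unsigned int i = probe_hash (k, t->nslots);
	while (t->slots[i] != EMPTY) {
		if (t->slots[i] == k) return;	/* already present */
		i = (i + 1) % (unsigned int) t->nslots;
	}
	t->slots[i] = k;
}

int probe_search (struct probe_table *t, int k) {
	unsigned int i = probe_hash (k, t->nslots);
	while (t->slots[i] != EMPTY) {
		if (t->slots[i] == k) return 1;
		i = (i + 1) % (unsigned int) t->nslots;
	}
	return 0;
}
```

Note that a search only stops at an empty slot, so keys displaced by earlier collisions are still found; this is also why naive deletion (just emptying a slot) would break later searches.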

Hashing with Chaining

The chains here refer to the containers at each array element. Imagine sorting mail into a number of chains labeled with different ZIP codes. Let's look at some C code for implementing a hash table of integers with chains:
/* here is code for implementing linked lists */
struct list_node {
	int	k;
	struct list_node *next;
};

void list_insert (struct list_node **l, int k) {
	struct list_node *p = malloc (sizeof (struct list_node));
	p->k = k;
	p->next = *l;
	*l = p;
}

int list_search (struct list_node *l, int k) {
	if (!l) return 0;
	if (l->k == k) return 1;
	return list_search (l->next, k);
}

/* here is code for the hash table with chaining */
struct hash_table {
	int	nlists;
	struct list_node **table;
};

void hash_table_init (struct hash_table *t, int nlists) {
	int	i;

	t->nlists = nlists;
	t->table = malloc (nlists * sizeof (struct list_node *));
	for (i=0; i<nlists; i++) t->table[i] = NULL;
}

unsigned int hash (int k, int n) {
	unsigned int u = (unsigned int) k;	/* unsigned arithmetic: signed overflow is undefined */
	return ((u * 233u) ^ u) % (unsigned int) n;
}

void hash_table_insert (struct hash_table *t, int k) {
	unsigned int h = hash (k, t->nlists);
	list_insert (&(t->table[h]), k);
}

int hash_table_search (struct hash_table *t, int k) {
	unsigned int h = hash (k, t->nlists);
	return list_search (t->table[h], k);
}
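To connect this code with the analysis below, here is a small helper, max_chain_length (the name is mine), that reports the length of the longest chain. The lecture's definitions are repeated, with unsigned arithmetic in the hash, so the snippet compiles on its own:

```c
#include <stdlib.h>

struct list_node {
	int	k;
	struct list_node *next;
};

void list_insert (struct list_node **l, int k) {
	struct list_node *p = malloc (sizeof (struct list_node));
	p->k = k;
	p->next = *l;
	*l = p;
}

int list_search (struct list_node *l, int k) {
	for (; l != NULL; l = l->next)
		if (l->k == k) return 1;
	return 0;
}

struct hash_table {
	int	nlists;
	struct list_node **table;
};

void hash_table_init (struct hash_table *t, int nlists) {
	int	i;

	t->nlists = nlists;
	t->table = malloc (nlists * sizeof (struct list_node *));
	for (i=0; i<nlists; i++) t->table[i] = NULL;
}

unsigned int hash (int k, int n) {
	unsigned int u = (unsigned int) k;
	return ((u * 233u) ^ u) % (unsigned int) n;
}

void hash_table_insert (struct hash_table *t, int k) {
	list_insert (&(t->table[hash (k, t->nlists)]), k);
}

int hash_table_search (struct hash_table *t, int k) {
	return list_search (t->table[hash (k, t->nlists)], k);
}

/* length of the longest chain: with a good hash and about as many
   chains as keys, this should stay small */
int max_chain_length (struct hash_table *t) {
	int	i, max = 0;

	for (i=0; i<t->nlists; i++) {
		int len = 0;
		struct list_node *p;
		for (p = t->table[i]; p != NULL; p = p->next) len++;
		if (len > max) max = len;
	}
	return max;
}
```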

Analysis

If we assume that the number of chains m is proportional to n and that the hash function distributes keys uniformly across the array indices, then each chain will have n keys divided among m = O(n) chains, i.e., O(1) elements on average. So Insert and Search operate on O(1) elements and thus take constant time. What about the worst case? If every key hashes to the same index, a single chain holds all n keys, and Search degrades to the linked list's O(n) behavior (Insert, which just prepends to a chain, remains constant time).

Let's look at a graph of the average access time per element of a program that inserts n numbers into a container and then performs n searches on each of the inserted numbers. We'll look at binary search trees, red-black trees, and hash tables:

[Graph: average access time per element vs. number of elements n, for binary search trees, red-black trees, and hash tables]

Read Chapters 16 and 18.