More Arrays

Sorting

It is often useful to put a list of items into some kind of order to make searching easier. For instance, the telephone book is arranged in alphabetical order to facilitate looking up someone's number by their name. Can you imagine how hard it would be to use the phone book if it were not in order? The process of putting items into order is called sorting, and it can be done in several different ways inside a computer.

Let's look at a program that takes an unordered list of numbers from the standard input, puts them into an array, sorts the array, then prints the list in ascending numerical order. One way to write this program is to use a selection sort. This kind of sort searches the array for the minimum element, then swaps it with the #0 element. Then it searches for the minimum element from the #1 through the last element, swapping it with the #1 element, and so forth, until the whole array is in order:

#include <stdio.h>
#include <stdlib.h>

void selection_sort (float [], int);
int read_numbers (float [], int);
void swap (float [], int, int);
int find_minimum (float [], int, int);

/* main program */

int main (void) {
	int	i, num;		/* num is the count of numbers read */
	float	w[1000];

	/* get some numbers into the array */

	num = read_numbers (w, 1000);

	/* sort them */

	selection_sort (w, num);

	/* print them */

	for (i=0; i<num; i++) printf ("%f\n", w[i]);
	exit (0);
}

/* this function reads up to 'max' floats into 'v' */

int read_numbers (float v[], int max) {
	int	i;
	float	x;

	i = 0;

	/* keep going until break */

	for (;;) {

		/* get a number; stop at end of input or at anything
		 * that isn't a number (scanf returns 1 on success)
		 */

		if (scanf ("%f", &x) != 1) break;

		/* don't want to overflow array */

		if (i >= max) {
			fprintf (stderr, "too many!\n");
			exit (1);
		}

		/* one more number */

		v[i] = x;
		i++;
	}

	/* 'i' is now the number of floats read in */

	return i;
}

/* this function does a selection sort on 'v' */

void selection_sort (float v[], int n) {
	int	i;

	for (i=0; i<n; i++)
		swap (v, i, find_minimum (v, i, n));
}

/* this function returns the index of the minimum element of v
 * from 'first' up to, but not including, 'last'
 */
int find_minimum (float v[], int first, int last) {
	int	i, mini;

	/* mini tracks the lowest known element; currently the first */

	mini = first;

	/* go through all the rest looking for a lower element */

	for (i=first+1; i<last; i++) if (v[i] < v[mini]) mini = i;

	return mini;
}

/* this function exchanges the 'i'th and 'j'th elements of 'v' */

void swap (float v[], int i, int j) {
	float	t;

	t = v[i];
	v[i] = v[j];
	v[j] = t;
}
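
To see how the array changes as the sort proceeds, it can help to run one pass at a time. The following is only a sketch: selection_sort_pass and print_array are hypothetical helpers, not part of the program above. The first does the work of a single trip through the loop in selection_sort, and the second prints the array so successive passes can be compared.

```c
#include <stdio.h>

/* Hypothetical helper (not part of the program above): performs one
 * pass of selection sort on v[0..n-1].  It finds the minimum of
 * v[pass] through v[n-1] and swaps it into position 'pass'.
 */
void selection_sort_pass (float v[], int n, int pass) {
	int	i, mini;
	float	t;

	/* find the lowest element from 'pass' onward */

	mini = pass;
	for (i=pass+1; i<n; i++) if (v[i] < v[mini]) mini = i;

	/* exchange it with the element at 'pass' */

	t = v[pass];
	v[pass] = v[mini];
	v[mini] = t;
}

/* prints the array on one line */
void print_array (float v[], int n) {
	int	i;

	for (i=0; i<n; i++) printf ("%g ", v[i]);
	printf ("\n");
}
```

Starting from the array 3 1 4 2, pass 0 gives 1 3 4 2, pass 1 gives 1 2 4 3, and pass 2 gives 1 2 3 4. Each pass leaves one more element in its final place at the front of the array.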

Searching (again)

Sorting the array makes searching it much easier, both for humans and computers. The linear search we saw in an earlier class is much less efficient than the binary search that can be done on sorted data. Consider the following function that implements a linear search on an array of floats:

/* this function returns the index of an item in the array,
 * or -1 if the item isn't in the array
 */
int linear_search (float v[], int n, float target) {
	int	i;

	for (i=0; i<n; i++) if (v[i] == target) return i;
	return -1;
}

How many comparisons will be done during this search in terms of the size of the array, n? For a successful search, n/2 comparisons will be performed on average, since we expect to find a randomly located item about halfway through the array. For an unsuccessful search, all n elements must be compared. Suppose instead of floats we were searching for a name in the telephone book. There might be 500,000 names in the book, so n=500,000. Do you normally look through around 250,000 names before you find the number? No; since the book is in sorted order, you can use a more efficient search to cut out most of the search space. Similarly, by splitting the search space in two parts each time we do a comparison, we can drastically reduce the number of comparisons made in a binary search:

int binary_search (int v[], int n, int target) {
	/* assumes v[] is in ascending sorted order */
	int	first, middle, last;

	/* 'first' and 'last' keep track of the section of the
	 * array where target might be: from index 'first' up to,
	 * but not including, index 'last'
	 */
	first = 0;
	last = n;
	while (first < last) {

		/* find the middle of the section of the array
		 * between 'first' and 'last'
		 */
		middle = first + (last - first) / 2;
		if (v[middle] < target)
			/* value is in "upper" half, past middle */
			first = middle + 1;
		else if (v[middle] > target)
			/* value is in "lower" half, before middle */
			last = middle;
		else
			return middle;
	}

	/* didn't return anything?  then it must not be there. */

	return -1;
}

This looks a lot longer, but turns out to be much more efficient. The section of the array to be searched is cut in half each time through the while loop, so the number of comparisons grows with the logarithm (base 2) of n rather than with n itself. Consider the following table of the number of comparisons for linear search and binary search for different sized arrays:

Size of Array	Linear Search (average case)	Binary Search
-------------	----------------------------	-------------
16 items	8 comparisons			4 comparisons
64 items	32 comparisons			6 comparisons
256 items	128 comparisons			8 comparisons
65536 items	32768 comparisons		16 comparisons
4000000000 items	2000000000 comparisons		32 comparisons
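
The rows of the table can be reproduced with a few lines of C. This is a sketch under the assumption that the binary search column is the base-2 logarithm of the array size, rounded up; halvings and print_row are hypothetical names, and an actual binary search may make one comparison more or fewer depending on how it is written.

```c
#include <stdio.h>

/* Hypothetical helper: the number of times a section of n items can
 * be cut in half (rounding up) before only one item is left.  This
 * is the base-2 logarithm of n, rounded up.
 */
int halvings (unsigned long long n) {
	int	count;

	count = 0;
	while (n > 1) {
		n = (n + 1) / 2;
		count++;
	}
	return count;
}

/* prints one row of the table: n/2 comparisons for the average
 * successful linear search, halvings (n) for binary search
 */
void print_row (unsigned long long n) {
	printf ("%llu items\t%llu comparisons\t%d comparisons\n",
		n, n / 2, halvings (n));
}
```

Note that the size is held in an unsigned long long, since 4000000000 does not fit in a 32-bit int.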

Clearly, binary search is better than linear search, especially for large arrays. However, if the data is initially unsorted, we first have to sort it before we can use binary search. Sometimes that is worth it, sometimes it isn't. Look at selection sort above and figure out how many comparisons it takes: finding the minimum of k remaining elements takes k-1 comparisons, so the whole sort takes (n-1) + (n-2) + ... + 1 = n(n-1)/2 comparisons, or about half of n squared. There is a C library function called qsort that typically takes on the order of n log n comparisons, but that is still a lot if we are only going to perform a few searches. Modern database systems go to a lot of trouble to keep data in sorted order so that searches will be fast, without having to re-sort the whole database each time the data changes.
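
As a rough illustration of qsort, here is a minimal sketch that sorts an array of floats and then searches it. It also uses bsearch, a standard library function declared in stdlib.h alongside qsort that is not covered in these notes. The names compare_floats and sort_and_search are hypothetical, and the sketch assumes the values in the array are distinct (with duplicates, bsearch may return any one of the matching elements).

```c
#include <stdlib.h>

/* comparison function in the form qsort and bsearch expect: returns
 * a negative, zero, or positive value as *a is less than, equal to,
 * or greater than *b
 */
int compare_floats (const void *a, const void *b) {
	float	x, y;

	x = *(const float *) a;
	y = *(const float *) b;
	if (x < y) return -1;
	if (x > y) return 1;
	return 0;
}

/* Hypothetical helper: sorts v into ascending order with qsort, then
 * looks for target with bsearch.  Returns the index of target in the
 * sorted array, or -1 if it isn't there.
 */
int sort_and_search (float v[], int n, float target) {
	float	*found;

	qsort (v, n, sizeof (float), compare_floats);
	found = (float *) bsearch (&target, v, n, sizeof (float), compare_floats);
	if (found == NULL) return -1;
	return (int) (found - v);
}
```

A real program would sort once and then search many times, rather than sorting before every search as this helper does; otherwise the cost of sorting outweighs what binary search saves.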