Stated slightly more formally, any key should be equally likely to hash to any of the M locations.
In practice, we can't check that a hash function satisfies this condition, since the probability distribution on the keys is usually not known. For instance, if the hash table is implementing the symbol table in a compiler, the compiler writer (who also writes the symbol table) cannot know for sure what kind of variable names will appear in each program to be compiled.
So heuristics are used to approximate this condition: try something that seems reasonable, and run some experiments to see how it works out.
You might also want to use application-specific information. For the symbol table example, you might use information about the variable names that people often choose. For instance, it might be common for programs to have variables such as x1, x2, x3, etc. You would want the hash function not to collide on these names.
In fact, ideally a hash function should depend on all the information in the keys. As a simple example, suppose the keys are words from an English text. If you choose M = 26 (one location for each letter of the alphabet) and the hash function returns the alphabetic position of the word's first letter (minus 1, so that A maps to 0 and Z maps to 25), then all words beginning with S, of which there are MANY, would hash to the same location, while almost none would hash to the location for X.
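A tiny sketch (the word list is made up just for illustration) shows how a first-letter-only hash piles keys into a few of the 26 locations:

    # First-letter hash for English words, with M = 26 (one slot per letter).
    # The sample words below are made up purely for illustration.
    M = 26

    def first_letter_hash(word):
        # Uses only the first letter: 'a'/'A' -> 0, ..., 'z'/'Z' -> 25.
        return ord(word[0].lower()) - ord('a')

    words = ["some", "such", "several", "system", "state", "xylophone"]
    for w in words:
        print(w, "->", first_letter_hash(w))
    # Every s-word lands in location 18, while location 23 (for 'x') is
    # nearly empty: the hash ignores most of the information in each key.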
The division method often approximates the desired condition: h(k) = k mod M, where M is a prime.
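A minimal sketch of the division method, assuming string keys are first converted to integers (the conversion shown is just one simple illustrative choice that lets every character influence the result):

    M = 101  # table size: a prime

    def string_to_int(key):
        # Illustrative conversion: treat the characters as digits in base 256,
        # so all of the information in the key contributes to the integer.
        k = 0
        for ch in key:
            k = k * 256 + ord(ch)
        return k

    def h(key):
        # Division method: h(k) = k mod M, with M prime.
        return string_to_int(key) % M

    print(h("x1"), h("x2"), h("x3"))  # similar names, yet different locations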
The analogous condition for open addressing, namely that each key is equally likely to get each possible probe sequence, is even harder to achieve in practice than the earlier condition.
A good approximation is double hashing with one of these two schemes:
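Whatever the specific schemes, double hashing in general computes the i-th probe location for key k as (h1(k) + i*h2(k)) mod M, where h2(k) is never 0. A minimal sketch of that general form, with illustrative choices of h1 and h2 (these particular functions are assumptions, not necessarily the schemes meant above):

    M = 13  # table size: a prime

    def h1(k):
        # primary hash: division method
        return k % M

    def h2(k):
        # secondary hash; never 0, so with M prime every slot gets probed
        # (this particular formula is just one common illustrative choice)
        return 1 + (k % (M - 2))

    def probe_sequence(k):
        # locations probed for key k, in order, under double hashing
        return [(h1(k) + i * h2(k)) % M for i in range(M)]

    print(probe_sequence(27))
    print(probe_sequence(40))  # same first probe as 27, but a different sequence afterwards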
The load factor of a hash table with M entries and N keys in it is defined to be lambda = N/M.
Fact: The average length of each linked list is N/M = lambda.
Notice that for chaining, lambda can be smaller than, equal to, or larger than 1. We will consider the case when N might be larger than M, but not too much larger. Notice that as long as N is O(M), lambda is O(1). But if N gets larger, for instance if N = M^2, then lambda = N/M = M, which grows with N rather than staying constant.
Insert: Average time is O(1) (same as worst case), since you just compute h and then insert at the beginning of a linked list.
Unsuccessful Search: Average time is O(1 + lambda): O(1) time to compute h(k), then O(lambda) time (on the average) to scan the linked list at location h(k) until discovering that k is not in the hash table.
Successful Search: Average time is O(1 + lambda/2) (which is O(1 + lambda)): O(1) time to compute h(k). On the average, the key being sought will be in the middle of the linked list, so lambda/2 comparisons will be done until finding k.
Delete: This is essentially the same as successful search (assuming you never try to delete something that is not in the table).
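Putting the chaining operations above into code, a minimal sketch (using Python lists for the chains and the division method for h):

    class ChainedHashTable:
        # Minimal chaining sketch: each of the M slots holds a list of keys.
        def __init__(self, M):
            self.M = M
            self.table = [[] for _ in range(M)]

        def _h(self, k):
            return k % self.M  # division method

        def insert(self, k):
            # O(1): compute h(k), then put k at the front of that slot's list
            self.table[self._h(k)].insert(0, k)

        def search(self, k):
            # O(1 + lambda) on average: scan the list at slot h(k)
            return k in self.table[self._h(k)]

        def delete(self, k):
            # essentially the cost of a successful search
            chain = self.table[self._h(k)]
            if k in chain:
                chain.remove(k)

    t = ChainedHashTable(7)
    for key in [3, 10, 17, 24]:   # all of these hash to slot 3 when M = 7
        t.insert(key)
    print(t.search(17), t.search(99))  # True False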
Assume that the hash function ensures that each key is equally likely to have each permutation of {0, 1, ..., M-1} as its probe sequence.
Unsuccessful Search: O(1/(1-lambda))
Insert: Essentially same as unsuccessful search.
Successful Search: O((1/lambda)*ln(1/(1-lambda))), where ln is the natural log (base e = 2.7...).
Delete: Essentially same as successful search.
The reasoning behind these formulas requires more sophisticated probability than for chaining.
But we can do some simple sanity checks:
The time for searches should increase as the load factor increases.
For unsuccessful search: as N gets closer to M, lambda gets closer to 1, so 1-lambda gets closer to 0, so 1/(1-lambda) gets larger. At the extreme, when N = M-1, lambda = (M-1)/M, so 1/(1-lambda) = M, meaning that on average you will search essentially the entire table before discovering that the key is not there.
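Plugging a few load factors into both formulas makes the same point numerically (a quick sanity-check script):

    import math

    # Expected search cost for open addressing, under the assumption above
    # about probe sequences:
    #   unsuccessful search: 1 / (1 - lambda)
    #   successful search:   (1 / lambda) * ln(1 / (1 - lambda))
    for lam in [0.25, 0.5, 0.75, 0.9, 0.99]:
        unsucc = 1 / (1 - lam)
        succ = (1 / lam) * math.log(1 / (1 - lam))
        print(f"lambda = {lam:<5}  unsuccessful = {unsucc:6.1f}  successful = {succ:5.2f}")
    # Both costs grow as lambda approaches 1, and unsuccessful search
    # blows up much faster than successful search.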