A greedy algorithm builds a solution by going one step at a time through the data, taking the best from a small number of choices each time it has an opportunity.
Consider encoding the following text (sentences are separated by two spaces):

    dead beef cafe deeded dad.  dad faced a faded cab.  dad acceded.  dad be bad.

There are 12 a's, 4 b's, 5 c's, 19 d's, 12 e's, 4 f's, 17 spaces, and 4 periods, for a total of 77 characters. If we use a fixed-length code like this:

    000  (space)
    001  a
    010  b
    011  c
    100  d
    101  e
    110  f
    111  .

Then the text, which is of length 77, consumes 77 * 3 = 231 bits. But if we use a variable-length code like this:

    100    (space)
    110    a
    11110  b
    1110   c
    0      d
    1010   e
    11111  f
    1011   .

Then we can encode the text in 12 * 3 + 4 * 5 + 5 * 4 + 19 * 1 + 12 * 4 + 4 * 5 + 17 * 3 + 4 * 4 = 230 bits. That's a savings of 1 bit. It doesn't seem like much, but it's a start. (Note that such a code must be a prefix code, where we can distinguish where one code stops and another starts; one code may not be a prefix of another code or there will be confusion.)
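The two bit counts above can be checked mechanically. The following is a minimal Python sketch; the names TEXT, FIXED, VARIABLE, and encoded_bits are illustrative, and the text is assumed to have two spaces after each of the first three periods, which is what the count of 17 spaces implies.

```python
from collections import Counter

# The example text, assuming two spaces after each of the first three
# periods (this is what yields 17 spaces and 77 characters in total).
TEXT = ("dead beef cafe deeded dad.  dad faced a faded cab.  "
        "dad acceded.  dad be bad.")

# The fixed-length and variable-length codes from the tables above.
FIXED = {" ": "000", "a": "001", "b": "010", "c": "011",
         "d": "100", "e": "101", "f": "110", ".": "111"}
VARIABLE = {" ": "100", "a": "110", "b": "11110", "c": "1110",
            "d": "0", "e": "1010", "f": "11111", ".": "1011"}


def encoded_bits(text, code):
    """Return the number of bits needed to encode text with the given code."""
    freq = Counter(text)  # character -> number of occurrences
    return sum(freq[ch] * len(code[ch]) for ch in freq)


if __name__ == "__main__":
    print(sorted(Counter(TEXT).items()))
    print("fixed-length:   ", encoded_bits(TEXT, FIXED), "bits")
    print("variable-length:", encoded_bits(TEXT, VARIABLE), "bits")
```

Counting with a dictionary keyed by character is the same idea as the counting pass of counting sort: one pass over the data, one counter per symbol.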
If the characters occur with non-uniform frequencies, then finding such a code can lead to great savings in storage space. A process with this effect is called data compression. This can be applied to any data where the frequency distribution is known or can be computed, not just sentences in languages. Examples are computer graphics, digitized sound, binary executables, etc.
A prefix code can be represented as a binary tree, where the leaves are the characters and the codes are derived by tracing a path from root to leaf, using 0 when we go left and 1 when we go right. For example, the code above would be represented by this tree:

                 @
                / \
               d   @
                 _/ \_
               _/     \_
              @         @
             / \       / \
      (space)   @     a   @
               / \       / \
              e  "."    c   @
                           / \
                          b   f

In this tree, the code for e is found by going right, left, right, left, i.e., 1010.
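To make the tree representation concrete, here is a small Python sketch that rebuilds the tree from the variable-length code table and decodes a bit string by walking from the root to a leaf over and over. The nested-dictionary representation and the names CODE, build_tree, and decode are illustrative choices, not part of the original notes.

```python
# The variable-length prefix code from the example above.
CODE = {" ": "100", "a": "110", "b": "11110", "c": "1110",
        "d": "0", "e": "1010", "f": "11111", ".": "1011"}


def build_tree(code):
    """Build a binary tree of nested dicts from a prefix code.

    Key "0" is the left child and key "1" is the right child.
    A leaf is the character itself.
    """
    root = {}
    for ch, bits in code.items():
        node = root
        for bit in bits[:-1]:
            node = node.setdefault(bit, {})  # create internal nodes on demand
        node[bits[-1]] = ch                  # the last bit leads to the leaf
    return root


def decode(bits, root):
    """Decode a bit string by repeatedly walking from the root to a leaf."""
    out = []
    node = root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached a leaf: emit it, restart at root
            out.append(node)
            node = root
    if node is not root:
        raise ValueError("bit string ends in the middle of a code word")
    return "".join(out)


if __name__ == "__main__":
    tree = build_tree(CODE)
    encoded = "".join(CODE[ch] for ch in "dead beef")
    print(encoded)
    print(decode(encoded, tree))
```

Because no code word is a prefix of another, the walk never has to look ahead: reaching a leaf always marks the end of exactly one character.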
How can we find such a code? There are many codes, but we would like to find one that is optimal with respect to the number of bits needed to represent the data. Huffman's Algorithm is a greedy algorithm that does just this.
We can label each leaf of the tree with the frequency of the letter in the text to be compressed. This quantity will be called the "value" of the leaf. The frequencies may be known beforehand from studies of the language or data, or can be computed by just counting characters the way counting sort does.
We then label each internal node recursively with the sum of the values of its children, starting at the leaves. So the tree in our example looks like this:
                 77
                /  \
               d    58
              19  _/  \_
                _/      \_
              33          25
             /  \        /  \
      (space)    16     a    13
        17      /  \   12   /  \
               e   "."     c    8
              12    4      5   / \
                              b   f
                              4   4

The root node has value 77, which is just the number of characters.
The number of bits needed to encode the data is the sum, for each character, of the number of bits in its code times its frequency. Let T be the tree, C be the set of characters c that comprise the alphabet, and f(c) be the frequency of character c. Since the number of bits in a character's code is the same as its depth in the binary tree, we can express the sum in terms of d_T(c), the depth of character c in the tree:

$$B(T) = \sum_{c \in C} f(c) \, d_T(c)$$

This is the sum we want to minimize. We'll call it the cost, B(T), of the tree. Now we just need an algorithm that will build a tree with minimal cost.
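As a check on this definition, the sketch below evaluates the cost of the example tree by recursing on depth, and also computes node values as sums of children. The tree is written as nested (left, right) tuples with characters at the leaves; that representation and the names FREQ, TREE, cost, and value are assumptions made for illustration.

```python
# Frequencies from the example text.
FREQ = {" ": 17, "a": 12, "b": 4, "c": 5, "d": 19, "e": 12, "f": 4, ".": 4}

# The example tree as nested (left, right) pairs; a leaf is one character.
TREE = ("d", ((" ", ("e", ".")), ("a", ("c", ("b", "f")))))


def cost(tree, freq, depth=0):
    """Return B(T): the sum over leaves c of freq[c] times the depth of c."""
    if isinstance(tree, str):
        return freq[tree] * depth
    left, right = tree
    return cost(left, freq, depth + 1) + cost(right, freq, depth + 1)


def value(tree, freq):
    """Return a node's value: a leaf's frequency, or the sum of its children."""
    if isinstance(tree, str):
        return freq[tree]
    left, right = tree
    return value(left, freq) + value(right, freq)


if __name__ == "__main__":
    print("B(T) =", cost(TREE, FREQ))
    print("root value =", value(TREE, FREQ))
```

For this tree the cost comes out to the 230 bits computed earlier for the variable-length code, and the root value is the character count, 77.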
In the following algorithm, f is defined as above; it can be stored efficiently in an array indexed by characters. f is extended as needed to accommodate the values of internal tree nodes. C is again the set of characters, represented as leaf tree nodes. We have a priority queue Q of tree nodes from which we can quickly extract the minimum element; this can be done with a min-heap, that is, a heap whose heap property is reversed so that the smallest value sits at the root. We build the tree in a bottom-up manner, starting with the individual characters and ending up with the root of the tree as the only element of the queue:
    Huffman (C)
        n = the size of C
        insert all the elements of C into Q,
            using the value of the node as the priority
        for i in 1..n-1 do
            z = a new tree node
            x = Extract-Minimum (Q)
            y = Extract-Minimum (Q)
            left node of z = x
            right node of z = y
            f[z] = f[x] + f[y]
            Insert (Q, z)
        end for
        return Extract-Minimum (Q) as the complete tree

At first, the queue contains all the leaf nodes as a "forest" of singleton binary trees. Then the two nodes with least value are grabbed out of the queue, joined by a new tree node, and put back on the queue. After n-1 iterations, there is only one node left in the queue: the root of the tree.
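The pseudocode translates fairly directly into Python. The sketch below is one possible rendering, not a definitive implementation: it uses the standard heapq module as the priority queue, stores each queue entry as a (value, sequence number, tree) tuple, and represents an internal node as a (left, right) pair. The sequence number is an addition of this sketch so that entries with equal values never need to compare trees; the names FREQ, huffman, and codes are likewise illustrative.

```python
import heapq
from itertools import count

# Frequencies from the example text.
FREQ = {" ": 17, "a": 12, "b": 4, "c": 5, "d": 19, "e": 12, "f": 4, ".": 4}


def huffman(freq):
    """Build a Huffman tree from a dict mapping characters to frequencies.

    Leaves are characters; an internal node is a (left, right) tuple.
    Assumes freq is non-empty.
    """
    order = count()  # tie-breaker: equal values are ordered by insertion
    queue = [(f, next(order), ch) for ch, f in freq.items()]
    heapq.heapify(queue)
    for step in range(len(freq) - 1):
        fx, _, x = heapq.heappop(queue)  # x = Extract-Minimum(Q)
        fy, _, y = heapq.heappop(queue)  # y = Extract-Minimum(Q)
        # z is the new node (x, y) with value f[x] + f[y]
        heapq.heappush(queue, (fx + fy, next(order), (x, y)))
    return queue[0][2]  # the single remaining entry holds the whole tree


def codes(tree, prefix=""):
    """Map each leaf character to its bit string: 0 for left, 1 for right."""
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    table = codes(left, prefix + "0")
    table.update(codes(right, prefix + "1"))
    return table


if __name__ == "__main__":
    table = codes(huffman(FREQ))
    for ch in sorted(table, key=lambda c: (len(table[c]), table[c])):
        print(repr(ch), table[ch])
    print("bits:", sum(FREQ[ch] * len(table[ch]) for ch in FREQ))
```

Several nodes tie in value in this example, so the tree found here may differ in shape from one built by hand with different tie-breaking. Any tree produced by the algorithm has the same minimal cost, although individual code words can differ.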
Let's go through the above example using Huffman's algorithm. Here are the contents of Q after each step through the for loop:
    (space)  a   b   c   d   e   f   .
       17    12  4   5   19  12  4   4

                                   8
                                  / \
    (space)  a   b   c   d   e   f   .
       17    12  4   5   19  12  4   4

      8                    9
     / \   (space)  a     / \    d   e
    f   .     17    12   b   c   19  12
    4   4                4   5

         17
        /  \
       /    \
      8      9
     / \    / \    d   e   (space)  a
    f   .  b   c   19  12     17    12
    4   4  4   5

                               17
                              /  \
                             /    \
          24                8      9
         /  \              / \    / \
    d   e    a   (space)  f   .  b   c
    19  12   12     17    4   4  4   5

                            34
                           /  \
                          /    \
                        17      (space)
                       /  \       17
                      /    \
      24             8      9
     /  \           / \    / \
    e    a    d    f   .  b   c
    12   12   19   4   4  4   5

                            34
                           /  \
                          /    \
                        17      (space)
         43            /  \       17
        /  \          /    \
      24    d        8      9
     /  \   19      / \    / \
    e    a         f   .  b   c
    12   12        4   4  4   5

                      77
                _____/  \_____
               /              \
             34                43
            /  \              /  \
           /    \            /    \
         17      (space)   24      d
        /  \       17     /  \     19
       /    \            e    a
      8      9           12   12
     / \    / \
    f   .  b   c
    4   4  4   5
So an optimal prefix code is:
    01    (space)
    101   a
    0010  b
    0011  c
    11    d
    100   e
    0000  f
    0001  .

And B(T) = 17 * 2 + 12 * 3 + 4 * 4 + 5 * 4 + 19 * 2 + 12 * 3 + 4 * 4 + 4 * 4 = 212 bits, a savings of about 8% over the 231 bits needed by the fixed-length code.
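The final code table can be sanity-checked in a few lines. This sketch, with the illustrative names FREQ, HUFFMAN, is_prefix_code, and cost_terms, confirms that no code word is a prefix of another and tallies B(T) one symbol at a time over all eight symbols.

```python
# Frequencies from the example text and the Huffman code derived above.
FREQ = {" ": 17, "a": 12, "b": 4, "c": 5, "d": 19, "e": 12, "f": 4, ".": 4}
HUFFMAN = {" ": "01", "a": "101", "b": "0010", "c": "0011",
           "d": "11", "e": "100", "f": "0000", ".": "0001"}


def is_prefix_code(code):
    """Return True if no code word is a prefix of another code word."""
    words = sorted(code.values())
    # In sorted order, a word that is a prefix of some other word is
    # immediately followed by a word that starts with it.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


def cost_terms(freq, code):
    """Return (character, frequency, code length, bits) for each character."""
    return [(ch, f, len(code[ch]), f * len(code[ch]))
            for ch, f in freq.items()]


if __name__ == "__main__":
    print("prefix code:", is_prefix_code(HUFFMAN))
    terms = cost_terms(FREQ, HUFFMAN)
    for ch, f, length, bits in terms:
        print(f"{ch!r}: {f} * {length} = {bits}")
    total = sum(bits for _, _, _, bits in terms)
    fixed = 3 * sum(FREQ.values())
    print(f"B(T) = {total} bits, versus {fixed} bits for the 3-bit code")
```

Listing the terms per symbol makes it easy to see that every character in the alphabet, including the period, contributes to the total.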