---------------------
CSCE 121H - Fall 2019
Project 9 - Keywords
due: Fri Nov 15, 8:00am
---------------------

The objective of this assignment is to write a program to analyze a
text document and extract keywords. This could be useful for
annotating/indexing a document collection, or for making Word Clouds.

For this project, you will need to use unordered_maps, also known as
hash tables. The central step is to create a map to store the counts
of words that appear in a document. Then one could sort it and print
out the most frequent words. The problem is that the most frequent
words are almost always uninformative, uninteresting words, like
'and', 'a', and 'the'. These are known as "stop words". One approach
would be to try to filter these out using a predefined list of common
words. Instead, we are going to use a different approach.

Suppose the document comes from a collection. For example, suppose it
is one of the 10 Books from Aristotle's Ethics. Then we can use the
frequencies of words in the whole document collection as a reference.
A word like "the" occurs frequently in Book 2, but it also occurs
frequently in all the Books. This suggests that a different method for
identifying meaningful, representative words is to compute the ratio
of the frequency of each word in a particular document to the
frequency of that word over the whole collection.

Here is a particular implementation of this idea. For a word W in
document D, compute the number of times W occurs in D, N(W,D), and
the total number of words in the document, N(D). Then do the same
thing for the whole document collection C - compute N(W,C) and N(C).
Next, compute the expected number of occurrences of W in D based on C
as follows:

    freq(W,C) = N(W,C) / N(C)
    expect(W,D) = N(D) * freq(W,C)

Finally, you can compute the "relative enrichment" of the word W in
document D compared to C as:

    enrich(W,D) = N(W,D) / expect(W,D)

You can then print out the words in the document sorted by enrichment.
However, if you try this literally, you might observe that the
keywords are not very compelling. The most enriched words are often
words that occur infrequently overall. An improvement is to add
pseudo-counts to the enrichment ratio (to both the numerator and
denominator). Here is the adjusted formula:

    enrich(W,D) = (N(W,D) + PC) / (expect(W,D) + PC)

Try setting PC to 5 and see if it makes the most enriched words more
interesting and representative (compared to PC = 0).

About your program
------------------

Suppose your program is called 'keywords'. If you call it with one
command line argument (a text document), it will read in the document,
count all the words, and print them out in sorted order. In order to
combine counts for variants of the same word, here is some of the
processing you should apply (see the sketch after the example below):

1. convert all characters to lower-case (to get rid of capitalization)
2. remove digits (e.g. including years)
3. remove punctuation: ",.?;:'!()[]$/
4. note: keep hyphens (-)

This is not perfect, but it will be good enough. It doesn't properly
handle contractions, and it does not handle variations like plurals of
nouns (e.g. 'dog' versus 'dogs') or past tense of verbs (e.g. 'say'
vs. 'said'). In real document analysis, this is called 'stemming', but
it is rather complicated.

Here is an example (from the concatenation of all 10 Books):

> keywords Aristotle_Ethics.txt
6842 the
4594 of
3936 and
3506 to
3487 is
2682 in
1976 a
1574 it
1514 that
1468 are
1462 be
1280 for
1279 as
1265 not
1137 but
1093 or
1019 which
...
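To make the counting step concrete, here is a minimal C++17 sketch of
the one-argument mode, assuming the normalization rules listed above.
The helper name 'normalize' and the overall layout are illustrative,
not a required design:

#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: lower-case a token, drop digits and the listed
// punctuation characters, and keep hyphens (rules 1-4 above).
std::string normalize(const std::string& token) {
    const std::string punct = "\",.?;:'!()[]$/";
    std::string out;
    for (char c : token) {
        if (std::isdigit(static_cast<unsigned char>(c))) continue;
        if (punct.find(c) != std::string::npos) continue;
        out += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return out;
}

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cerr << "usage: keywords <file>\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    std::unordered_map<std::string, int> counts;  // word -> N(W,D)
    std::string token;
    while (in >> token) {
        std::string w = normalize(token);
        if (!w.empty()) counts[w]++;
    }
    // Copy into a vector so the entries can be sorted, most frequent first.
    std::vector<std::pair<std::string, int>> sorted(counts.begin(), counts.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    for (const auto& [word, n] : sorted)
        std::cout << n << " " << word << "\n";
}

Run on a single file, this prints the same kind of list as the example
above (count first, then word).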
If you now want to use this to analyze a particular document, give a
second filename on the command line.

> keywords Aristotle_Ethics.txt Aristotle_Ethics_Book1.txt
WORD        num(doc)  freq(doc)  num(all)  freq(all)  expected   ratio
goods             22   0.002425        49   0.000424       3.8  3.0523
soul              23   0.002535        54   0.000467       4.2  3.0309
happy             25   0.002756        66   0.000571       5.2  2.9469
happiness         34   0.003748       119   0.001030       9.3  2.7197
blessed           12   0.001323        18   0.000156       1.4  2.6510
manifestly        11   0.001213        14   0.000121       1.1  2.6235
life              39   0.004299       161   0.001393      12.6  2.4949
excellence        19   0.002094        62   0.000536       4.9  2.4326
...

Document Collections for Testing
--------------------------------

Two document collections are provided as .zip files that can be
downloaded from the course website: Ethics by Aristotle (10 Books),
and Essays by Ralph Waldo Emerson. These were downloaded from Project
Gutenberg. They are provided in both concatenated form (1 file) and
split up into separate documents. To get a quick overview of the
contents of the individual documents, see:

https://en.wikipedia.org/wiki/Nicomachean_Ethics
https://en.wikipedia.org/wiki/Essays:_First_Series
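Returning to the enrichment formulas earlier in this handout, here is
a small sketch of how the ratio might be computed once the two count
maps have been built. The parameter names (doc_counts, all_counts,
n_doc, n_all, pc) are placeholders rather than a required interface:

#include <string>
#include <unordered_map>

// Hypothetical inputs: word counts for the document and for the whole
// collection, plus the total word counts N(D) and N(C), and the
// pseudo-count PC.
double enrichment(const std::string& w,
                  const std::unordered_map<std::string, int>& doc_counts,
                  const std::unordered_map<std::string, int>& all_counts,
                  double n_doc, double n_all, double pc = 5.0) {
    auto it_d = doc_counts.find(w);
    auto it_a = all_counts.find(w);
    double n_wd = (it_d != doc_counts.end()) ? it_d->second : 0.0;
    double n_wc = (it_a != all_counts.end()) ? it_a->second : 0.0;
    double freq_wc = n_wc / n_all;          // freq(W,C) = N(W,C) / N(C)
    double expect_wd = n_doc * freq_wc;     // expect(W,D) = N(D) * freq(W,C)
    return (n_wd + pc) / (expect_wd + pc);  // enrich with pseudo-count PC
}

As a check, working through 'goods' in the sample table above gives
expect(W,D) of about 3.85, so with PC = 5 the ratio is
(22 + 5) / (3.85 + 5), roughly 3.05, which matches the printed value
of 3.0523; this suggests the sample output was generated with PC = 5.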