I need you to perform analysis on natural language text corpora.
You will be given a text corpus of ~600K sentences in English (with typos, slang, etc.). Your task is to create a script which:
1) extracts popular phrases (with frequencies) using various methods:
(you can use the ones from here http://www.quora.com/What-is-the-best-way-to-analyze-a-corpus-of-text-to-determine-the-most-popular-phrases and here http://www.quora.com/Whats-the-best-way-to-extract-phrases-from-a-corpus-of-text-using-Python or suggest your own)
2) has a control panel with parameters for the extraction (length of the n-grams, stop-word lists, handling of adjectives, etc.)
3) since the dataset contains a lot of spelling errors and slang, there should also be a control for working with substitution dictionaries (e.g. we provide a dictionary of popular typos and then run the extraction again with that dictionary applied)
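
To make points 1) to 3) more concrete, here is a rough sketch of the kind of behaviour we have in mind. It is only an illustration, not a required design: the function names (`tokenize`, `extract_phrases`), the parameter names, and the rule of skipping n-grams that start or end with a stop word are all assumptions made for this example, and it covers only plain n-gram counting, not the other methods from the links above.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, digits, or apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(sentence, substitutions=None):
    """Lowercase a sentence, split it into tokens, and apply the
    substitution dictionary (typo/slang -> canonical form) per token."""
    substitutions = substitutions or {}
    tokens = []
    for tok in TOKEN_RE.findall(sentence.lower()):
        # A replacement may contain several words, e.g. "gonna" -> "going to".
        tokens.extend(substitutions.get(tok, tok).split())
    return tokens


def extract_phrases(sentences, n_min=2, n_max=3, stopwords=frozenset(),
                    substitutions=None, top_k=10):
    """Count n-grams of length n_min..n_max and return the top_k
    (phrase, frequency) pairs."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence, substitutions)
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Assumed heuristic: drop phrases that begin or end
                # with a stop word.
                if gram[0] in stopwords or gram[-1] in stopwords:
                    continue
                counts[" ".join(gram)] += 1
    return counts.most_common(top_k)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the attached corpus.
    sample = ["I luv this phone", "luv this phone so much", "I love this phone"]
    typos = {"luv": "love"}
    print(extract_phrases(sample, n_min=2, n_max=2, stopwords={"i", "so"}))
    print(extract_phrases(sample, n_min=2, n_max=2, stopwords={"i", "so"},
                          substitutions=typos))
```

In this sketch the "control panel" is just the set of keyword arguments; running the extraction a second time with `substitutions` set merges counts for variants such as "luv this" and "love this" into one phrase.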
A sample of the text data is attached.
If you have a ready-made solution (your own or someone else's), that is totally fine.