Extract popular phrases from a text corpus

Closed - This job posting has been filled and work has been completed.
Web, Mobile & Software Dev Scripts & Utilities Posted 2 years ago


Hours to be determined
Less than 1 week

Expert Level

I am willing to pay higher rates for the most experienced freelancers


I need you to perform analysis on natural language text corpora.
You will be given a text corpus of ~600K English sentences (with typos, slang, etc.). Your task is to create a script that:

1) extracts popular phrases (with frequencies) using various methods
(you can use the ones from here http://www.quora.com/What-is-the-best-way-to-analyze-a-corpus-of-text-to-determine-the-most-popular-phrases and here http://www.quora.com/Whats-the-best-way-to-extract-phrases-from-a-corpus-of-text-using-Python or suggest your own);
2) has a control panel with parameters for the extraction (n-gram length, stop-word lists, handling of adjectives, etc.);
3) since the dataset contains a lot of spelling errors and slang, has a control for working with substitution dictionaries (e.g. we provide a dictionary of popular typos, and the extraction is then rerun with that dictionary applied).
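
To make the three requirements above concrete, here is a minimal sketch in Python of one possible approach: plain n-gram frequency counting. It is not a method prescribed by this posting. The names ExtractionConfig, tokenize and extract_phrases, the default stop-word list, and the sample sentences are all hypothetical and used only for illustration. The sketch assumes each substitution maps a single token to a single token, and it does not cover adjective handling, which would need a part-of-speech tagger.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ExtractionConfig:
    """Hypothetical 'control panel' settings; every name here is illustrative."""
    min_n: int = 2          # shortest n-gram length to count
    max_n: int = 3          # longest n-gram length to count
    stop_words: frozenset = frozenset({"the", "a", "an", "is", "to", "of", "and"})
    substitutions: dict = field(default_factory=dict)  # typo/slang -> canonical token
    min_count: int = 2      # drop phrases seen fewer times than this


def tokenize(sentence, substitutions):
    """Lowercase, split into word tokens, then apply the substitution dictionary."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    return [substitutions.get(tok, tok) for tok in tokens]


def extract_phrases(sentences, config):
    """Return (phrase, frequency) pairs, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence, config.substitutions)
        for n in range(config.min_n, config.max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Skip n-grams that start or end with a stop word.
                if gram[0] in config.stop_words or gram[-1] in config.stop_words:
                    continue
                counts[" ".join(gram)] += 1
    return [(p, c) for p, c in counts.most_common() if c >= config.min_count]


# Made-up sample data, not taken from the attached corpus.
sample = [
    "The battery life is gr8",
    "battery life is great on this fone",
    "great battery life, cheap phone",
    "battery lyfe could be better",
]
typos = {"gr8": "great", "fone": "phone", "lyfe": "life"}

print(extract_phrases(sample, ExtractionConfig(substitutions=typos)))
# prints [('battery life', 4), ('life is great', 2)]
```

Rerunning with an empty substitution dictionary lowers the count of "battery life" to 3, because "lyfe" is no longer merged into "life". That is the kind of before/after comparison requirement 3 describes.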

A sample of the text data is attached.

If you have a ready-made solution (your own or someone else's), that is totally fine.

About the Client

(4.98) 101 reviews

United States
Sf 02:08 AM

128 Jobs Posted
71% Hire Rate, 1 Open Job

Over $50,000 Total Spent
143 Hires, 1 Active

$19.37/hr Avg Hourly Rate Paid
4,543 Hours

Member Since Sep 3, 2013