This project involves scraping the text of 726,875 movie reviews from a variety of websites. We will provide a list of the reviews we are looking for. For each review, the list contains a URL and an identifier.
You will find or create the tools to automate visiting the URLs and scraping the relevant text from those websites. The scraped review text will need to be stored as a flat text file, with the identifier as the file name.
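To illustrate the expected output format, here is a minimal sketch of the fetch, extract, and store steps using only the Python standard library. The function names, the `.txt` extension, and the `User-Agent` string are assumptions made for illustration. The extractor keeps all visible text on the page, so a real solution would most likely need site-specific rules to isolate the review itself.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    """Return the visible text of an HTML page, one text chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Download a page and decode it with the charset declared in the HTTP headers."""
    request = Request(url, headers={"User-Agent": "review-scraper-sketch"})  # hypothetical UA
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def save_review(identifier: str, text: str, out_dir: Path) -> Path:
    """Write the review text to <out_dir>/<identifier>.txt (extension is an assumption)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{identifier}.txt"
    path.write_text(text, encoding="utf-8")
    return path
```

For one row of the list, usage would look like `save_review(identifier, extract_text(fetch_html(url)), Path("reviews"))`, where `reviews` is a placeholder output directory.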
We have previously completed a test run with a sub-sample (~10,000 reviews, yielding 7,549 files; we will provide this to the person we hire) and found the following problems that we would like you to address:
1. Paywalls: we suggest you determine which websites this concerns and what their access fees are. We will review this information and inform you of further steps.
2. Broken or missing links (up to 25% based on the test sample): the list mentioned above also contains title, excerpt, source, and author information. Please suggest how to use it to search for the desired review, for instance by using Google Search to find an alternative source for the review and scraping that instead.
3. Links that do not lead to the desired information (they land on a main page or are redirected because the content is no longer available or has moved): we are open to suggestions similar to point 2.
4. Odd encoding of the website that makes it hard to get the correct information: this results in very small files being stored. Suggestions on how to deal with this are welcome.
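If the encoding problem in point 4 turns out to be a character-encoding issue (an assumption; it could equally be caused by unusual markup or script-rendered pages), one possible approach is to try several candidate encodings in order. The sketch below is illustrative only: the function name, the candidate order, and the 4096-byte search window for the `<meta>` tag are all assumptions.

```python
import re
from typing import Optional, Tuple

# Looks for a charset declaration such as <meta charset="utf-8"> near the top of the page.
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def decode_html(raw: bytes, header_charset: Optional[str] = None) -> Tuple[str, str]:
    """Decode raw page bytes, returning (text, name of the encoding that worked)."""
    candidates = []
    if header_charset:
        candidates.append(header_charset)  # charset from the HTTP Content-Type header
    match = _META_CHARSET.search(raw[:4096])
    if match:
        candidates.append(match.group(1).decode("ascii"))  # charset declared in the HTML
    candidates += ["utf-8", "cp1252"]  # common fallbacks, chosen here as an assumption
    for name in candidates:
        try:
            return raw.decode(name), name
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or bytes invalid in this encoding
    # latin-1 maps every byte to a character, so this last resort never fails.
    return raw.decode("latin-1"), "latin-1"
```

Recording which encoding succeeded for each file would make it easier to review the cases that fell through to the last-resort decoding.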
The problems named under 1, 3, and 4 likely result in small files. In the test sample, ~1,100 files had file sizes under 1 KB.
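A simple size check could be used to flag such files for review or re-scraping. This is a minimal sketch that assumes the output files are stored in one directory with a `.txt` extension and that 1 KB means 1,024 bytes; the function name is hypothetical.

```python
from pathlib import Path
from typing import List


def find_small_files(out_dir: Path, threshold_bytes: int = 1024) -> List[str]:
    """Return the identifiers (file stems) of files smaller than the threshold."""
    return sorted(
        path.stem
        for path in out_dir.glob("*.txt")
        if path.is_file() and path.stat().st_size < threshold_bytes
    )
```

The returned identifiers can be matched back to the URL list to see which websites account for most of the suspect files.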
If you run into other problems, we expect you to inform us and work with us to create solutions where those are feasible.
The person taking this job has experience with text scraping, especially across a diverse range of web-based source material, has a proactive attitude, and is a creative problem-solver.
If this is you, we look forward to your application.