This project involves locating interviews with particular film industry professionals (directors, producers and actors/actresses) on a defined list of websites, magazines and newspapers, scraping the text of each interview and storing it in a separate text file named according to the following convention: [personID][interview number (001-xxx)].txt. You are asked to collect the interviews available from a shortlist of sources, with at least 10 interviews per person.
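As an illustration of the naming convention, here is a minimal Python sketch. The person ID format is not specified in this brief, so the ID "D0042" used in the comments, the function names and the output directory are hypothetical; the three-digit zero padding follows the 001-xxx pattern above.

```python
from pathlib import Path


def interview_filename(person_id: str, interview_number: int) -> str:
    """Build '[personID][interview number].txt' with a 3-digit, zero-padded number."""
    if interview_number < 1:
        raise ValueError("interview numbers start at 001")
    # Hypothetical example: person_id "D0042", interview 7 -> "D0042007.txt"
    return f"{person_id}{interview_number:03d}.txt"


def save_interview(out_dir: Path, person_id: str, interview_number: int, text: str) -> Path:
    """Write one scraped interview to its own UTF-8 text file and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / interview_filename(person_id, interview_number)
    path.write_text(text, encoding="utf-8")
    return path
```

If person IDs can end in digits, the fixed three-digit width keeps the interview number recoverable as the last three characters before the extension.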
The project will consist of the following steps for each list of persons:
1. Determine the method of access to the intended data sources (websites/magazines/newspapers); we will provide a list of 10 websites.
2. Query the 10 websites for interviews with the persons on the list, check whether each interview contains evidence of the interviewee being quoted (quotation marks in combination with prose, the person's name in combination with a verb indicative of speech) and scrape the interview if it meets this criterion.
3. Supplement where necessary with the top hits of a Google search (person's name + "interview"), determine which of these come from sources not included in the list used for Step 1, and run Step 2 on the additional sources found until a sufficient number of interviews per person is reached.
4. Extract from the interview text the parts that are quotes or speech, i.e. what the person you queried for actually said, remove all other parts, and store the result in a 'cleaned' text file.
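To give an idea of what the quotation check in Step 2 and the extraction in Step 4 could look like, here is a rough Python sketch. It rests on several assumptions that are ours for illustration only: the list of speech verbs is a small, non-exhaustive sample, a page qualifies only when it contains both a quoted passage and a name-plus-verb attribution, and a quote is credited to the person when the attribution appears in the same paragraph. Interviews in Q&A format would need a different approach.

```python
import re

# Assumed, non-exhaustive sample of verbs indicative of speech.
SPEECH_VERBS = r"(?:said|says|told|tells|explained|explains|added|adds|recalled|recalls)"

# Text between straight or curly double quotes.
QUOTE_RE = re.compile(r'["\u201c]([^"\u201c\u201d]+)["\u201d]')

# Shorter quoted spans (titles, scare quotes) are ignored; the threshold is arbitrary.
MIN_QUOTE_CHARS = 10


def attribution_re(surname: str) -> re.Pattern:
    """Match 'Surname said' or 'said Surname' for the assumed verb list."""
    name = re.escape(surname)
    return re.compile(rf"\b{name}\s+{SPEECH_VERBS}\b|\b{SPEECH_VERBS}\s+{name}\b")


def find_quotes(text: str) -> list:
    """Return all quoted passages of at least MIN_QUOTE_CHARS characters."""
    quotes = (q.strip() for q in QUOTE_RE.findall(text))
    return [q for q in quotes if len(q) >= MIN_QUOTE_CHARS]


def is_quoted(text: str, surname: str) -> bool:
    """Step 2 check: quoted prose plus a name + speech-verb attribution."""
    return bool(find_quotes(text)) and bool(attribution_re(surname).search(text))


def extract_quotes(text: str, surname: str) -> list:
    """Step 4 sketch: keep quotes from paragraphs that attribute speech to the person."""
    attributed = attribution_re(surname)
    kept = []
    for paragraph in re.split(r"\n\s*\n", text):
        if attributed.search(paragraph):
            kept.extend(find_quotes(paragraph))
    return kept
```

The extracted quotes could then be joined with newlines and written to the 'cleaned' text file. Matching whole quote pairs first and filtering by length afterwards avoids pairing the closing mark of one short quote with the opening mark of the next.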
The lists of persons are:
1. Directors: 380 (overlaps with producers; the total for both is 722)
2. Producers: 605 (overlaps with directors; the total for both is 722)
3. Actors/actresses: 713 (will overlap to some degree with lists 1 and 2; we have not examined this yet)
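For a sense of scale, the figures above can be combined as follows. This is plain arithmetic on the numbers in this brief; the only assumption is that the minimum of 10 interviews applies to every unique person, and the variable names are ours.

```python
directors = 380
producers = 605
directors_or_producers = 722  # unique persons across lists 1 and 2
actors = 713
min_interviews_per_person = 10

# Persons who appear on both the directors list and the producers list.
both_roles = directors + producers - directors_or_producers

# Minimum number of interviews for lists 1 and 2 combined.
min_interviews_lists_1_2 = directors_or_producers * min_interviews_per_person

# Upper bound on unique persons, reached only if no actor is also on list 1 or 2.
max_unique_persons = directors_or_producers + actors
max_min_interviews = max_unique_persons * min_interviews_per_person

print(both_roles)                # 263
print(min_interviews_lists_1_2)  # 7220
print(max_unique_persons)        # 1435
print(max_min_interviews)        # 14350
```

The actual total will be lower to the extent that the actors list overlaps with the other two.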
The person taking this job has experience with web crawlers and text scraping, can work with a wide range of source material, has a proactive attitude and is a creative problem-solver.
If this is you, we look forward to your application.