I have 738,049 LinkedIn profile URLs that I would like data scraped from.
Some of the profiles are public, i.e. they can be viewed without being logged in to LinkedIn. Others require you to be logged in. I only need the public profiles to be scraped. Even so, LinkedIn may restrict the number of page views from a single IP address, so a rotating proxy may be necessary.
I want the data scraped into 4 TSV (tab-separated values) files with the following columns.
Note that a given user can have attended multiple schools, worked at multiple jobs, and have multiple skills. So the user_id will not uniquely identify a row in the education.tsv, experience.tsv or skills.tsv files.
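To illustrate the one-row-per-item layout I have in mind, here is a minimal Python sketch. The `school` column and the sample values are placeholders I made up for illustration, not the real column list; only `user_id` and the file name education.tsv come from this description. The same user_id simply repeats on every row that belongs to that user.

```python
import csv
import io


def write_tsv(rows, fieldnames, out):
    """Write a list of dict rows as tab-separated values with a header line."""
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample data: user 1 attended two schools, so user_id 1
# appears on two rows. "school" is a placeholder column name.
education_rows = [
    {"user_id": 1, "school": "School A"},
    {"user_id": 1, "school": "School B"},
    {"user_id": 2, "school": "School C"},
]

# An in-memory buffer stands in for open("education.tsv", "w", newline="").
buf = io.StringIO()
write_tsv(education_rows, ["user_id", "school"], buf)
education_tsv = buf.getvalue()
print(education_tsv)
```

experience.tsv and skills.tsv would follow the same pattern, with one row per job or per skill.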
I would also like the HTML page of every scraped profile to be saved, so that it is easy to extract more data from the pages later should the need arise.
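One possible way to store the pages, sketched in Python under my own assumptions: each page is written as a gzip-compressed file named after the user_id, so a page can be matched back to its TSV rows later. The function names, the `.html.gz` naming scheme, and the single-directory layout are all hypothetical; any scheme that keeps the user_id-to-page mapping would work for me.

```python
import gzip
import tempfile
from pathlib import Path


def save_html(user_id, html, directory):
    """Store one profile page as <directory>/<user_id>.html.gz (assumed naming)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{user_id}.html.gz"
    # Write bytes so the page content is preserved exactly as fetched.
    with gzip.open(path, "wb") as f:
        f.write(html.encode("utf-8"))
    return path


def load_html(user_id, directory):
    """Read a previously saved page back as text."""
    path = Path(directory) / f"{user_id}.html.gz"
    with gzip.open(path, "rb") as f:
        return f.read().decode("utf-8")


# Round-trip demo in a temporary directory with a made-up page.
with tempfile.TemporaryDirectory() as tmp:
    page = "<html><body><h1>Example profile</h1></body></html>"
    saved_path = save_html(42, page, tmp)
    saved_name = saved_path.name
    restored = load_html(42, tmp)
```

With this many profiles, splitting the files across subdirectories (for example by the last digits of the user_id) may be more practical than one flat directory.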