Website scraping

Closed - This job posting has been filled and work has been completed.
Category: Web & Mobile Development > Other - Web & Mobile Development
Posted 3 years ago


Hours to be determined
Less than 1 month

Start Date: January 10, 2013


The project is to improve how we scrape and parse websites.

The following describes what we currently do.

We have a long list of specific websites, all news-related.

We currently use QueryPath and a queue process.

We have default QueryPath selector strings for things like title, byline, article, and omit (for things to remove from the article).

We have custom selector strings for specific websites to override the defaults.
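The default-plus-override arrangement described above can be sketched roughly as follows. This is only an illustration in Python, not the client's QueryPath (PHP) code: the selector values, the host name `news.example.com`, and the helper `selectors_for` are all hypothetical, since the posting does not list the real ones.

```python
# Hypothetical default selectors; the posting does not give the real values.
DEFAULT_SELECTORS = {
    "title": "h1",
    "byline": ".byline",
    "article": "div.article-body",
    "omit": ".ad, .related-links",  # elements to strip from the article
}

# Hypothetical per-site overrides, keyed by host name.
SITE_OVERRIDES = {
    "news.example.com": {"article": "#story-text", "omit": ".share-tools"},
}


def selectors_for(host):
    """Return the default selectors with any site-specific overrides applied."""
    merged = dict(DEFAULT_SELECTORS)             # copy so the defaults stay untouched
    merged.update(SITE_OVERRIDES.get(host, {}))  # site-specific values win
    return merged
```

A site with no entry in `SITE_OVERRIDES` simply gets the defaults, so only the sites that deviate need their own selector strings.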

For each site, we have a test page where we can test parsing against a reference page and save the QueryPath selector strings.

We first retrieve the full HTML for the page and do our best to find the URIs of the second and any subsequent pages of the same article. Each site has its own convention for page URIs.
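One possible way to handle per-site page URI conventions is a table of regular expressions keyed by host, applied to every link found on the first page. The sketch below uses only the Python standard library. The host names, the patterns, and the function `find_continuation_pages` are assumptions made for illustration, not the client's actual conventions, and the sketch does not check that a matching link belongs to the same article.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical pagination conventions; group 1 captures the page number.
PAGE_URI_PATTERNS = {
    "news.example.com": re.compile(r"[?&]page=(\d+)"),
    "daily.example.org": re.compile(r"/page/(\d+)/?$"),
}


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_continuation_pages(page_url, html):
    """Return absolute URIs for pages 2, 3, ... in page order, without duplicates."""
    pattern = PAGE_URI_PATTERNS.get(urlparse(page_url).hostname)
    if pattern is None:
        return []  # no known convention for this site
    collector = LinkCollector()
    collector.feed(html)
    pages = {}
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        match = pattern.search(absolute)
        if match and int(match.group(1)) > 1:
            pages[int(match.group(1))] = absolute
    return [pages[number] for number in sorted(pages)]
```

Keying the result by page number removes duplicate links (for example a "2" link and a "next" link pointing at the same page) and keeps the pages in reading order.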

Next, we parse the article, title, and byline, with the help of an extensive set of regular expressions.
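As an illustration of the kind of regular-expression cleanup this step might involve, the sketch below normalizes a raw byline string after it has been selected from the page. The patterns and the function `clean_byline` are hypothetical; the posting does not show the client's actual expressions.

```python
import re

# Hypothetical patterns for common byline noise.
_WHITESPACE = re.compile(r"\s+")
_BYLINE_PREFIX = re.compile(r"^\s*(?:written\s+)?by[:\s]+", re.IGNORECASE)
_BYLINE_SUFFIX = re.compile(
    r"\s*(?:[|,-]\s*)?(?:updated|published|posted)\b.*$", re.IGNORECASE
)


def clean_byline(raw):
    """Strip a leading 'By' and a trailing timestamp note from a raw byline."""
    text = _WHITESPACE.sub(" ", raw).strip()  # collapse runs of whitespace
    text = _BYLINE_PREFIX.sub("", text)       # drop "By" / "Written by:"
    text = _BYLINE_SUFFIX.sub("", text)       # drop "| Updated 3:15 PM" and similar
    return text.strip(" ,|-")
```

For example, `clean_byline("  By  Jane Doe | Updated 3:15 PM ")` returns `"Jane Doe"`.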
