The goal of this project is to improve how we scrape and parse websites. The following describes our current process.
We maintain a long list of specific websites, all news related. We currently scrape them using QueryPath and a queue-based process.
We have default QueryPath selector strings for things like title, byline, article, and omit (for things to remove from the article).
We have custom selector strings for specific websites to override the defaults.
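The default-plus-override selector scheme can be sketched as follows. This is an illustrative sketch in Python, not our actual configuration: the selector strings, site names, and the `selectors_for` helper are all hypothetical, and in production the selectors are QueryPath (PHP) strings rather than Python data.

```python
# Default selector strings, applied to any site without an override.
# All names and selectors below are illustrative assumptions.
DEFAULT_SELECTORS = {
    "title":   "h1",
    "byline":  ".byline, .author",
    "article": "article, .article-body",
    "omit":    ".ad, .related, script, style",
}

# Per-site overrides replace only the keys a site needs to change;
# every other key falls back to the default.
SITE_SELECTORS = {
    "example-news.com": {
        "title":   "header h1.headline",
        "article": "div#story-text",
    },
}

def selectors_for(site):
    """Merge a site's custom selector strings over the defaults."""
    merged = dict(DEFAULT_SELECTORS)
    merged.update(SITE_SELECTORS.get(site, {}))
    return merged
```

The key design point is that an override is partial: a site config only records the selectors that differ from the defaults, which keeps per-site maintenance small.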
We have a test page for each site to test parsing a reference page and save the QueryPath strings.
We first retrieve the full HTML for the page and do our best to find URIs for any additional pages of the same article; each site has its own URI convention for pagination.
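Because each site has its own pagination convention, the lookup for additional pages amounts to a per-site pattern table. A minimal sketch, assuming hypothetical site names and URI patterns (the real conventions vary per site):

```python
import re

# Hypothetical per-site pagination conventions: each site maps to a
# regex whose first group captures the page number in an href.
PAGE_URI_PATTERNS = {
    "example-news.com": re.compile(r"/story/[\w-]+/page/(\d+)$"),
    "other-site.com":   re.compile(r"\?page=(\d+)$"),
}

def find_page_uris(site, hrefs):
    """Return hrefs that look like later pages of the article, in page order."""
    pattern = PAGE_URI_PATTERNS.get(site)
    if pattern is None:
        return []
    matches = [(int(m.group(1)), h) for h in hrefs if (m := pattern.search(h))]
    return [h for _, h in sorted(matches)]
```

Sorting by the captured page number means the fetch order does not depend on the order links happen to appear in the HTML.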
Next we parse out the article body, title, and byline, relying on an extensive set of regular expressions to clean up the results.
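The extraction step above (select title, byline, and article; strip the "omit" elements first; then apply regex cleanup) can be sketched as follows. This uses BeautifulSoup's CSS selectors as a Python stand-in for QueryPath, and the `parse_article` helper, the sample selectors, and the "By " prefix regex are all illustrative assumptions rather than our actual code:

```python
import re
from bs4 import BeautifulSoup

def parse_article(html, selectors):
    """Extract title, byline, and article text using CSS selector strings.

    `selectors` is a dict with "title", "byline", "article", and "omit"
    keys, mirroring our default/override selector scheme.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Remove unwanted elements first, as the "omit" selector does.
    for node in soup.select(selectors["omit"]):
        node.decompose()
    title = soup.select_one(selectors["title"])
    byline = soup.select_one(selectors["byline"])
    body = soup.select_one(selectors["article"])
    # Illustrative regex cleanup step: strip a leading "By " from the byline.
    byline_text = ""
    if byline:
        byline_text = re.sub(r"^\s*[Bb]y\s+", "", byline.get_text(strip=True))
    return {
        "title": title.get_text(strip=True) if title else "",
        "byline": byline_text,
        "article": body.get_text(" ", strip=True) if body else "",
    }
```

Removing the "omit" nodes before reading the article body is the important ordering: ads and related-story boxes nested inside the article container never reach the regex cleanup stage.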