We need to index a few structured websites. Your job is to pull down every page, save only the HTML, and deliver the data as zip files.
We have experience building crawlers, but we would prefer that you have built your own before. Bonus points if you have existing crawlers in Python or another language that you can share with us. This isn't necessary, though; we just need the data.
Common problems will be:
- making sure your crawler doesn't get blocked (you may need to rate-limit it or use several IPs)
- verifying that you're collecting all pages and not missing any due to network errors, etc.
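Both points above come down to the same small loop: retry each page on transient failures so nothing is silently dropped, and pause between requests so the site doesn't block you. A minimal sketch of that idea follows; `fetch` is a stand-in for whatever HTTP client you use (e.g. `requests.get`), and the function names, defaults, and delays are illustrative assumptions, not part of this spec:

```python
import time


def fetch_with_retries(fetch, url, max_retries=3, delay=1.0):
    """Call fetch(url), retrying on failure with a fixed delay between
    attempts, so transient network errors don't silently lose pages."""
    last_err = None
    for _ in range(max_retries):
        try:
            return fetch(url)
        except Exception as err:  # timeout, connection reset, etc.
            last_err = err
            time.sleep(delay)
    # After max_retries failures, surface the error instead of skipping
    # the page -- a missing page should fail loudly, not disappear.
    raise RuntimeError(f"giving up on {url}") from last_err


def crawl(urls, fetch, min_interval=1.0):
    """Fetch every URL, sleeping between requests as a crude rate limit
    to reduce the chance of getting blocked. Returns {url: html}."""
    pages = {}
    for url in urls:
        pages[url] = fetch_with_retries(fetch, url)
        time.sleep(min_interval)
    return pages
```

In practice you'd likely add exponential backoff, per-domain intervals, and a persisted list of fetched vs. pending URLs so completeness can be audited after the run.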
We will verify by randomly checking the output for completeness and data integrity after delivery. Thank you.