I have created a Scrapy script to crawl a website, but it runs very slowly, so I need it reviewed/modified to run faster. I am not sure whether the issue is the script, the proxy, the server, or something else; that is what I need diagnosed and fixed.
The target site has ~400 starting URLs. Each starting URL's page lists ~120 URLs plus a "next" page link, and I cap the "next" calls at 600 total URLs. I am currently only getting ~20 pages/minute. I also had an issue where the script crashes if I set all 400 start URLs at once, so it now runs 1 or 2 start URLs and then, at the end, calls scrapoxyd to run the next start URL. I also check URLs against a remote database so the same URL is not crawled multiple times (like DeltaFetch). I sometimes run 2-3 of the spiders at the same time, but they all hit the one target site, just with different start URLs.
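For reference, ~20 pages/minute usually points to conservative concurrency or download-delay settings rather than raw network speed. As a starting point for the review, these are the Scrapy settings I would expect to see tuned; the values below are illustrative assumptions, not the project's current configuration:

```python
# Illustrative Scrapy throughput settings; exact values must be tuned
# against what the target site and the Scrapoxy pool will tolerate.
SETTINGS = {
    "CONCURRENT_REQUESTS": 32,             # Scrapy default is 16
    "CONCURRENT_REQUESTS_PER_DOMAIN": 16,  # default is 8; all traffic is one site
    "DOWNLOAD_DELAY": 0,                   # any fixed delay hard-caps throughput
    "AUTOTHROTTLE_ENABLED": True,          # back off only when the server slows
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 8.0,
    "RETRY_TIMES": 2,                      # fewer retries through flaky proxies
}
```

Whoever takes the project should confirm what these are set to now before changing anything else.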
I need someone to review what I have done, tell me where the slowness comes from, and explain how to fix it. I run the scripts on a DigitalOcean droplet. I have a Scrapoxy proxy server running on AWS nano instances, and the scripts call a remote database both to store the data and to duplicate DeltaFetch's behavior. The database calls go through SQLAlchemy. I am not using DeltaFetch itself because I had the database crash a few times and am not sure why.
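One likely slowness culprit in this setup is doing a database round-trip for every candidate URL. A minimal sketch of the DeltaFetch-style check via SQLAlchemy, loading the seen set once at spider start instead of querying per URL (the `seen_urls` table and column names are assumptions, not the project's actual schema):

```python
# Sketch of a deltafetch-style duplicate check through SQLAlchemy.
# An in-memory SQLite database stands in for the remote database here;
# the pattern is identical against a real connection URL.
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE seen_urls (url TEXT PRIMARY KEY)"))
    conn.execute(text("INSERT INTO seen_urls VALUES ('https://example.com/a')"))

def load_seen(engine):
    """One query up front instead of one query per candidate URL."""
    with engine.connect() as conn:
        rows = conn.execute(text("SELECT url FROM seen_urls"))
        return {row[0] for row in rows}

seen = load_seen(engine)

def should_crawl(url):
    # Membership test is now in-process; no network latency per URL.
    return url not in seen
```

If the current script issues a SELECT per URL, replacing it with a pattern like this removes that latency from the hot path; the reviewer should verify against the real schema.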
I need all 400 start URLs and their sub-URLs crawled in less than a day, so I need to know all the changes required to achieve that.
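To make "under a day" concrete, the worst-case arithmetic from the figures above (reading the 600-URL cap as per start URL) works out as follows:

```python
# Back-of-envelope throughput target from the numbers in this brief.
start_urls = 400
pages_per_start = 600                        # stated cap on URLs per start URL
total_pages = start_urls * pages_per_start   # worst case: 240,000 pages

minutes_per_day = 24 * 60
required_rate = total_pages / minutes_per_day  # pages/minute needed

current_rate = 20                            # observed throughput
speedup_needed = required_rate / current_rate
```

So the run needs roughly 167 pages/minute, about an 8x speedup over the current ~20/minute; splitting across 2-3 concurrent spiders, as described above, only covers part of that gap.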
The deliverable is a fixed script and/or clear directions to fix the issue, plus confirmation that the changes do indeed get the entire run under a day. This is a fixed-price project; I will provide the script and access to the proxies and the remote database. You will need to run this on your own server, or on a new droplet I can spin up, until the timing is as needed, because I am still running the script on our droplet.
Please ask for any clarification now so we are on the same page and can both be successful.
March 12, 2018
I am looking for a mix of experience and value