In summary: I want to be able to configure Scrapy for multiple data sources via a simple website. I want Scrapy to grab a session token, spoof the IP, pull my data, and save the CSV to an S3 bucket.
I want to be able to:
1) Log in to my own secure website hosted in AWS
2) Display a simple 4-column form with column names (see attachment)
3) Set up new scrapes
4) Refresh recurring scrapes
3) in detail
For setting up new scrapes: "Get New DataSource" launches a new tab or similar (e.g., a Chrome extension?) wherein I log in to my new data source, navigate to the area that I want to scrape, specify the table, and somehow specify "Get Data". It should be able to handle simple REST URL requests as well as more difficult ones with obscured header variables.
While I'm open to variation, I'm envisioning something similar to the Pinterest Chrome extension, but for data tables within secure websites.
Once the scrape configuration is saved, it kicks off 4), the data "refresh" (a sketch of what a saved configuration might look like follows below).
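To make that concrete, here is a minimal sketch of what a saved scrape configuration and the Scrapy spider consuming it might look like. Every field name (EXAMPLE_CONFIG, table_selector, the four column names, etc.) is an assumption for illustration, not a required schema:

```python
import scrapy

# Hypothetical shape of one saved scrape configuration; field names are
# illustrative only, not a spec.
EXAMPLE_CONFIG = {
    "datasource": "example-vendor",
    "start_url": "https://vendor.example.com/reports/table",
    "headers": {"X-Api-Key": "PLACEHOLDER", "Accept": "text/html"},  # captured/obscured header variables
    "table_selector": "table#report",
    "columns": ["date", "location", "metric", "value"],
}

class ConfiguredTableSpider(scrapy.Spider):
    """Sketch: a spider driven entirely by a saved scrape configuration."""
    name = "configured_table"

    def __init__(self, config=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.config = config or EXAMPLE_CONFIG

    def start_requests(self):
        # Replay the captured headers so endpoints guarded by obscured
        # header variables still accept the request.
        yield scrapy.Request(
            self.config["start_url"],
            headers=self.config["headers"],
            callback=self.parse,
        )

    def parse(self, response):
        # Walk the configured table and emit one item per row, keyed by
        # the configured column names.
        for row in response.css(f'{self.config["table_selector"]} tr'):
            cells = row.css("td::text").getall()
            if cells:
                yield dict(zip(self.config["columns"], cells))
```

A pure REST endpoint would swap the CSS table walk for `response.json()`; the header replay is what covers the "obscured header variables" case.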
4) in detail
click "REFRESH" spawns new tab wherein user only logs in. Session token is grabbed by service. All requested data is navigated to and pulled on the back end. Note: some IP spoofing on the login or on the backend service will be required.
5) The back-end service should exist as AWS Lambda-callable code. As such, configuration variables should reside separately and be loaded per request.
6) I anticipate driving this from a Node.js service, so I'm looking for a callable interface (I know that Scrapy is natively Python).
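As a sketch of how items 5 and 6 could fit together: the Lambda handler below receives every scrape-specific variable in the invocation event (nothing baked into the deployment) and uses Scrapy's built-in S3 feed export to write the CSV. The event field names and key layout are assumptions:

```python
from scrapy.crawler import CrawlerProcess

# RefreshSpider is the spider sketched under "4) in detail" (assumed importable here).

def handler(event, context):
    """AWS Lambda entry point: every scrape-specific variable arrives in `event`."""
    config = event["config"]                  # saved scrape configuration
    session_cookie = event["session_cookie"]  # captured in the refresh login tab
    proxy = event.get("proxy")                # optional proxy address
    bucket = event["bucket"]                  # destination bucket for this user
    key = f"{event['user_id']}/{config['datasource']}.csv"

    # Scrapy's built-in S3 feed storage (requires botocore; the Lambda IAM role
    # supplies credentials) writes the CSV straight to the bucket.
    settings = {
        "FEEDS": {f"s3://{bucket}/{key}": {"format": "csv", "overwrite": True}},
        "LOG_ENABLED": False,
    }

    # Caveat: Twisted's reactor cannot be restarted, so this supports one crawl
    # per Lambda container; warm re-invocations would need a subprocess runner.
    process = CrawlerProcess(settings=settings)
    process.crawl(RefreshSpider, config=config,
                  session_cookie=session_cookie, proxy=proxy)
    process.start()

    return {"bucket": bucket, "key": key}
```

A Node.js service can then trigger this with the AWS SDK's Lambda `invoke` call, passing the same JSON event, so nothing Scrapy-specific has to be wrapped in JavaScript.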
7) Data should be saved consistently/statically to a dedicated S3 bucket (per logged-in user); an authenticated URL can be made available.
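For the authenticated URL, a small sketch using boto3's pre-signed URLs, assuming the per-user key layout from the Lambda sketch above (function name and the 1-hour expiry are illustrative):

```python
import boto3

def presigned_csv_url(bucket, user_id, scrape_name, expires_seconds=3600):
    """Sketch: return a time-limited, authenticated URL for a user's saved CSV."""
    s3 = boto3.client("s3")
    key = f"{user_id}/{scrape_name}.csv"  # mirrors the key written by the Lambda sketch
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_seconds,
    )
```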
Finally, I'm okay with pulling in Scrapy and AWS libraries. Beyond that, I want to minimize code complexity and am looking for clean, well-documented, quick code.