This script, or set of scripts, is meant to be run by me only, as the sole end user.
I. Web scraper
A. scrape data from site X
1. Download images
2. Save url for reference
3. Use object-oriented model
a. main objects are page, search result, result item, result detail (profile), result detail items (profile elements)
4. Magic numbers for URL creation
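A minimal sketch of how part I could look, kept deliberately plain. The base URL, the parameter names (offset, limit), the page size, and the class and function names are all hypothetical placeholders, since the real values depend on site X. The "magic numbers" are given names at the top so they are easy to find and change.

```python
# Sketch only: the URL, parameter names, and page size are assumptions.
try:
    from urllib import urlencode          # Python 2.7
except ImportError:
    from urllib.parse import urlencode    # Python 3

BASE_URL = "https://example.com/search"   # placeholder for site X
RESULTS_PER_PAGE = 20                     # assumed page size


class ResultItem(object):
    """One entry in a search result, pointing at a profile page."""

    def __init__(self, profile_url, image_urls=None):
        self.profile_url = profile_url        # saved for reference
        self.image_urls = image_urls or []    # images to download later


def build_search_url(page_number):
    """Build the URL for a 1-based page of search results."""
    offset = (page_number - 1) * RESULTS_PER_PAGE
    params = [("offset", offset), ("limit", RESULTS_PER_PAGE)]
    return BASE_URL + "?" + urlencode(params)
```

For example, build_search_url(3) returns "https://example.com/search?offset=40&limit=20". The other objects (page, search result, result detail) could follow the same flat pattern as ResultItem, without inheritance between them.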
II. Database storer
A. Store objects in database with low-redundancy schema
1. Image blobs, likely in their own table
2. A hash to detect whether an online object has changed
3. merging and deduplication rules
a. for example, always add images, but merge text changes
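A minimal sketch of part II using only the standard library (sqlite3 and hashlib). The table names, column names, and function names are hypothetical, and "merge" is simplified here to overwriting the stored text when its hash differs; a real merge rule may need to keep history. Images live in their own table and are always added, with identical bytes for the same profile stored only once.

```python
import hashlib
import sqlite3


def content_hash(text):
    """Return a SHA-256 hex digest of unicode profile text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def open_db(path=":memory:"):
    """Open the database and create the two tables if missing."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS profile ("
                 "url TEXT PRIMARY KEY, body TEXT, body_hash TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS image ("
                 "profile_url TEXT, image_hash TEXT, data BLOB, "
                 "PRIMARY KEY (profile_url, image_hash))")
    return conn


def upsert_profile(conn, url, body):
    """Store a profile; return 'new', 'changed', or 'unchanged'."""
    new_hash = content_hash(body)
    row = conn.execute("SELECT body_hash FROM profile WHERE url = ?",
                       (url,)).fetchone()
    if row is None:
        conn.execute("INSERT INTO profile VALUES (?, ?, ?)",
                     (url, body, new_hash))
        status = "new"
    elif row[0] != new_hash:
        # Simplified merge: the newer text replaces the stored text.
        conn.execute("UPDATE profile SET body = ?, body_hash = ? "
                     "WHERE url = ?", (body, new_hash, url))
        status = "changed"
    else:
        status = "unchanged"
    conn.commit()
    return status


def add_image(conn, profile_url, data):
    """Always add images; exact duplicates for a profile are ignored."""
    image_hash = hashlib.sha256(data).hexdigest()
    conn.execute("INSERT OR IGNORE INTO image VALUES (?, ?, ?)",
                 (profile_url, image_hash, sqlite3.Binary(data)))
    conn.commit()
```

Calling upsert_profile twice with the same text reports "unchanged" the second time, so the scraper can skip work for profiles that have not changed online.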
III. CRON job or Task-scheduler
A. Ability to delay scraping
B. Ability to define total scrape time
C. Ability to define total scrape objects
D. Database cleanup and maintenance jobs (e.g. merge and dedup async with scrape)
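The limits in III.A through III.C could be sketched as one plain loop, which cron or Task Scheduler then starts on a schedule. The function name run_scrape, its parameters, and the default values are hypothetical; fetch stands for whatever function downloads and parses one URL.

```python
import time


def run_scrape(urls, fetch, delay_seconds=2.0, max_seconds=600,
               max_objects=100):
    """Call fetch(url) for each URL until a time or object limit is hit."""
    started = time.time()
    results = []
    for url in urls:
        if len(results) >= max_objects:
            break                       # total scrape objects reached
        if time.time() - started >= max_seconds:
            break                       # total scrape time reached
        results.append(fetch(url))
        time.sleep(delay_seconds)       # delay between requests
    return results
```

The time limit is checked before each request, so a single slow request can still run past max_seconds. Cleanup jobs such as merge and dedup could be separate scripts on their own schedule, so they do not block the scrape.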
IV. (Advanced) Integrate user login, and store user metrics
A. Metrics on user interaction with other users, and other metadata
1. Analytics/Machine Learning on user interaction with other users
B. Sentiment analysis (e.g. Vader sentiment Python module) for profiles
C. POST data back to website
1. individual POST
2. Bulk POST e.g. scheduled through CRON
3. Smart POST e.g. BULK POST where posted message body depends on profile content
As written, the scope has four major parts (I-IV), each with minor parts within. This is the full scope. It's OK to quote on only the beginning, middle, and/or end pieces.
- Must be in Python 2.7!
- The goal is also to teach me how to understand, write, and implement these things for future maintenance. Ideally, development is done via live session (e.g. GoToWebinar). Therefore, the initial phase is fixed cost, to find the right person (skill to code, and skill to teach/explain). After that, it's hourly to take the project through the advanced steps.
- Keeping the project simple is best. Fewer modules, and modules familiar to me, are preferred, such as BeautifulSoup, requests, pandas. Also, a simple design that is just to the point: less decoration, fewer wrappers, fewer double-underscore overrides, and a less complex object structure (inheritance).