Business Problem - We need to know what city, county, and state authorities are saying on specific parts of their websites, and we don’t want to check each site by hand, one by one, which would be tedious.
- We don’t need to crawl the whole web; we only need to crawl a specific set of sites
- We need to monitor those pages for updates (web change detection)
- We need to pull in posts/articles that match specific terms (think Boolean search, including proximity operators, quoted phrases, and parentheses)
- Some of the files will be PDFs
- Some of the website structures can be convoluted
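One possible way to approach the web change requirement is to store a fingerprint of each page's text and compare it on the next crawl. The sketch below is only an illustration under assumptions: the function names `fingerprint` and `has_changed` are hypothetical, and it assumes the page (or PDF) has already been reduced to plain text by some earlier step.

```python
import hashlib
import re
from typing import Optional


def fingerprint(page_text: str) -> str:
    """Return a SHA-256 hex digest of the text with whitespace normalized.

    Normalizing whitespace avoids flagging a change when only the page
    layout or indentation differs (an assumption about what counts as a change).
    """
    normalized = re.sub(r"\s+", " ", page_text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: Optional[str], page_text: str) -> bool:
    """True when the page is new or its normalized text differs from the last crawl."""
    return previous_fingerprint != fingerprint(page_text)
```

A crawl would keep the last fingerprint per URL and only reprocess a page when `has_changed` returns true. Pages with rotating banners or timestamps would likely need the noisy parts stripped before hashing.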
1. Provide an Admin Interface where we can:
   1. Enter Boolean logic to search for a set of terms, such as “Project Approved” OR ((“KB Homes” OR “KB Urban” NEAR “Proposal Submitted”) AND (Dallas OR Frisco OR Arlington))
   2. Enter a set of URLs so that only those are searched
   3. Label the type of crawl (Company, City, Event, …)
2. Label the crawling logic so we can have a simple name for the crawl job
3. Store the retrieved information in a database so we can access it later for various purposes
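To make the Boolean and proximity requirement in item 1 more concrete, here is a minimal sketch of how such a query might be evaluated against a page's text. Everything in it is an assumption for illustration: the helper names (`term`, `near`, `any_of`, `all_of`, `matches`) are hypothetical, NEAR is taken to mean "within 10 words", and NEAR is assumed to bind tighter than OR. A real admin interface would also need a parser that turns the typed query into this structure.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_positions(tokens, phrase):
    """Return every token index where the phrase starts."""
    words = tokenize(phrase)
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]


def term(phrase):
    """Predicate: the exact phrase (or single word) occurs in the text."""
    return lambda tokens: bool(phrase_positions(tokens, phrase))


def near(left, right, max_gap=10):
    """Predicate: both phrases occur with starts at most max_gap tokens apart."""
    def predicate(tokens):
        left_hits = phrase_positions(tokens, left)
        right_hits = phrase_positions(tokens, right)
        return any(abs(a - b) <= max_gap for a in left_hits for b in right_hits)
    return predicate


def any_of(*predicates):
    """Boolean OR over predicates."""
    return lambda tokens: any(p(tokens) for p in predicates)


def all_of(*predicates):
    """Boolean AND over predicates."""
    return lambda tokens: all(p(tokens) for p in predicates)


# The example query from item 1, under the precedence assumption stated above.
QUERY = any_of(
    term("Project Approved"),
    all_of(
        any_of(term("KB Homes"), near("KB Urban", "Proposal Submitted")),
        any_of(term("Dallas"), term("Frisco"), term("Arlington")),
    ),
)


def matches(text):
    """True when the text satisfies the example query."""
    return QUERY(tokenize(text))
```

For example, `matches("KB Urban filed; proposal submitted in Frisco")` is true, while `matches("KB Homes opened an office in Austin")` is false because none of the listed cities appears.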
We would like the data stored in a relational database. If you can also help set up the appropriate deployment on AWS, that would be super.
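As a starting point for discussion, the relational storage could look roughly like the sketch below. It is an assumption, not a required design: the table and column names are hypothetical, and SQLite is used only so the example runs anywhere. A hosted deployment would presumably use a managed relational engine instead, with the same general shape.

```python
import sqlite3

# Hypothetical schema: one row per crawl job, its allowed URLs, and fetched documents.
SCHEMA = """
CREATE TABLE crawl_job (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,          -- simple name for the crawl job
    crawl_type TEXT NOT NULL,           -- e.g. Company, City, Event
    query TEXT NOT NULL                 -- Boolean search expression as entered
);
CREATE TABLE crawl_url (
    id INTEGER PRIMARY KEY,
    job_id INTEGER NOT NULL REFERENCES crawl_job(id),
    url TEXT NOT NULL,
    UNIQUE (job_id, url)
);
CREATE TABLE document (
    id INTEGER PRIMARY KEY,
    url_id INTEGER NOT NULL REFERENCES crawl_url(id),
    fetched_at TEXT NOT NULL,           -- ISO 8601 timestamp
    content_hash TEXT NOT NULL,         -- used to detect page changes
    content_type TEXT NOT NULL,         -- e.g. text/html, application/pdf
    body TEXT NOT NULL                  -- extracted text that matched the query
);
"""


def create_store(path=":memory:"):
    """Open a database at the given path and create the tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_job(conn, name, crawl_type, query, urls):
    """Insert a crawl job with its URL list and return the new job id."""
    cursor = conn.execute(
        "INSERT INTO crawl_job (name, crawl_type, query) VALUES (?, ?, ?)",
        (name, crawl_type, query),
    )
    job_id = cursor.lastrowid
    conn.executemany(
        "INSERT INTO crawl_url (job_id, url) VALUES (?, ?)",
        [(job_id, url) for url in urls],
    )
    conn.commit()
    return job_id
```

Keeping the URL list in its own table makes it straightforward to restrict a job to only the sites entered in the admin interface, and the `content_hash` column gives change monitoring something to compare against.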