We need a generic crawler that can:
1) analyze the webpage structure of the input website
2) recognize the webpages where the target information may exist
3) parse and extract the required content from these pages using NLP and ML techniques
4) store the scraped content in the corresponding fields in MongoDB
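As a rough sketch of steps 3 and 4, the snippet below uses a simple rule-based extractor as a stand-in for the NLP/ML component (the field names and label patterns are assumptions for illustration, not a prescribed design). The actual MongoDB insert is shown only as a comment, since it requires the pymongo driver and a running server:

```python
import re

def extract_product_fields(page_text):
    """Rule-based stand-in for the NLP/ML extractor: pulls labeled
    fields such as 'Name: ...' and 'Price: ...' out of page text."""
    fields = {}
    patterns = {
        "product_name": r"Name:\s*(.+)",
        "product_type": r"Type:\s*(.+)",
        "price":        r"Price:\s*\$?([\d.]+)",
        "description":  r"Description:\s*(.+)",
    }
    for field, pattern in patterns.items():
        match = re.search(pattern, page_text)
        if match:
            fields[field] = match.group(1).strip()
    return fields

# Hypothetical page text for demonstration:
page = """Name: Blue Widget
Type: Gadget
Price: $19.99
Description: A small blue widget."""

doc = extract_product_fields(page)
print(doc)

# Storing into MongoDB would then be one call (needs pymongo, not run here):
# from pymongo import MongoClient
# MongoClient()["store"]["products"].insert_one(doc)
```

In the real crawler the regex table would be replaced by the trained NLP/ML models, but the output shape (one document per product, one key per field) is the same.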
For example: suppose we need product information from an online store, and we input the URL of one store. The crawler will visit the different links in the website and gather statistics. When it finds that some webpages share a similar structure, and that this structure repeats at a high rate, it will infer that webpages of this kind may contain the product information and will check their contents. It will parse the content, and once it has recognized the fields we need, such as Product Name, Product Type, Price, and Product Description, it will extract them and store them in the corresponding fields in MongoDB.
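One way to detect the repeated structure described above is to fingerprint each page by its sequence of HTML tags and count how often each fingerprint occurs; pages sharing a high-frequency fingerprint are candidates for product pages. Below is a minimal standard-library sketch (the tag-sequence fingerprint and the sample URLs are assumptions for illustration, not the required method):

```python
from collections import Counter
from html.parser import HTMLParser

class TagSequence(HTMLParser):
    """Records the sequence of opening tags, ignoring text and attributes,
    so pages built from the same template produce the same signature."""
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def fingerprint(html):
    parser = TagSequence()
    parser.feed(html)
    return tuple(parser.tags)

# Hypothetical crawled pages (in practice fetched from the input site):
pages = {
    "/product/1": "<html><body><h1>Widget A</h1><p>$10</p></body></html>",
    "/product/2": "<html><body><h1>Widget B</h1><p>$12</p></body></html>",
    "/about":     "<html><body><div><ul><li>About us</li></ul></div></body></html>",
}

counts = Counter(fingerprint(html) for html in pages.values())
# The most common fingerprint marks the likely product-page template.
template, freq = counts.most_common(1)[0]
product_pages = [url for url, html in pages.items()
                 if fingerprint(html) == template]
print(product_pages)
```

A production version would normalize the fingerprint (e.g., collapse repeated siblings) and set a minimum repeat-rate threshold before treating a template as a product-page candidate.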
To make this crawler, the following skills are required:
• Crawler skills
• MongoDB skills
• Machine learning
• Natural language processing
We prefer Java as the programming language; Python is also acceptable.
We will share more details when we contact you.