System required to deliver named entity indexing and retrieval capability over the internet against domains containing news and blog articles. Likely well suited to Lucene/Solr/Elasticsearch.
A list of named entities will be available to the system. This will be a dynamic and growing set, initially containing thousands of named entities but potentially growing to millions. The named entities will be organisations (companies, clubs, government departments, etc.) and will, in the main, be supplied with an industry classification, a locality, and a URL for their website.
The system will be required to do the following:
1. Crawl a dynamic list of domains (perpetual crawl, hourly to daily, with the interval definable by the system admin per domain) and identify new content as it comes online. The list will initially contain ~500 domains but will ultimately expand to tens of thousands. (Domains are news sites and blogs. The content required is articles; surrounding material such as ads or other links is to be ignored.)
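The per-domain, admin-definable recrawl interval described above could be tracked with a simple scheduler. This is a minimal sketch only; the class and method names (`CrawlScheduler`, `due`, etc.) are illustrative, not part of the requirement:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional

@dataclass
class DomainSchedule:
    """Admin-defined recrawl interval (hourly to daily) for one domain."""
    domain: str
    interval: timedelta
    last_crawled: Optional[datetime] = None

class CrawlScheduler:
    """Tracks a dynamic list of domains and reports which are due for a recrawl."""

    def __init__(self) -> None:
        self._domains: Dict[str, DomainSchedule] = {}

    def add_domain(self, domain: str, interval_hours: float) -> None:
        """Add or update a domain; the interval is set by the system admin."""
        self._domains[domain] = DomainSchedule(domain, timedelta(hours=interval_hours))

    def due(self, now: datetime) -> List[str]:
        """Domains that have never been crawled or whose interval has elapsed."""
        return [d.domain for d in self._domains.values()
                if d.last_crawled is None or now - d.last_crawled >= d.interval]

    def mark_crawled(self, domain: str, when: datetime) -> None:
        self._domains[domain].last_crawled = when
```

A worker loop would poll `due()` and fetch each listed domain; new domains can be added or intervals changed at any time, which covers the "dynamic list" requirement.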
2. Allow categorisation of articles by either source or tags. An article can have multiple tags. Tags can be found in different locations depending on the site (e.g. in some instances they might be in the article header), so multiple alternative extraction processes will have to be implemented as and when they are identified, and admins must have the facility to nominate how tags are deduced.
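The "admin nominates how tags are deduced" requirement suggests a registry of pluggable per-site extraction strategies. A minimal sketch, assuming hypothetical function names and a regex-based strategy for tags held in a `<meta name="keywords">` header tag:

```python
import re
from typing import Callable, Dict, List

# A tag extractor takes raw article HTML and returns the tags it found.
TagExtractor = Callable[[str], List[str]]

# Registry mapping a site to the extraction strategy an admin nominated for it.
_extractors: Dict[str, TagExtractor] = {}

def register_extractor(site: str, extractor: TagExtractor) -> None:
    """Admin facility: nominate how tags are deduced for a given site."""
    _extractors[site] = extractor

def extract_tags(site: str, html: str) -> List[str]:
    """Run the site's nominated strategy; no strategy means no tags."""
    extractor = _extractors.get(site)
    return extractor(html) if extractor else []

def meta_keywords_extractor(html: str) -> List[str]:
    """Example strategy: comma-separated tags in a <meta name="keywords"> tag."""
    m = re.search(r'<meta name="keywords" content="([^"]*)"', html)
    return [t.strip() for t in m.group(1).split(",")] if m else []
```

New strategies (e.g. tags in JSON-LD blocks or in a sidebar) can then be added as they are identified, without touching the crawler itself.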
3. Employ named entity recognition (NER) to identify organisations. Disambiguation must be used to correctly identify an organisation (e.g. “Mercury” could refer to Mercury Marine, Mercury Communications, any number of newspapers, an element, a planet, or an ancient Roman deity). Potentially the system will later be required to employ NER on individuals or brands as well. Create and maintain an index of the named entities against articles and dates.
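One simple way to approach the disambiguation requirement is to score each candidate organisation by how strongly the article's text overlaps its known context (the supplied industry classification and locality could seed these keyword sets). This is a deliberately naive sketch; the candidate records and keywords are hypothetical, and a production system would use a proper entity-linking model:

```python
from typing import Dict, Optional, Set

# Hypothetical candidate organisations sharing the surface form "Mercury",
# each with context keywords derived from its industry classification.
CANDIDATES: Dict[str, Set[str]] = {
    "Mercury Marine": {"boat", "engine", "marine", "outboard"},
    "Mercury Communications": {"telecom", "network", "phone", "broadband"},
}

def disambiguate(mention: str, article_text: str,
                 candidates: Dict[str, Set[str]]) -> Optional[str]:
    """Pick the candidate whose context keywords best overlap the article text.

    Returns None when no candidate keyword appears, i.e. the mention may be
    a newspaper, the element, the planet, or the deity rather than any
    organisation on the list.
    """
    words = set(article_text.lower().split())
    best, best_score = None, 0
    for name, keywords in candidates.items():
        score = len(keywords & words)
        if score > best_score:
            best, best_score = name, score
    return best
```

The same scoring hook could later be extended to individuals or brands, as the requirement anticipates.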
4. Perform a search to match articles identified by the crawler within a nominated time/date range against a list of named entities submitted from an external system. Article URLs are returned. The system must support potentially thousands of automated requests per hour from the external system.
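The query shape in point 4 (entity list in, URLs within a date range out) can be sketched as an entity-to-postings index. In production this would be a Lucene/Solr/Elasticsearch index as suggested above; the in-memory class and names here are illustrative only:

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Set, Tuple

class EntityArticleIndex:
    """Minimal in-memory stand-in for the search index:
    named entity -> list of (crawl date/time, article URL) postings,
    which is exactly what point 4 (and the storage note below it) requires."""

    def __init__(self) -> None:
        self._postings: Dict[str, List[Tuple[datetime, str]]] = defaultdict(list)

    def add(self, entity: str, crawled: datetime, url: str) -> None:
        """Record that an entity appeared in the article at this URL."""
        self._postings[entity].append((crawled, url))

    def search(self, entities: List[str],
               start: datetime, end: datetime) -> Set[str]:
        """URLs of articles mentioning any submitted entity in the date range."""
        return {url
                for e in entities
                for crawled, url in self._postings.get(e, [])
                if start <= crawled <= end}
```

The external system would submit its entity list and range to an API fronting this index; only the entity, crawl date/time, and URL are stored, matching the note that article bodies need not be saved.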
The articles themselves don’t necessarily have to be saved – just the fact that a named entity appears in an article, the time and date it was crawled, and its URL.