Content Aggregator - Lucene/Solr/Elasticsearch

Web, Mobile & Software Dev Other - Software Development Posted 3 years ago


Less than 30 hrs/week
1 to 3 months


The system is required to deliver named-entity indexing and retrieval over the internet for domains containing news and blog articles. It is probably well suited to Lucene/Solr/Elasticsearch.

A list of named entities will be available to the system. This will be a dynamic and growing set, initially containing thousands of named entities but potentially growing to millions. The named entities will be organisations (companies, clubs, government departments, etc.) and will mostly be supplied with an industry classification, a locality, and a URL for their website.

The system will be required to do the following:
1. Crawl a dynamic list of domains (perpetual crawl, hourly to daily, definable by the system admin per domain) and identify new content as it comes online. The list will initially contain ~500 domains but will ultimately expand to tens of thousands. Domains are news sites and blogs. The content required is the articles themselves; surrounding material such as ads or other links is to be ignored.

2. Allow categorisation of articles by either source or tags. An article can have multiple tags. Tags can be found in different locations depending on the site (e.g. in some instances they might be in the article header files), so multiple alternative processes will have to be implemented as and when they are identified, and admins must have a facility to nominate how tags are deduced.

3. Employ named entity recognition (NER) to identify organisations. Disambiguation must be used to correctly identify an organisation (e.g. “Mercury” could refer to Mercury Marine, Mercury Communications, any number of newspapers, an element, a planet or an ancient Roman deity). Potentially, the system will later be required to employ NER on individuals or brands as well. Create and maintain an index of the named entities against articles and dates.

4. Perform a search to match articles identified by the crawler within a nominated time/date range against a list of named entities submitted from an external system. Article URLs are returned. The system must support potentially thousands of automated requests an hour from the external system.
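As a rough illustration of the disambiguation asked for in item 3, the sketch below scores each candidate organisation by how many words from its supplied industry classification and locality appear in the article text. This is only one simple heuristic, not a prescribed method; the `Entity` fields, the `disambiguate` function, the `min_score` threshold and the sample industry/locality values are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    # Hypothetical record mirroring the fields the posting says are supplied.
    entity_id: str
    name: str
    industry: str
    locality: str
    url: str


def _words(text):
    """Lower-case alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def context_score(entity, article_text):
    """Count the entity's industry/locality words that occur in the article."""
    context_terms = _words(f"{entity.industry} {entity.locality}")
    return len(context_terms & _words(article_text))


def disambiguate(mention, article_text, entities, min_score=1):
    """Return the best-matching entity for a mention, or None if no
    candidate has enough supporting context (e.g. the planet Mercury)."""
    candidates = [e for e in entities if mention.lower() in e.name.lower()]
    if not candidates:
        return None
    # On a tie, max() keeps the first candidate in list order.
    best = max(candidates, key=lambda e: context_score(e, article_text))
    if context_score(best, article_text) < min_score:
        return None
    return best


# Illustrative sample data only; the values are made up for the example.
ENTITIES = [
    Entity("e1", "Mercury Marine", "marine engines", "Wisconsin",
           "https://example.com/marine"),
    Entity("e2", "Mercury Communications", "telecommunications", "London",
           "https://example.com/comms"),
]
```

For instance, `disambiguate("Mercury", "Mercury unveiled a new outboard for marine use in Wisconsin.", ENTITIES)` would pick the `e1` record, while an article about the planet would match neither organisation and return `None`. A production system would likely need a trained NER model and richer context features than word overlap.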

The articles themselves don’t necessarily have to be saved – just the fact that a named entity appears in an article, the time and date it was crawled, and its URL.
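The minimal record described above (entity, crawl time, URL) and the time-range lookup from item 4 could be sketched as follows. SQLite is used here purely to keep the example self-contained; the posting itself suggests Lucene/Solr/Elasticsearch, and the table name `mentions`, the column names and the function names are all hypothetical. Timestamps are assumed to be timezone-aware and are stored as fixed-width UTC strings so that string comparison matches chronological order.

```python
import sqlite3
from datetime import datetime, timezone


def _iso(dt):
    """Format a timezone-aware datetime as a fixed-width UTC string."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def open_index(path=":memory:"):
    """Create (if needed) the mention table: no article bodies are stored."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS mentions (
               entity_id  TEXT NOT NULL,
               url        TEXT NOT NULL,
               crawled_at TEXT NOT NULL,
               PRIMARY KEY (entity_id, url)
           )"""
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_mentions_time "
        "ON mentions (entity_id, crawled_at)"
    )
    return conn


def record_mention(conn, entity_id, url, crawled_at):
    """Record that an entity appears in the article at `url`.
    A repeat of the same (entity, url) pair is ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO mentions VALUES (?, ?, ?)",
        (entity_id, url, _iso(crawled_at)),
    )
    conn.commit()


def find_articles(conn, entity_ids, start, end):
    """Return URLs of articles crawled in [start, end) that mention
    any of the given entities."""
    if not entity_ids:
        return []
    placeholders = ",".join("?" * len(entity_ids))
    rows = conn.execute(
        f"SELECT DISTINCT url FROM mentions "
        f"WHERE entity_id IN ({placeholders}) "
        f"AND crawled_at >= ? AND crawled_at < ? ORDER BY url",
        (*entity_ids, _iso(start), _iso(end)),
    ).fetchall()
    return [row[0] for row in rows]
```

Because SQLite limits the number of bound parameters per statement, a very long entity list would probably need to be batched or loaded into a temporary table; a search-engine backend would handle this differently.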
