We are a research team working on an experimental lab project, which may be brought to the real world in a couple of years.
1) We want to download, cache, and archive a large number of pages, preserving the sitemap graph, the origin of each page, and its metadata, with resources and attachments retained.
2) Pages should be well organized, archived, and indexed, with unwanted pages filtered out; we may develop our own algorithm for this.
3) The system should be easily programmable and integrate with many of our in-house tools for data-intensive processing.
4) It should be easy to monitor, troubleshoot, and manage in terms of capacity planning, with pages backed up to additional storage.
5) It should potentially integrate with Hadoop, MapReduce, and Bigtable, and support clustering and network storage.
6) It should potentially integrate with additional third-party software such as SQL Server, NoSQL stores (e.g. MongoDB), Drupal, and so forth.
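As a rough illustration of requirement 1 (preserving the sitemap graph, origin, and metadata for each archived page), the sketch below shows one possible shape for an archive record. It is a minimal, stdlib-only Python example, not a proposed implementation: the `archive_page` function, the record fields, and the example URLs are all hypothetical names chosen for illustration.

```python
import hashlib
import json
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects outgoing hyperlinks so the sitemap graph can be rebuilt later."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def archive_page(url, html, origin=None):
    """Build an archive record: content hash, origin page, and outgoing edges.

    `origin` is the page that linked here; the list of resolved `outlinks`
    forms the edges of the sitemap graph. (All field names are illustrative.)
    """
    parser = LinkExtractor()
    parser.feed(html)
    return {
        "url": url,
        "origin": origin,
        "sha256": hashlib.sha256(html.encode("utf-8")).hexdigest(),
        "outlinks": [urljoin(url, href) for href in parser.links],
    }

# Example: one page with a single relative link, archived as a JSON record.
record = archive_page("https://example.org/a", '<a href="/b">next</a>')
print(json.dumps(record, indent=2))
```

Records like this could then be indexed, filtered by a custom algorithm (requirement 2), or exported to Hadoop/NoSQL stores (requirements 5 and 6), since they are plain JSON.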
Skills: research, troubleshooting, management