We have two implementations of the software we need:
- the one we currently use, which has all the functionality, but its code is of poor quality, which causes problems with support and scaling;
- one we have just bought, which was not built for our task but is similar to it, and its code is much cleaner.
We need to redevelop the project on top of the second software, with the full feature set of the current one.
The project can be divided into three parts:
- import of data (photo and video) from approximately 50 sites;
- data processing (manual and automatic) - cutting thumbnails, categorizing content;
- export of aggregated data to XML for thousands of clients.
1. Multi-threaded parser.
Currently we only work with CSV dumps, but in the future we will need to implement a parser for front-end HTML pages to cover sites that don't provide export tools.
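A minimal sketch of how the CSV dumps could be parsed in parallel, using only the Python standard library. The function names (`parse_dump`, `import_all`), the column names and the worker count are assumptions for illustration, not part of either existing code base. In a real importer the threads would mostly help with downloading the dumps, which is I/O-bound.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def parse_dump(csv_text):
    """Parse one site's CSV dump (with a header row) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def import_all(dumps, max_workers=8):
    """Parse many dumps in parallel.

    `dumps` maps a (hypothetical) site name to the CSV text of its dump.
    Returns a dict mapping the same site names to lists of parsed rows.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order, so results line up with the keys.
        results = pool.map(parse_dump, dumps.values())
        return dict(zip(dumps.keys(), results))
```

An HTML parser for sites without export tools could later be plugged in as another function with the same return shape as `parse_dump`.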
2. Data processing.
- automatic mode - all thumbnails are categorized using a synonym dictionary, based on the title and categories given in the input source;
- manual mode - the user selects/crops a thumbnail and associates it with the relevant category in the UI.
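The automatic mode could look roughly like the sketch below. The dictionary contents, the `categorize` name and the simple word-matching rule are assumptions made for illustration; the real synonym dictionary and matching rules come from the current software.

```python
import re

# Hypothetical synonym dictionary: target category -> words that imply it.
SYNONYMS = {
    "cars": {"car", "cars", "auto", "automobile"},
    "nature": {"nature", "forest", "mountain", "landscape"},
}


def categorize(title, source_categories, synonyms=SYNONYMS):
    """Return the sorted categories whose synonyms occur in the title
    or in the categories supplied by the input source."""
    text = " ".join([title, *source_categories]).lower()
    words = set(re.findall(r"[a-z0-9]+", text))
    return sorted(cat for cat, syns in synonyms.items() if words & syns)
```

For example, `categorize("Mountain road trip", ["Auto"])` returns `["cars", "nature"]`. Items that match no category would be left for manual mode.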
3. XML export.
The XML file is generated according to many parameters in the client request. This part is fully working in our current software; the task will be to implement the same logic in the new one, because the underlying structure differs considerably.
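As a rough illustration of parameter-driven export, here is a sketch using the standard `xml.etree.ElementTree` module. The parameters (`category`, `limit`), the item fields and the element names are hypothetical; the actual request parameters and XML schema are defined by the current software.

```python
import xml.etree.ElementTree as ET


def export_xml(items, category=None, limit=None):
    """Build an XML string from `items`, filtered by assumed request parameters.

    Each item is a dict with the hypothetical keys
    "id", "title", "thumb" and "categories".
    """
    selected = [
        item for item in items
        if category is None or category in item["categories"]
    ]
    if limit is not None:
        selected = selected[:limit]

    root = ET.Element("items")
    for item in selected:
        node = ET.SubElement(root, "item", {"id": str(item["id"])})
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "thumb").text = item["thumb"]
    return ET.tostring(root, encoding="unicode")
```

With thousands of clients, the filtering step would normally be done by the database query rather than in Python; the sketch only shows the shape of the logic.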
As a first task on the project, I suggest automatically detecting and removing watermarks from thumbnails.
- input: a set of pictures, each with one watermark (text or logo) placed in any corner;
- output: a cropped image without the watermark, with the original aspect ratio preserved.
You can use OpenCV for this task if needed.
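To make the expected output concrete, here is a sketch of the cropping step only. It assumes the watermark has already been detected and is given as a bounding box `(x, y, w, h)` in pixels; detection itself is not shown. The function name `crop_box` is hypothetical. The idea is to anchor the crop at the corner opposite the watermark and take the largest rectangle with the original aspect ratio that does not overlap the box.

```python
def crop_box(img_w, img_h, wm):
    """Return (left, top, width, height) of the largest crop that keeps the
    image's aspect ratio and excludes the watermark box `wm` = (x, y, w, h).

    Assumes the watermark sits in one corner of the image.
    """
    x, y, w, h = wm
    # Which corner holds the watermark?
    on_right = x + w / 2 > img_w / 2
    on_bottom = y + h / 2 > img_h / 2

    # Free space beside and above/below the watermark.
    free_w = x if on_right else img_w - (x + w)
    free_h = y if on_bottom else img_h - (y + h)

    # Either clear the watermark horizontally or vertically,
    # whichever keeps more of the image (integer math, same aspect ratio).
    if free_w * img_h >= free_h * img_w:
        crop_w, crop_h = free_w, free_w * img_h // img_w
    else:
        crop_w, crop_h = free_h * img_w // img_h, free_h

    # Anchor the crop at the corner opposite the watermark.
    left = 0 if on_right else img_w - crop_w
    top = 0 if on_bottom else img_h - crop_h
    return left, top, crop_w, crop_h
```

For a 400x300 image with a watermark at `(320, 270, 80, 30)` in the bottom-right corner, this returns `(0, 0, 360, 270)`. With an image loaded as a NumPy array (as OpenCV does), the crop can then be applied by slicing: `img[top:top + crop_h, left:left + crop_w]`.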