I am looking for an experienced developer to build a customised OCR system for internal use.

About the System

We receive original PDF documents (not scanned) from about 50 different providers via email. We require approximately 16 pieces of information from each PDF document; however, the location of these fields varies from provider to provider. Each provider has a unique ID, which may help identify where the data is located.

There will be validation rules, e.g. Field 1 + Field 2 must equal Field 3. The validation rules will be provided.

There must be a user interface that enables our internal users to check the data that has been OCR’d. If exceptions are present (missing data, or a validation formula that does not balance), a user must be able to enter the data manually while viewing the PDF.

There is a specific export process that must be adhered to. After OCR, the original PDF must be renamed according to a specific naming protocol. The extracted data must also be exported in a specific CSV format. Specs for both the naming protocol and the CSV format will be provided.

Key Features of the system

1. Must be automated: able to detach a PDF document from an email and then OCR it.
2. User interface must allow a user to:
   a. View the data that has been OCR’d for each PDF, with the ability to overwrite the data if needed.
   b. Manage exceptions.
   c. Export data both manually and automatically (at a set time).
3. Ability to easily add another PDF source.
4. Source code will be owned by us. The system must be documented and handed over to our own internal developers.
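To illustrate the kind of exception check described above, here is a minimal sketch in Python. It is not the actual rule set, which will be provided separately. The field names (`field_1`, `field_2`, `field_3`), the function name `find_exceptions`, and the handling of comma thousands separators are all assumptions made for this example.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical field names; the real 16 fields and their rules come from the spec.
SUM_RULE_FIELDS = ["field_1", "field_2", "field_3"]


def find_exceptions(record):
    """Return a list of exception messages for one extracted PDF record.

    An empty list means the record passed and needs no manual correction.
    """
    exceptions = []
    values = {}
    for name in SUM_RULE_FIELDS:
        raw = record.get(name)
        if raw is None or str(raw).strip() == "":
            exceptions.append(f"{name}: missing")
            continue
        try:
            # Assumption: amounts may contain comma thousands separators.
            values[name] = Decimal(str(raw).strip().replace(",", ""))
        except InvalidOperation:
            exceptions.append(f"{name}: not a number ({raw!r})")

    # Apply the example rule only when all three values were parsed.
    if len(values) == len(SUM_RULE_FIELDS):
        if values["field_1"] + values["field_2"] != values["field_3"]:
            exceptions.append("field_1 + field_2 does not equal field_3")
    return exceptions
```

Records that return a non-empty list would be routed to the exception screen for manual entry. `Decimal` is used rather than `float` so that amounts such as 100.10 + 0.20 compare exactly.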
Skills: OCR, Tesseract, .NET Framework, OCR algorithms