We would like to create a mobile application that takes pictures of official Belgian vehicle documents and sends them to a server, where they are processed with OCR (Tesseract) to extract the relevant information.
We previously posted a job description for this project. However, we have refined its scope, as the previous posting was too ambitious and not precise enough about how we wanted to work. The scope has been reduced, but with more detail on how to proceed.
Before developing the complete stack, we need to check whether document recognition is feasible and what processing is required.
We have identified several steps:
1) Define the image processing and the Tesseract configuration required to extract relevant information from a scanned document.
2) Possibly a training phase for Tesseract.
3) Define the image processing needed to correct problems introduced by smartphone pictures (wide-angle lens, parallax, skewing…)
Here we will focus on the first step only. However, if the collaboration goes well and the results are good, we will continue working on the next steps with the hired person.
What is asked
The approach we would like to follow is:
- Using Tesseract’s confidence score to choose among several image-processing methods,
- Using a key-value method to extract the relevant fields. As you can see in the attached documents, each piece of information is prefixed with a key (C.1.3, P.1, …)
This means the preprocessing phase is iterative: several OCR passes are run in a loop with different image-preprocessing filters and parameters, and the best result is selected by confidence. Tesseract provides a confidence score for each recognized word.
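The loop described above could be sketched roughly as follows. This is a minimal sketch, not a final implementation: the helper names and the shape of the OCR result (a dict with a "conf" list, as returned by pytesseract's image_to_data with Output.DICT) are assumptions to be adapted.

```python
import statistics

def mean_confidence(data):
    """Average Tesseract word confidence, ignoring the -1 entries
    that Tesseract reports for non-word boxes."""
    scores = [int(float(c)) for c in data["conf"] if int(float(c)) >= 0]
    return statistics.mean(scores) if scores else 0.0

def best_ocr_result(image, variants, ocr):
    """Run one OCR pass per preprocessing variant and keep the pass
    with the highest mean word confidence.

    `variants` is a list of callables image -> preprocessed image;
    `ocr` is a callable image -> Tesseract data dict (e.g. a thin
    wrapper around pytesseract.image_to_data with Output.DICT).
    """
    best, best_score = None, -1.0
    for preprocess in variants:
        data = ocr(preprocess(image))
        score = mean_confidence(data)
        if score > best_score:
            best, best_score = data, score
    return best, best_score
```

The point of the sketch is the selection criterion: each preprocessing variant is scored by the mean per-word confidence, and only the winning pass is kept.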
You’ll have to:
- Define several image-processing pipelines,
- Write a script that preprocesses an image (as above) and returns the recognized relevant information (as a hash).
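For the key-value extraction, a possible starting point is to match OCR output lines that begin with a field key such as C.1.3 or P.1 and build the hash from them. The regular expression below is an assumption based on the key shapes visible on the samples and should be adjusted to the real documents.

```python
import re

# Matches lines of the form "C.1.3 DUPONT JEAN" or "P.1. 1998":
# an uppercase letter followed by one or two dotted numbers (the key),
# then the value. The exact key list must come from the real samples.
KEY_PATTERN = re.compile(r"^\s*([A-Z]\.\d+(?:\.\d+)?)\.?\s+(.*\S)\s*$")

def extract_fields(ocr_text):
    """Turn raw OCR text into a {key: value} hash by matching lines
    that start with a field key."""
    fields = {}
    for line in ocr_text.splitlines():
        m = KEY_PATTERN.match(line)
        if m:
            fields[m.group(1)] = m.group(2)
    return fields
```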
For information: during our first tests, we found that applying leveling to a grey-level image, with Tesseract trained for French, gave interesting results.
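The leveling mentioned above can be sketched as a simple linear contrast stretch on grey levels. The black/white cut-off points (60 and 200) are assumptions to be tuned against the real scans, and the flat pixel list could come from Pillow via Image.open(path).convert("L").getdata().

```python
def level(pixel, black=60, white=200):
    """Basic 'levels' adjustment on one grey value: everything at or
    below `black` becomes 0, everything at or above `white` becomes
    255, and values in between are stretched linearly."""
    if pixel <= black:
        return 0
    if pixel >= white:
        return 255
    return round((pixel - black) * 255 / (white - black))

def level_image(pixels, black=60, white=200):
    """Apply the leveling to a flat list of grey-level pixels."""
    return [level(p, black, white) for p in pixels]
```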
How to process
We will provide you with several scanned documents to use as inputs. As they are official documents, we will share them only after we agree to work together.
The sample images here are for information only and their quality is poor. Better, non-obfuscated pictures will be provided before work starts.
We have also identified restrictions on the value fields (alphanumeric, numeric, list of allowed values…).
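These restrictions could be encoded as per-field validators, which also gives a way to report suspect OCR output. The field names, patterns, and allowed-value lists below are hypothetical placeholders; the real ones must come from the document specification.

```python
import re

# Hypothetical per-key constraints, for illustration only.
FIELD_RULES = {
    "C.1.3": re.compile(r"^[A-Z0-9 ]+$"),   # alphanumeric
    "P.1":   re.compile(r"^\d+$"),          # numeric
}
FIELD_CHOICES = {
    "P.3": {"DIESEL", "ESSENCE", "LPG"},    # list of allowed values (assumed)
}

def validate_fields(fields):
    """Return the subset of extracted fields whose value violates its
    restriction, so recognition errors can be flagged and reported."""
    errors = {}
    for key, value in fields.items():
        rule = FIELD_RULES.get(key)
        if rule and not rule.match(value):
            errors[key] = value
        choices = FIELD_CHOICES.get(key)
        if choices and value not in choices:
            errors[key] = value
    return errors
```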
You should provide us :
- The script (our preference is Python; tell us if you prefer something else),
- The configuration of Tesseract,
- The processed images,
- Documentation on how to install and run the script, including its dependencies,
- Information about stable errors.
We hope Tesseract recognition will not show stable (systematic) errors. If it does, it will not mean the project is rejected; it will mean an additional training step is required.
In fact, we are trying to identify the specific font used on the document in order to train Tesseract for it.
We’ll validate your findings by processing the images with the provided script and running the OCR against the processed images.
Other non-functional specifications
- The OS should be Ubuntu,
- You’ll have to use the latest Tesseract version, trained for French,
- The script should be in Python,
- The only requested interface is command line.