I am looking for a solution that will analyze a JPEG image file, extract its text, and store that text in a database in unstructured form. The next step would be to analyze the unstructured text and pull out the key pieces of information to build a structured database with specific fields.
The idea is to take the daily image files published by the county, which contain records such as deeds, debts, and liens, extract all of the text from each document, and then add selected pieces of information to a database.
For example, let's say there is an image file that contains a notice of a mortgage for a property. This image would include details such as the "folio number" (the county's unique identifier for the property), the mortgage amount, and possibly the terms of the mortgage. I would like the solution to extract that data and load it into a database table so that I can link it with another table about the property.
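To make the target concrete, here is a rough sketch of the second step, assuming the OCR text for one document is already available as a string. Everything specific in it is hypothetical and only for illustration: the sample text, the field labels ("Folio Number:", "Mortgage Amount:", "Term:"), the regex patterns, and the table and column names. Real county documents will likely need different patterns. It uses SQLite only to keep the example self-contained.

```python
import re
import sqlite3
from decimal import Decimal

# Hypothetical OCR output; real county documents will be laid out differently.
SAMPLE_TEXT = """NOTICE OF MORTGAGE
Folio Number: 01-2345-678-9010
Mortgage Amount: $250,000.00
Term: 30 years
"""

# Assumed label patterns, chosen for illustration only.
FOLIO_RE = re.compile(r"Folio\s+Number:\s*([\d-]+)", re.IGNORECASE)
AMOUNT_RE = re.compile(r"Mortgage\s+Amount:\s*\$?([\d,]+(?:\.\d{2})?)", re.IGNORECASE)
TERM_RE = re.compile(r"Term:\s*(.+)", re.IGNORECASE)


def extract_fields(text):
    """Pull the key fields out of unstructured text; missing fields become None."""
    folio = FOLIO_RE.search(text)
    amount = AMOUNT_RE.search(text)
    term = TERM_RE.search(text)
    return {
        "folio": folio.group(1) if folio else None,
        # Strip thousands separators and keep the amount as an exact decimal string.
        "amount": str(Decimal(amount.group(1).replace(",", ""))) if amount else None,
        "term": term.group(1).strip() if term else None,
    }


def store(conn, raw_text, fields):
    """Save the raw text first, then the structured row that points back to it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_documents (id INTEGER PRIMARY KEY, body TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mortgages ("
        "id INTEGER PRIMARY KEY, raw_id INTEGER, folio TEXT, amount TEXT, term TEXT)"
    )
    cur = conn.execute("INSERT INTO raw_documents (body) VALUES (?)", (raw_text,))
    conn.execute(
        "INSERT INTO mortgages (raw_id, folio, amount, term) VALUES (?, ?, ?, ?)",
        (cur.lastrowid, fields["folio"], fields["amount"], fields["term"]),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    fields = extract_fields(SAMPLE_TEXT)
    store(conn, SAMPLE_TEXT, fields)
    print(fields)
```

The folio column is what I would use to join against the separate property table.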
I was thinking of using Apache Tika to extract the text and then Apache Pig to parse it. However, if you are an expert in this area, you may know a better way.