A large number of PDFs (~300), totaling ~3,000 pages, need to be converted to UTF-8 text. Some of these contain image segments and table segments. All have headers and footers, which are not wanted. Only the body text of each page is desired, with that content kept in its entirety. However, document titles (as shown on the cover page) must be retained.
Documents must first be “stitched together” into a single master PDF document; the final deliverable is a single master text document.
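To illustrate the header/footer requirement, here is a minimal Python sketch of one possible approach, not a prescribed method. It assumes the text of each page has already been extracted or OCR'd into one string per page. The function names (strip_repeated_margins, build_master_text) and the parameters margin_lines and min_share are hypothetical, chosen only for this example. The idea is that a line sitting at the top or bottom of a page and recurring on most pages (with digits masked, so changing page numbers still match) is treated as a header or footer and dropped.

```python
import re
from collections import Counter


def _normalize(line):
    # Mask digits so "Page 1" and "Page 2" count as the same footer line.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_margins(pages, margin_lines=1, min_share=0.5):
    """Drop top/bottom lines that recur on many pages (assumed headers/footers).

    pages: list of strings, one per page (hypothetical input format).
    margin_lines: how many lines at the top and bottom of a page to inspect.
    min_share: fraction of pages a line must appear on to count as repeated.
    """
    split_pages = [[ln for ln in p.splitlines() if ln.strip()] for p in pages]

    # Count each candidate margin line at most once per page.
    counts = Counter()
    for lines in split_pages:
        margin = lines[:margin_lines] + lines[-margin_lines:]
        counts.update({_normalize(ln) for ln in margin})

    threshold = max(2, int(len(pages) * min_share))
    repeated = {ln for ln, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        body = []
        for i, ln in enumerate(lines):
            in_margin = i < margin_lines or i >= len(lines) - margin_lines
            if in_margin and _normalize(ln) in repeated:
                continue  # treated as header/footer
            body.append(ln)
        cleaned.append("\n".join(body))
    return cleaned


def build_master_text(pages):
    # Join the cleaned pages into one master text, blank line between pages.
    return "\n\n".join(strip_repeated_margins(pages))


if __name__ == "__main__":
    sample = [
        "ACME Report\nIntro text one.\nMore body.\nPage 1",
        "ACME Report\nSecond page body.\nEven more.\nPage 2",
        "ACME Report\nThird page body.\nClosing.\nPage 3",
    ]
    print(build_master_text(sample))
```

A title that appears only on a cover page is not repeated across pages, so this heuristic leaves it in place. If a running header repeats the document title, the cover page would need separate handling. Real OCR output is noisier than this sample, so fuzzy matching may be needed in practice.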
Prior experience with, and access to, professional Optical Character Recognition (OCR) software is requested.
Please describe similar work/experience.
NOTE: Tesseract is freeware, but access to professional OCR software is preferable.