I am scanning documents and want to OCR them. Noise in the scans causes Tesseract to recognize numbers that are not there. With ImageMagick (magick) I can remove these stray pixels, but then even the periods between numbers are removed. I only want to remove pixels that are not close to bigger blobs (text).
In the sample I sent, these are the pixels I want removed.
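To make the rule concrete, here is a rough sketch of the filtering I have in mind, not a finished solution. It assumes the page has already been binarized into a list of rows of 0/1 values (1 = ink); loading noisesample.tif into that form would need an image library and is not shown. The names `label_components` and `remove_isolated_specks`, and the thresholds `min_area` and `max_dist`, are made up for illustration and would need tuning on real scans.

```python
from collections import deque


def label_components(img):
    """Return the 8-connected foreground components as lists of (row, col)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue = deque([(r, c)])
                comp = []
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    # Visit all 8 neighbours of the current pixel.
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                comps.append(comp)
    return comps


def remove_isolated_specks(img, min_area=4, max_dist=2):
    """Clear small components unless they lie close to a large component.

    min_area and max_dist are hypothetical thresholds: components with at
    least min_area pixels count as text; smaller ones survive only if some
    text pixel is within max_dist pixels (Chebyshev distance).
    """
    comps = label_components(img)
    big = set()
    for comp in comps:
        if len(comp) >= min_area:
            big.update(comp)
    out = [row[:] for row in img]
    for comp in comps:
        if len(comp) >= min_area:
            continue
        near_text = any(
            (y + dy, x + dx) in big
            for (y, x) in comp
            for dy in range(-max_dist, max_dist + 1)
            for dx in range(-max_dist, max_dist + 1)
        )
        if not near_text:
            for y, x in comp:
                out[y][x] = 0
    return out
```

With these defaults, a single-pixel period two columns away from a digit is kept, while a single pixel far from any text is cleared. Pure Python like this would be slow on a full-page scan, so it only illustrates the rule.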
I ran Tesseract like this:
tesseract -c include_page_breaks=1 -c preserve_interword_spaces=1 -l deu -psm 4 noisesample.tif noise_sample
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
I also sent a modified file showing how it should look afterwards, with the noise removed.
I'd prefer to let Tesseract do the job so as not to increase processing time, but any tool is welcome as long as it runs on Linux/Debian.
If this works well, it would also be nice to remove all kinds of graphics, especially straight horizontal lines.
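For the horizontal lines, one simple idea is to clear any unbroken horizontal run of ink that is longer than a character could plausibly be. The sketch below assumes the page is a binarized list of rows of 0/1 values (1 = ink); the function name `remove_horizontal_lines` and the `min_len` threshold are hypothetical and only illustrate the idea.

```python
def remove_horizontal_lines(img, min_len=20):
    """Return a copy of img with long horizontal ink runs cleared.

    min_len is a hypothetical threshold: any unbroken run of at least
    min_len foreground pixels within one row is treated as a ruled line.
    """
    out = [row[:] for row in img]
    for r, row in enumerate(img):
        w = len(row)
        c = 0
        while c < w:
            if row[c]:
                start = c
                # Advance to the end of this run of foreground pixels.
                while c < w and row[c]:
                    c += 1
                if c - start >= min_len:
                    for x in range(start, c):
                        out[r][x] = 0
            else:
                c += 1
    return out
```

This also erases the pixels where a line crosses text, and it misses lines that are skewed or broken by noise, so a real solution would likely need something more robust.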
August 31, 2018
I am willing to pay higher rates for the most experienced freelancers