April 26, 2012


Attached are sample PDFs. I need them to be searchable, with text that can be copied.

I think there are 2 ways to do it:

1. Play with the settings of the PDF. Maybe it can be modified:

These PDFs were created using non-Unicode fonts, where Hebrew symbols sit in place of extended Latin characters (that's what you see when you copy the text). The trick to making it work without OCR (with its inevitable errors, plus the PDF becoming huge, because the text has to be converted to an image and the recognized text added as one more layer) is to change each font's encoding.

Right in the body of the PDF:

<</Differences[224/agrave 227/atilde 233/eacute 236/igrave 238/icircumflex]/Type/Encoding>>

Here it has to be analysed thoroughly which Latin character sits in place of each Hebrew character. Then it is possible to modify the encoding, something like:

<</Differences[224/ 227/tavhebrew 233/alefhebrew 236/hethebrew 238/bethebrew]/Type/Encoding>>

(arbitrarily changed to the Hebrew alphabet)
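
To make the renaming step concrete, here is a minimal sketch in Python. The mapping is hypothetical: it just reuses the arbitrary pairs from the example above, and the real pairs would have to be worked out per font. The function name `remap_glyph_names` and the regular expression are my own assumptions, not anything taken from the PDFs.

```python
import re

# Hypothetical mapping, copied from the arbitrary example in the post.
# The real pairs must be found by checking which Latin glyph name
# stands in for which Hebrew letter in each font.
LATIN_TO_HEBREW = {
    "atilde": "tavhebrew",
    "eacute": "alefhebrew",
    "igrave": "hethebrew",
    "icircumflex": "bethebrew",
}

# A glyph name token: a slash followed by name characters.
GLYPH_NAME = re.compile(r"/([A-Za-z][A-Za-z0-9._]*)")


def remap_glyph_names(encoding_text, mapping=LATIN_TO_HEBREW):
    """Replace known Latin glyph names; leave every other name untouched."""
    def swap(match):
        name = match.group(1)
        return "/" + mapping.get(name, name)
    return GLYPH_NAME.sub(swap, encoding_text)


sample = "[224/agrave 227/atilde 233/eacute 236/igrave 238/icircumflex]/Type/Encoding"
print(remap_glyph_names(sample))
```

Names that are not in the mapping (here `agrave`, and keys such as `Type`) pass through unchanged, so an incomplete mapping does not damage the rest of the dictionary.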

This wouldn't work directly in the PDF, but if you print it to PostScript, it can be done with simple text replacement and then distilling back to PDF. Hence, it is easily automated.
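
A rough sketch of how that round trip could be automated, in Python. It assumes the `pdftops` (Poppler/Xpdf) and `ps2pdf` (Ghostscript) command-line tools as stand-ins for "print to PostScript" and "distill"; the rename pairs, function names and file names are hypothetical. It has not been tried on these files, and whether the result renders correctly depends on how the fonts are embedded, which this sketch does not check.

```python
import re
import subprocess
from pathlib import Path

# Hypothetical glyph-name fixes; the real pairs go here once they are known.
RENAMES = {b"atilde": b"tavhebrew", b"eacute": b"alefhebrew"}

# A glyph name token in raw PostScript bytes.
NAME_TOKEN = re.compile(rb"/([A-Za-z][A-Za-z0-9._]*)")


def rename_glyphs(ps_data, renames=RENAMES):
    """Rewrite /name tokens in PostScript bytes, matching whole names only."""
    return NAME_TOKEN.sub(
        lambda m: b"/" + renames.get(m.group(1), m.group(1)), ps_data
    )


def fix_pdf(src, dst, workdir):
    """PDF -> PostScript -> renamed PostScript -> PDF, using external tools."""
    workdir = Path(workdir)
    raw_ps = workdir / "raw.ps"
    fixed_ps = workdir / "fixed.ps"
    subprocess.run(["pdftops", str(src), str(raw_ps)], check=True)
    fixed_ps.write_bytes(rename_glyphs(raw_ps.read_bytes()))
    subprocess.run(["ps2pdf", str(fixed_ps), str(dst)], check=True)


def fix_folder(in_dir, out_dir, workdir):
    """Run the same fix over every PDF in a folder."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        fix_pdf(pdf, Path(out_dir) / pdf.name, workdir)
```

Working on bytes rather than decoded text keeps any binary data in the PostScript file intact; only exact glyph-name tokens listed in `RENAMES` are changed.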

It's not so much about the codes as about the letter (glyph) names, so for fonts that are in the wrong encoding it takes changing the names, e.g. 'dotaccent' to 'ved' and so on. Complicated stuff indeed, but it would yield the best possible results. It takes deep knowledge of PDF/PostScript and fonts, though.
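
A toy illustration of why the names matter, in Python. It mimics copy/paste as a two-step lookup: byte code to glyph name, then glyph name to Unicode character. The four name/character pairs follow the Adobe Glyph List; the page bytes and the two encodings are made up for the example.

```python
# Small excerpt of glyph name -> character pairs (Adobe Glyph List names).
GLYPH_TO_CHAR = {
    "eacute": "\u00e9",      # Latin e with acute
    "igrave": "\u00ec",      # Latin i with grave
    "alefhebrew": "\u05d0",  # Hebrew alef
    "hethebrew": "\u05d7",   # Hebrew het
}


def extract_text(codes, encoding):
    """Mimic copy/paste: byte code -> glyph name -> Unicode character."""
    return "".join(GLYPH_TO_CHAR[encoding[code]] for code in codes)


# The same byte codes, looked up through two different encodings.
wrong_encoding = {233: "eacute", 236: "igrave"}
fixed_encoding = {233: "alefhebrew", 236: "hethebrew"}

page_codes = [233, 236]  # hypothetical bytes from a page's text stream

print(extract_text(page_codes, wrong_encoding))  # Latin letters
print(extract_text(page_codes, fixed_encoding))  # Hebrew letters
```

The bytes on the page never change; only the names they resolve to do, which is why fixing the names is enough to make the copied text come out as Hebrew.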

2. Convert to searchable PDF using OCR

If required, I can provide fonts similar to those used in the PDFs.

I have 1000's of PDFs like this and I have attached about 30 samples. If you can fix this, please show me a sample of 1 page so I can see.


