(Using macOS)
Dependencies:
- ImageMagick
- Tesseract
- pdfmerge (the pdfmerge.py script, which needs the pdfrw Python package)
curl -OL https://github.com/gsauthof/utility/raw/master/pdfmerge.py
pip install pdfrw
brew install imagemagick
brew install tesseract
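Before running the steps, it can help to confirm that the tools are actually on the PATH. The following is a minimal sketch, not part of the original instructions: `check_deps` is a hypothetical helper name, and the command names passed to it (`convert`, `tesseract`, `python`) are assumed from the commands used in this note.

```shell
#!/bin/sh
# check_deps CMD...
# Hypothetical helper: prints "missing: CMD" to stderr for every command
# that cannot be found on PATH. Returns 0 if all are present, 1 otherwise.
check_deps() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call would be `check_deps convert tesseract python || exit 1`. The pdfrw module can be checked separately with `python -c 'import pdfrw'`.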
Assuming the source file is called report.pdf:
Convert the PDF into one PNG per page:
convert -density 150 report.pdf +adjoin report-%03d.png
Perform OCR on each page and produce a text-only PDF called textonly.pdf:
ls report-*.png | tesseract -c textonly_pdf=1 --dpi 150 - textonly pdf
Merge the original PDF with the text-only PDF into a PDF called searchable.pdf:
python pdfmerge.py --pdfrw textonly.pdf report.pdf searchable.pdf
Using report.pdf as input, this produced searchable.pdf.
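The three steps can also be wrapped in one shell function so that the intermediate PNGs and the text-only PDF end up in a temporary directory instead of the working directory. This is a sketch under assumptions, not a tested replacement for the commands above: `make_searchable` and the `PDFMERGE` variable are hypothetical names, and pdfmerge.py is assumed to sit in the current directory unless `PDFMERGE` says otherwise.

```shell
#!/bin/sh
# make_searchable INPUT.pdf [OUTPUT.pdf] [DPI]
# Hypothetical wrapper around the three steps: PDF -> PNGs -> text-only PDF
# -> merged PDF. Defaults: OUTPUT=searchable.pdf, DPI=150.
# Assumption: pdfmerge.py is at ./pdfmerge.py unless PDFMERGE is set.
make_searchable() {
    input=$1
    output=${2:-searchable.pdf}
    dpi=${3:-150}
    pdfmerge=${PDFMERGE:-./pdfmerge.py}
    workdir=$(mktemp -d) || return 1
    status=0

    # Step 1: one PNG per page, written into the temporary directory.
    convert -density "$dpi" "$input" +adjoin "$workdir/page-%03d.png" || status=1

    # Step 2: OCR every page; tesseract reads the file list from stdin
    # and writes $workdir/textonly.pdf.
    if [ "$status" -eq 0 ]; then
        ls "$workdir"/page-*.png |
            tesseract -c textonly_pdf=1 --dpi "$dpi" - "$workdir/textonly" pdf ||
            status=1
    fi

    # Step 3: merge the text layer with the original pages.
    if [ "$status" -eq 0 ]; then
        python "$pdfmerge" --pdfrw "$workdir/textonly.pdf" "$input" "$output" ||
            status=1
    fi

    # Clean up intermediates whether or not the steps succeeded.
    rm -rf "$workdir"
    return "$status"
}
```

With the function loaded, `make_searchable report.pdf` would correspond to the commands above. The same DPI value is passed to both `convert` and `tesseract`, matching the 150 used in both commands above.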