@wcaleb
Created November 6, 2013 14:41
Take a PDF, OCR it, and add OCR Text as background layer to original PDF to make it searchable
#!/bin/sh
# Take a PDF, OCR it, and add OCR Text as background layer to original PDF to make it searchable.
# Hacked together using tips from these websites:
# http://www.jlaundry.com/2012/ocr-a-scanned-pdf-with-tesseract/
# http://askubuntu.com/questions/27097/how-to-print-a-regular-file-to-pdf-from-command-line
# Dependencies: pdftk, tesseract, imagemagick (convert), enscript, ghostscript (ps2pdf)
# Would be nice to use hocr2pdf instead so that the text lines up with the PDF image.
# http://www.exactcode.com/site/open_source/exactimage/hocr2pdf/
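# Keep a backup; quote "$1" so file names with spaces work
cp "$1" "$1.bak"
# Four-digit page numbers keep documents of 100+ pages in order when the pages are globbed later
pdftk "$1" burst output tesspage_%04d.pdf
for file in tesspage_*.pdf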
do
PAGE=$(basename "$file" .pdf)
# Convert the PDF page into a TIFF file
convert -monochrome -density 600 "$file" "$PAGE".tif
# OCR the TIFF file and save text to output.txt
tesseract "$PAGE".tif output
# Turn the text file produced by tesseract into a PDF, then put it in the background of the original page
enscript output.txt -B -o - | ps2pdf - output.pdf && pdftk "$file" background output.pdf output new-"$file"
# Clean up
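# Remove only the files this script created, not every output* or *.tif in the directory
rm -f output.txt output.pdf
rm "$file"
rm -f "$PAGE".tif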
done
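# Join the processed pages back into the original file name, then remove the per-page files
pdftk new-tesspage_*.pdf cat output "$1" && rm new-tesspage_*.pdf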
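
The wish in the header comments (OCR text that lines up with the page image) can probably be met without hocr2pdf, assuming a Tesseract build that includes the PDF renderer (3.03 or later, as far as I know). A minimal sketch, not part of the original script; `ocr_page` is a hypothetical helper name, and the resulting PDF embeds the rasterized TIFF rather than the original page, so file size and image quality will differ:

```shell
#!/bin/sh
# Hypothetical helper: OCR one TIFF and write a searchable PDF next to it.
# Assumes a tesseract build with the "pdf" output config available.
ocr_page() {
    tif=$1
    base=${tif%.tif}    # tesseract appends ".pdf" to this output base
    # The trailing "pdf" selects tesseract's PDF renderer, which places
    # the recognized text invisibly over the page image.
    tesseract "$tif" "$base" pdf
}
```

Inside the loop this would be called as `ocr_page "$PAGE".tif` in place of the enscript/ps2pdf step, and the resulting per-page PDFs joined with `pdftk ... cat output` as before.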

ramack19 commented Mar 9, 2023

> I would say that the most modern variant is ocrmypdf, which is a nice wrapper around tesseract and adds some extra features. It's natively in Linux repos.

Available... yes and no. ocrmypdf isn't in every corporate repo, while tesseract is more widely available. I ran into this at a former workplace that did a lot of DoD-type work and had a pretty restrictive Linux VM: ocrmypdf wasn't readily available, but tesseract was.
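
One way to handle both situations is to check at run time which tool is installed. A rough sketch under that assumption; `pick_ocr_tool` is a made-up function name, not something from either project:

```shell
#!/bin/sh
# Hypothetical helper: print which OCR front end is available on this machine.
# Prefers ocrmypdf, falls back to plain tesseract, prints "none" otherwise.
pick_ocr_tool() {
    if command -v ocrmypdf >/dev/null 2>&1; then
        printf '%s\n' ocrmypdf
    elif command -v tesseract >/dev/null 2>&1; then
        printf '%s\n' tesseract
    else
        printf '%s\n' none
        return 1
    fi
}
```

When this prints `ocrmypdf`, the whole job is a single `ocrmypdf scan.pdf scan-ocr.pdf` (file names are placeholders); when it prints `tesseract`, the per-page script in this gist is the fallback.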