Gist wcaleb/7337097, created November 6, 2013 14:41
Take a PDF, OCR it, and add OCR Text as background layer to original PDF to make it searchable
#!/bin/sh
# Take a PDF, OCR it, and add OCR Text as background layer to original PDF to make it searchable.
# Hacked together using tips from these websites:
# http://www.jlaundry.com/2012/ocr-a-scanned-pdf-with-tesseract/
# http://askubuntu.com/questions/27097/how-to-print-a-regular-file-to-pdf-from-command-line
# Dependencies: pdftk, tesseract, imagemagick, enscript, ps2pdf
# Would be nice to use hocr2pdf instead so that the text lines up with the PDF image.
# http://www.exactcode.com/site/open_source/exactimage/hocr2pdf/
# Keep a backup, because the final step overwrites the input file
cp "$1" "$1.bak"
# Split into one PDF per page; four digits keep the glob order right past 99 pages
pdftk "$1" burst output tesspage_%04d.pdf
# Loop over the glob directly instead of parsing ls output
for file in tesspage_*.pdf
do
PAGE=$(basename "$file" .pdf)
# Convert the PDF page into a TIFF file
convert -density 600 "$file" -monochrome "$PAGE.tif"
# OCR the TIFF file and save text to output.txt
tesseract "$PAGE.tif" output
# Turn text file output by tesseract into a PDF, then put it in background of original page
enscript output.txt -B -o - | ps2pdf - output.pdf && pdftk "$file" background output.pdf output "new-$file"
# Clean up only this page's intermediate files
rm -f output.txt output.pdf "$file" "$PAGE.tif"
done
# Join the pages over the input file (the original is in $1.bak), then remove the page files
pdftk new-tesspage_*.pdf cat output "$1" && rm new-tesspage_*.pdf
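The wish in the script's comments, that the text line up with the page image, can also be met without hocr2pdf: Tesseract 3.03 and later can write a searchable PDF itself when given the `pdf` output config. Below is a minimal sketch under that assumption. `page_base` and `ocr_page` are hypothetical helper names, and the 300 dpi density is an arbitrary choice. Note that the resulting page holds the rasterized monochrome image rather than the original page content.

```shell
#!/bin/sh
# Sketch only: assumes tesseract >= 3.03 (for the "pdf" output config)
# and ImageMagick's convert are installed.

# page_base (hypothetical helper): strip the directory and the .pdf suffix.
page_base() {
    basename "$1" .pdf
}

# ocr_page (hypothetical helper): rasterize one single-page PDF, then let
# tesseract write ocr-<name>.pdf, the page image with an invisible text layer.
ocr_page() {
    base=$(page_base "$1")
    convert -density 300 "$1" -monochrome "$base.tif" &&
        tesseract "$base.tif" "ocr-$base" pdf &&
        rm -f "$base.tif"
}
```

After bursting the input with pdftk, this could be run as `for f in tesspage_*.pdf; do ocr_page "$f"; done`, followed by `pdftk ocr-tesspage_*.pdf cat output searchable.pdf`.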
I would say that the most modern variant is ocrmypdf, which is a nice wrapper around tesseract and adds some extra features. It's available natively in Linux repos.
Available... yes and no. ocrmypdf isn't in every corporate repo, while tesseract is more widely available. I ran into this at a former workplace that did a lot of DoD-type work and had a pretty restrictive Linux VM: ocrmypdf wasn't readily available, but tesseract was.
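One way to cope with that difference is to check at run time which tool is installed. Here is a rough sketch: `first_available` and `make_searchable` are hypothetical names, and the only ocrmypdf usage assumed is its basic `ocrmypdf input.pdf output.pdf` form.

```shell
#!/bin/sh
# Sketch only: choose an OCR route depending on what is installed.

# first_available (hypothetical helper): print the first of the given
# commands that is found on PATH; fail if none of them is.
first_available() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    return 1
}

# make_searchable (hypothetical helper): use ocrmypdf when present,
# otherwise report that the tesseract-based route is needed.
make_searchable() {
    case $(first_available ocrmypdf tesseract) in
        ocrmypdf)  ocrmypdf "$1" "$2" ;;
        tesseract) echo "ocrmypdf not found; use the tesseract-based script" >&2
                   return 2 ;;
        *)         echo "no OCR tool found" >&2
                   return 1 ;;
    esac
}
```

With this loaded, `make_searchable scan.pdf scan-ocr.pdf` either runs ocrmypdf or says which fallback applies.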
That's what I mostly use now, but this gist served me well for years.