#!/bin/sh
# Take a PDF, OCR it, and add OCR Text as background layer to original PDF to make it searchable.
# Hacked together using tips from these websites:
# http://www.jlaundry.com/2012/ocr-a-scanned-pdf-with-tesseract/
# http://askubuntu.com/questions/27097/how-to-print-a-regular-file-to-pdf-from-command-line
# Dependencies: pdftk, tesseract, imagemagick, enscript, ps2pdf
# Would be nice to use hocr2pdf instead so that the text lines up with the PDF image.
# http://www.exactcode.com/site/open_source/exactimage/hocr2pdf/
# Keep a backup of the original; quoting "$1" allows file names with spaces
cp "$1" "$1.bak"
# Split the PDF into one file per page
pdftk "$1" burst output tesspage_%02d.pdf
for file in tesspage*
do
  PAGE=$(basename "$file" .pdf)
  # Convert the PDF page into a TIFF file
  convert -density 600 "$file" -monochrome "$PAGE".tif
  # OCR the TIFF file and save text to output.txt
  tesseract "$PAGE".tif output
  # Turn text file output by tesseract into a PDF, then put it in background of original page
  enscript output.txt -B -o - | ps2pdf - output.pdf && pdftk "$file" background output.pdf output new-"$file"
  # Clean up
  rm output*
  rm "$file"
  rm *.tif
done
# Join the processed pages back into the original file name
pdftk new* cat output "$1"
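The script assumes every tool in its "Dependencies" line is installed and fails partway through if one is not. As a minimal sketch (the helper name `missing_deps` is hypothetical and not part of the gist), a check like this could run before any files are touched:

```shell
#!/bin/sh
# Hypothetical helper: print each command that is not on PATH, one per line.
# Returns non-zero if at least one command is missing.
missing_deps() {
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            status=1
        fi
    done
    return "$status"
}

# Example call with the tools the script invokes
# (convert is ImageMagick's command-line program):
# missing_deps pdftk tesseract convert enscript ps2pdf || exit 1
```

Placing the call at the top of the script would stop it before the backup copy and the page split are created.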
tesseract can now produce a PDF with embedded text directly, using the pdf
config option. It's used something like this:
tesseract input.tif outputbase pdf
which would create outputbase.pdf
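For a multi-page PDF, the pdf config option could replace the enscript, ps2pdf and background steps of the gist. The following is only a sketch under stated assumptions: the function name `ocr_pdf_direct`, the scratch directory layout and the 600 dpi density (copied from the gist) are illustrative choices, and the resulting pages contain the rasterized image plus a hidden text layer rather than the original page content.

```shell
#!/bin/sh
# Sketch: OCR each page with tesseract's "pdf" config and join the results.
# Usage: ocr_pdf_direct input.pdf output.pdf
ocr_pdf_direct() {
    in=$1
    out=$2
    work=$(mktemp -d) || return 1
    # Split the input into one PDF per page inside the scratch directory.
    if ! pdftk "$in" burst output "$work/page_%04d.pdf"; then
        rm -rf "$work"
        return 1
    fi
    for page in "$work"/page_*.pdf; do
        base=${page%.pdf}
        # Rasterize the page, then let tesseract write "$base-ocr.pdf".
        if ! convert -density 600 "$page" -monochrome "$base.tif" ||
           ! tesseract "$base.tif" "$base-ocr" pdf; then
            rm -rf "$work"
            return 1
        fi
    done
    # Join the searchable pages; zero-padded names keep them in page order.
    pdftk "$work"/page_*-ocr.pdf cat output "$out"
    status=$?
    rm -rf "$work"
    return "$status"
}
```

Because all intermediate files live in a directory created by `mktemp -d`, the cleanup removes only that directory and nothing in the current one.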
scruss,
Thank you for stating that! That simplifies the process significantly! Plus I now have all the packages on our server needed to convert PDFs to PDFs with embedded text. I do not have to go through our IT approval process to get ocrmypdf installed; tesseract can do it.
Thanks!
I would say that the most modern variant is ocrmypdf, which is a nice wrapper around tesseract and adds some extra features. It's available natively in Linux repos.
That's what I mostly use now. But this gist served me well for years.
Available...yes and no. ocrmypdf isn't available on all corporate repos, but tesseract is more widely available. I ran into this at a former workplace that did a lot of DoD-type work and had a pretty restrictive Linux VM. ocrmypdf wasn't readily available; however, tesseract was.
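On a machine where it is unclear which tool is installed, a wrapper could check at run time and fall back accordingly. This is a rough sketch; the function name `pick_ocr_tool` and the preference order (ocrmypdf first, then tesseract) are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the preferred OCR tool found on PATH.
# Prints "ocrmypdf", "tesseract", or "none" (with a non-zero return).
pick_ocr_tool() {
    if command -v ocrmypdf >/dev/null 2>&1; then
        echo ocrmypdf
    elif command -v tesseract >/dev/null 2>&1; then
        echo tesseract
    else
        echo none
        return 1
    fi
}

# Example dispatch (not run here):
# case $(pick_ocr_tool) in
#     ocrmypdf)  ocrmypdf input.pdf output.pdf ;;
#     tesseract) tesseract input.tif outputbase pdf ;;
#     *)         echo "no OCR tool found" >&2 ;;
# esac
```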
Make sure you read this script before using it. It removes files via wildcards.
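One way to limit the damage from wildcard cleanup is to run the work inside a throwaway directory, so that patterns such as `output*` can only match files created there. A minimal sketch, assuming a hypothetical helper named `run_in_scratch`:

```shell
#!/bin/sh
# Hypothetical helper: run a command inside a fresh scratch directory,
# then delete only that directory. The command's exit status is returned.
run_in_scratch() {
    scratch=$(mktemp -d) || return 1
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$scratch" && "$@" )
    status=$?
    rm -rf "$scratch"
    return "$status"
}
```

Since the command runs in a different directory, any input or output PDF has to be passed as an absolute path.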