Last active
October 3, 2020 10:26
-
-
Save jbothma/1245ac0ba9c51a8df6676b475ed47fc9 to your computer and use it in GitHub Desktop.
Adding OCR text to a PDF (even a noisy one)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# usage: ocr-pdf scanned.pdf | |
set -euf -o pipefail -o xtrace | |
INFILE=$1 | |
BASENAME=$(basename "$1" .pdf) | |
TIFFFILE=$BASENAME.tiff | |
OCRDPDFNOEXT=$BASENAME-OCRd-big | |
OCRDPDF=$OCRDPDFNOEXT.pdf | |
SMALLEROCRDPDF=$BASENAME-OCRd.pdf | |
# Make a multipage TIFF of the original PDF ~700MB | |
gs -o "$TIFFFILE" -sDEVICE=tiff32nc -r300 "$INFILE" | |
# OCR the TIFF using tesseract4 alpha in docker, store as PDF of images+positioned text ~16MB | |
docker run --rm -v $PWD:$PWD tesseractshadow/tesseract4re tesseract "$PWD/$TIFFFILE" "$PWD/$OCRDPDFNOEXT" pdf | |
rm "$TIFFFILE" | |
# Convert images in PDF to jpeg to reduce size ~4MB | |
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -sOutputFile="$SMALLEROCRDPDF" "$OCRDPDF" |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment