Adding OCR text to a PDF (even a noisy one)
#!/usr/bin/env bash
# usage: ./pdf-ocr.sh scanned.pdf
#
# Download this file
# Make it executable:
# chmod +x pdf-ocr.sh
# Run it on your example file
# ./pdf-ocr.sh scanned.pdf

set -euf -o pipefail -o xtrace

INFILE=$1
BASENAME=$(basename "$1" .pdf)
TIFFFILE=$BASENAME.tiff
OCRDPDFNOEXT=$BASENAME-OCRd-big
OCRDPDF=$OCRDPDFNOEXT.pdf
SMALLEROCRDPDF=$BASENAME-OCRd.pdf

# Make a multipage TIFF of the original PDF ~700MB
gs -o "$TIFFFILE" -sDEVICE=tiff32nc -r300 "$INFILE"

# OCR the TIFF using tesseract4
tesseract "$PWD/$TIFFFILE" "$PWD/$OCRDPDFNOEXT" pdf
rm "$TIFFFILE"

# Convert images in PDF to jpeg to reduce size ~4MB
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -sOutputFile="$SMALLEROCRDPDF" "$OCRDPDF"
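The script above assumes that `gs` and `tesseract` are installed and that its single argument is a readable file ending in `.pdf` (the `basename "$1" .pdf` call strips exactly that suffix). A minimal sketch of a preflight check follows. The helper names `missing_tools` and `is_pdf_arg` are hypothetical and are not part of the original script; they only illustrate one way to fail early with a clear message instead of partway through the pipeline.

```shell
# Hypothetical helper (not in the original script): print the name of every
# requested command that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Hypothetical helper (not in the original script): succeed only when given
# exactly one argument that ends in .pdf and names a readable regular file.
is_pdf_arg() {
  [ "$#" -eq 1 ] || return 1
  case $1 in
    *.pdf) [ -f "$1" ] && [ -r "$1" ] ;;
    *) return 1 ;;
  esac
}
```

One possible use is to define both functions near the top of the script, before `INFILE=$1`, then print a usage line and exit when `is_pdf_arg "$@"` fails, and list the missing tools and exit when `missing_tools gs tesseract` prints anything. The `case` pattern is matched as a string rather than expanded as a glob, so it should still work under the script's `set -f`.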