@lefth
Forked from wcaleb/ocrpdf.sh
Last active November 3, 2021 15:04
Take a PDF, OCR it, and add OCR Text as background layer to original PDF to make it searchable
#!/bin/bash
# NOTE: I recommend pdfsandwich instead of this script, partly because imagemagick (and pdftoppm) fail on large detailed images.
# While that technique does not preserve the original graphics, it can come close.
# To preserve color:
# pdfsandwich -rgb input.pdf
# To preserve grey tones:
# pdfsandwich -gray input.pdf
# To disable all preprocessing:
# pdfsandwich -nopreproc input.pdf
set -m # turn on job control for parallel processes
# Source:
# https://gist.github.com/wcaleb/7337097
# https://gist.github.com/jburon/d31e0132dfb291dc804bac019f9d9023
#
# Changes:
# - Don't delete files with wildcards. Always use a (random) prefix.
# - Fix extensions of generated files.
# - Don't use greyscale because it's not compatible with some versions of tesseract.
# - Clean up all generated files afterwards.
# - Keep the hocr2pdf command from another fork, but comment it out because it failed in my tests.
# - Process pages in parallel instead of using multithreading in tesseract (which is less efficient).
# Override the job parallelism by setting THREAD_COUNT.
# Take a PDF, OCR it, and add OCR Text as background layer to original PDF to make it searchable.
# Hacked together using tips from these websites:
# http://www.jlaundry.com/2012/ocr-a-scanned-pdf-with-tesseract/
# http://askubuntu.com/questions/27097/how-to-print-a-regular-file-to-pdf-from-command-line
# Dependencies: pdftk, tesseract, imagemagick, enscript or hocr2pdf/pdfsandwich
function process_page() {
    local FILE=$1
    echo "Processing $FILE"
    local PAGE=$(basename "$FILE" .pdf)
    # Convert the PDF page into a TIFF file
    local IMG=$PAGE.tif
    convert -density 600 "$FILE" "$IMG"
    # OCR the TIFF file and save the text to ${PAGE}_output.txt (or ${PAGE}_output.hocr in hocr mode)
    OMP_THREAD_LIMIT=1 tesseract "$IMG" "${PAGE}_output"
    # Turn the text file output by tesseract into a PDF, then put it in the background of the original page
    #enscript output.txt -B -o - | ps2pdf - output.pdf && pdftk $FILE background output.pdf output new-"$FILE"
    enscript "${PAGE}_output.txt" -B -o - | ps2pdf - "${PAGE}_output.pdf" && pdftk "$FILE" background "${PAGE}_output.pdf" output "new-$FILE"
    #tesseract "$IMG" "${PAGE}_output" hocr
    ## Turn the HTML output by tesseract into a PDF, combined with the original image as foreground
    #hocr2pdf -i "$IMG" -o "new-${FILE}" < "${PAGE}_output.hocr"
    # Clean up this page's intermediate files (the "new-" output is kept for the final merge)
    rm "$PAGE"*
}
function wait_jobs() {
    # Block until fewer than THREAD_COUNT (default: nproc) background jobs are running
    while [[ $(jobs -r | wc -l) -gt $((${THREAD_COUNT:-$(nproc)} - 1)) ]]; do
        sleep 0.25
    done
}
if [[ $# -eq 0 || ! -e $1 ]]
then
    echo "Adds an OCR text layer to a PDF file to make searching easier."
    echo "Usage: $0 <pdf file>"
    exit 1
fi
TEMPNAME=$(mktemp -p . -u)
TEMPNAME=${TEMPNAME/.\//} # remove "./"
[[ -e $TEMPNAME ]] && echo "Could not create temp filenames" >&2 && exit 1
# Quote "$1" so paths containing spaces work
cp "$1" "$1.bak"
pdftk "$1" burst output "${TEMPNAME}_tesspage_%05d.pdf"
for FILE in "${TEMPNAME}"_tesspage*
do
    process_page "$FILE" &
    wait_jobs
done
wait
pdftk "new-${TEMPNAME}"* cat output "$1"
# Clean up
rm doc_data.txt "new-${TEMPNAME}"*
lefth commented Nov 3, 2021

Process pages in parallel instead of using multithreading in tesseract (which is less efficient). Override the job parallelism by setting THREAD_COUNT.
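
For anyone curious how the throttling works, here is a minimal standalone sketch of the same pattern, with no OCR involved. It assumes bash. `fake_page`, `throttle`, `MAX_JOBS`, `LOG`, and `PAGES_DONE` are hypothetical names used only for this illustration (the script itself uses `process_page` and `wait_jobs`), and the `nproc` fallback of 2 is my own assumption for systems without `nproc`.

```shell
#!/bin/bash
# Run at most MAX_JOBS background jobs at a time.
# THREAD_COUNT overrides the default, as in the script above.
MAX_JOBS=${THREAD_COUNT:-$(nproc 2>/dev/null || echo 2)}

LOG=$(mktemp)

# Stand-in for the per-page work (hypothetical; the real script runs convert + tesseract here).
fake_page() {
    sleep 0.1
    echo "done $1" >> "$LOG"
}

# Block while the number of running background jobs is at the limit.
throttle() {
    while [ "$(jobs -r | wc -l)" -ge "$MAX_JOBS" ]; do
        sleep 0.05
    done
}

for n in 1 2 3 4 5 6; do
    fake_page "$n" &
    throttle
done
wait    # let the last batch finish

# Count the completed pages; the arithmetic expansion strips any padding from wc.
PAGES_DONE=$(( $(wc -l < "$LOG") ))
rm -f "$LOG"
echo "pages processed: $PAGES_DONE"
```

With the real script, the equivalent override would look something like `THREAD_COUNT=2 ./ocrpdf.sh scan.pdf`, where `scan.pdf` is a placeholder file name.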
