@chrismooredev
Last active September 23, 2020 16:43
Separates pages from a scanned book (in PDF format), straightens the pages, and OCRs the whole thing.
#!/bin/bash
if [ $# -eq 0 ] ; then
# printf interprets \t and \n itself; plain echo in bash would print them literally
printf 'Usage: %s <input PDF> [output filename] [custom temp dir]\n' "$0"
printf '\tSlices, straightens, and OCRs scanned PDFs from books\n'
printf '\n'
printf '\tIf [output filename] is not provided, then outputs to <input>.normalized.pdf\n'
printf "\tIf [custom temp dir] is not provided, then uses a directory from 'mktemp' which is automatically deleted\n"
printf '\n'
printf '\tUses commands from packages poppler-utils, imagemagick, img2pdf, ghostscript, ocrmypdf\n'
printf '\tAnd deskew, which must be retrieved from GitHub ( https://github.com/galfar/deskew )\n'
exit 2
fi
ifile="$1"
ofile="${2:-$(basename "$ifile" .pdf).normalized.pdf}"
tmp="${3:-$(mktemp -d -p '' 'process-scanned-pdf.XXXXXX')}"
rotate_degrees=270
mkdir -p "$tmp"
mkdir -p "$tmp/raw"
raw="source"
# uses commands
# pdfimages (poppler-utils)
# convert (imagemagick)
# deskew (find on github)
# img2pdf (img2pdf)
# gs (ghostscript)
# ocrmypdf (ocrmypdf)
# Extract PDF's pages into individual images
pdfimages -png "$ifile" "$tmp/raw/$raw"
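# Separate joint pages, and straighten them if they are crooked
find "$tmp/raw/" -name "$raw-*.png" -type f -print0 | while IFS= read -r -d '' file ; do
echo "$file"
# Take the image number from the file name only, so digits in the temp directory path are ignored
imgname="$(basename "$file" .png)"
imgnbr="$(( 10#${imgname##*-} ))"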
convert "$file" -rotate "$rotate_degrees" -crop '2x1@' "$tmp/raw_page-tmp-%d.png"
deskew -b FFFFFF -o "$tmp/raw_page-$(( imgnbr * 2 + 0 )).png" "$tmp/raw_page-tmp-0.png"
deskew -b FFFFFF -o "$tmp/raw_page-$(( imgnbr * 2 + 1 )).png" "$tmp/raw_page-tmp-1.png"
img2pdf --pagesize A4 --output "$tmp/clean_page-$(( imgnbr * 2 + 0 )).pdf" "$tmp/raw_page-$(( imgnbr * 2 + 0 )).png"
img2pdf --pagesize A4 --output "$tmp/clean_page-$(( imgnbr * 2 + 1 )).pdf" "$tmp/raw_page-$(( imgnbr * 2 + 1 )).png"
done
rm -f "$tmp/raw_page-tmp-0.png" "$tmp/raw_page-tmp-1.png"
# Combine them back into one PDF
# Version sort orders the page numbers numerically even when "$tmp" itself contains '-' characters
find "$tmp/" -name "clean_page-*.pdf" -type f -print0 | \
sort --zero-terminated --version-sort | \
xargs -0 \
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile="$tmp/cleaned.pdf"
# OCR PDF
ocrmypdf "$tmp/cleaned.pdf" "$tmp/ocrd.pdf"
mv "$tmp/ocrd.pdf" "$ofile"
# Delete the temp directory only if the script created it (no custom temp dir given as $3)
if [ -z "$3" ] ; then
rm -r "$tmp"
fi
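
The script relies on file names in two places: deriving the two output page indices from each extracted image, and putting the per-page PDFs back in numeric order. Both can be tried in isolation with the sketch below. It assumes pdfimages names its output source-NNN.png, as in the script, and that GNU sort is installed; page_indices and sort_pages are hypothetical helper names, not part of the script.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): map an extracted
# image path to its two output page indices. Digits are read from the file
# name only, so digits in the temp directory name cannot leak in.
page_indices() {
    local name="${1##*/}"      # strip directories: source-007.png
    name="${name%.png}"        # strip extension:   source-007
    local nbr="${name##*-}"    # keep the counter:  007
    nbr="$(( 10#$nbr ))"       # force base 10 so 008 is not parsed as octal
    echo "$(( nbr * 2 )) $(( nbr * 2 + 1 ))"
}

# Hypothetical helper: print the given page file names in numeric order,
# one per line, using GNU sort's version sort on NUL-delimited input.
sort_pages() {
    printf '%s\0' "$@" | sort --zero-terminated --version-sort | tr '\0' '\n'
}

# prints "14 15"
page_indices "/tmp/process-scanned-pdf.a1B2c3/raw/source-007.png"

# prints clean_page-1.pdf, clean_page-2.pdf, clean_page-10.pdf (one per line)
sort_pages clean_page-10.pdf clean_page-2.pdf clean_page-1.pdf
```

Version sort is used here instead of a numeric key on a '-' delimited field because the mktemp directory name itself contains '-' characters, which would shift the field positions.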