@sepastian
Created November 12, 2019 17:23
Recursively perform OCR on images, output searchable PDFs
#!/bin/bash
# Current directory contains a folder named 'img',
# which contains images nested in subfolders, e.g.
#
#   img
#     folder1
#       img1.jpg
#       img2.jpg
#     folder2
#       img3.jpg
#
# For each image in img, create a searchable PDF in pdf.
#
#   pdf
#     folder1
#       img1.jpg.pdf
#       img2.jpg.pdf
#     folder2
#       img3.jpg.pdf
#
# Requires tesseract.
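# 'IFS= read -r' keeps backslashes and leading/trailing spaces in file names.
find img -type f | while IFS= read -r i;
do
  # Mirror the input path under 'pdf' by swapping only the leading 'img'.
  o="pdf${i#img}";
  d=$(dirname "${o}");
  mkdir -p "${d}";
  n="${o}.pdf";
  echo "${n}";
  # tesseract writes the PDF to stdout; redirect it to the target file.
  tesseract "${i}" stdout pdf > "${n}";
done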
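
The path mapping described in the header comments (img/folder1/img1.jpg becomes pdf/folder1/img1.jpg.pdf) can be checked on its own, without running tesseract. The following is a minimal sketch; the function name pdf_path_for is hypothetical and not part of the original script, and stripping the leading 'img' prefix is just one way to express the mapping.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): map an input image path
# under 'img' to the output PDF path under 'pdf'.
pdf_path_for() {
  local src="$1"
  # Strip only the leading 'img' component, then append '.pdf'.
  printf '%s\n' "pdf${src#img}.pdf"
}

# Example paths taken from the header comments.
pdf_path_for "img/folder1/img1.jpg"
pdf_path_for "img/folder2/img3.jpg"
```

Running this prints the two target paths, which makes it easy to confirm where files will land before starting a long OCR run.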