@sepastian
Last active June 18, 2025 12:13
Recursively perform OCR on images, output searchable PDFs
# UPDATE: new version is a one-liner, using GNU parallel.
#
# For each image in img/, create a searchable PDF in pdf/.
#
# Requires tesseract and GNU parallel.
#
# Note: the machine used had 12 cores.
# -j 4 runs 4 tesseract processes in parallel.
# Without -j, parallel starts one job per core, which was very slow here,
# possibly because each tesseract process can itself use several threads.
# A value between 4 and 12 may be faster; this needs testing.
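mkdir -p pdf  # -p: no error if pdf/ already exists
# {/.} is the input basename without extension, so every PDF lands directly in pdf/.
find img/ -type f -name 'page_*.jpg' | parallel -j 4 --verbose 'tesseract {} pdf/{/.} pdf'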
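
The one-liner searches img/ recursively but names each PDF after the image basename only, so two images with the same name in different subdirectories would write to the same PDF. The following is a minimal sketch, not part of the original gist, of one way to mirror the subdirectory layout under pdf/ instead. The helper names out_base and ocr_tree are hypothetical. It assumes file names contain no tabs or newlines, and it uses GNU parallel's --colsep option to split each input line into the {1} and {2} arguments.

```shell
#!/bin/sh
# Sketch only: out_base and ocr_tree are hypothetical helpers, not part of the gist.

# Map an image path under SRC to the tesseract output base under DST,
# keeping subdirectories and dropping the file extension.
# Example: out_base img pdf img/a/page_01.jpg  prints  pdf/a/page_01
out_base() {
    src=${1%/} dst=${2%/} file=$3
    rel=${file#"$src"/}            # path relative to the source directory
    printf '%s\n' "$dst/${rel%.*}" # strip the extension, prepend the target directory
}

# OCR every page_*.jpg under SRC into a mirrored tree under DST.
# The third argument is the number of parallel jobs (default 4, as in the gist).
# Assumption: file names contain no tabs or newlines.
ocr_tree() {
    src=$1 dst=$2 jobs=${3:-4}
    find "$src" -type f -name 'page_*.jpg' | while IFS= read -r f; do
        base=$(out_base "$src" "$dst" "$f")
        mkdir -p "$(dirname "$base")"   # create the mirrored subdirectory first
        printf '%s\t%s\n' "$f" "$base"  # emit "input<TAB>output base" for parallel
    done | parallel -j "$jobs" --colsep '\t' --verbose tesseract {1} {2} pdf
}
```

Calling `ocr_tree img pdf 4` would then write img/a/page_01.jpg to pdf/a/page_01.pdf. The path mapping is kept in its own function so it can be checked without running tesseract.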