@gtfierro
Last active May 11, 2022 16:57
Quick shell script for parallel OCR on PDFs using ghostscript and tesseract
#!/bin/bash
# requires ghostscript (http://www.ghostscript.com/)
# requires ImageMagick
# requires tesseract (https://code.google.com/p/tesseract-ocr/)
# requires GNU parallel (https://www.gnu.org/software/parallel/)
# all of these are typically available through yum/apt/brew/etc.
# number of cores over which the process will be parallelized (pass it as the first argument)
num_cores=$1
# converts the first 20 pages ([0-19]) of each PDF into a multi-page TIFF so that tesseract can read it
find . -name '*.pdf' | parallel --gnu -j "$num_cores" convert -depth 8 -density 200 {}[0-19] {}.tif
# runs OCR on the resulting TIFF files and writes the text to <name>.tif.txt. Assumes English,
# but you can supply extra arguments to tesseract
find . -name '*.tif' | parallel -j "$num_cores" tesseract -l eng {} {}
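
The script above assumes all four tools are already on the PATH. A minimal preflight sketch follows; `missing_tools` is a hypothetical helper name, and the binary names (`gs`, `convert`, `tesseract`, `parallel`) are assumptions about a typical install that may differ on your system (newer ImageMagick releases, for example, also ship a `magick` command).

```shell
#!/bin/bash
# Print the name of every command in the argument list that is not on PATH.
# `missing_tools` is a hypothetical helper, not part of the original script.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed binary names: gs (ghostscript), convert (ImageMagick),
# tesseract, parallel (GNU parallel). Adjust for your system.
missing=$(missing_tools gs convert tesseract parallel)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "missing required tools:" $missing >&2
fi
```

This only warns; in a real run you would probably want to stop before the first `find` if anything is missing.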
@jon4thin

If you just want to run 1 proc per CPU core:

find . -name '*.pdf' | parallel convert -depth 8 -density 200 {} {.}.tif
find . -name '*.tif' | parallel tesseract -l eng {} {}
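
Note the `{.}` in the first command: in GNU parallel, `{}` is the input line as-is and `{.}` is the input line with its extension removed, so the TIFF is named `report.tif` rather than `report.pdf.tif`. A rough sketch of that naming in plain shell, with no parallel needed to try it (`strip_ext` is a hypothetical helper that only approximates `{.}`, and the path is made up):

```shell
#!/bin/bash
# Approximate GNU parallel's {.}: drop a trailing ".ext" from the last
# path component only. `strip_ext` is a hypothetical helper name.
strip_ext() {
    case "${1##*/}" in
        *.*) printf '%s\n' "${1%.*}" ;;   # file name has an extension: remove it
        *)   printf '%s\n' "$1" ;;        # no extension: leave the path alone
    esac
}

f=./scans/report.pdf   # made-up example path
echo "{}.tif  -> $f.tif"
echo "{.}.tif -> $(strip_ext "$f").tif"
```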

What exactly did you do to run each job on a separate core?
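
As far as I know, nothing in the script pins a job to a core. `-j N` only caps how many jobs GNU parallel runs at the same time, and the OS scheduler decides where they land, which in practice usually means one busy core per job. A small sketch for picking N from the machine, assuming either `nproc` or `getconf` is available (`detect_cores` and the `ocr.sh` file name are hypothetical):

```shell
#!/bin/bash
# Print the number of online CPU cores.
# `detect_cores` is a hypothetical helper name.
detect_cores() {
    if command -v nproc >/dev/null 2>&1; then
        nproc                        # GNU coreutils
    else
        getconf _NPROCESSORS_ONLN    # fallback, e.g. on macOS
    fi
}

cores=$(detect_cores)
# ocr.sh stands in for whatever name you saved the gist under.
echo "would run: ./ocr.sh $cores"
```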
