Last active May 11, 2022 16:57
Quick shell script for parallel OCR on PDFs using ghostscript and tesseract
#!/bin/bash
# requires ghostscript (http://www.ghostscript.com/)
# requires ImageMagick
# requires tesseract (https://code.google.com/p/tesseract-ocr/)
# requires GNU parallel (https://www.gnu.org/software/parallel/)
# all of these are typically available through yum/apt/brew/etc.

# number of jobs to run at once (typically the number of cores), passed as the first argument
num_cores=$1

# converts each of the PDFs into a TIFF image so that tesseract can read it;
# [0-19] limits the conversion to the first 20 pages of each PDF
find . -name '*.pdf' | parallel --gnu -j "$num_cores" convert -depth 8 -density 200 {}[0-19] {}.tif

# runs OCR on the found TIFF files and converts them to text. Assumes English, but you can supply
# extra arguments to tesseract
find . -name '*.tif' | parallel -j "$num_cores" tesseract -l eng {} {}
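The output names follow from how the two stages are chained: `{}.tif` appends `.tif` to the full PDF path, and tesseract appends `.txt` to whatever output base it is given. A minimal sketch of that naming, which does not run convert or tesseract; `ocr_output_names` is a hypothetical helper, not part of the script:

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): print the file names
# the two stages would produce for one PDF, without running any tools.
ocr_output_names() {
    local pdf=$1
    local tif="${pdf}.tif"   # parallel expands {}.tif to <input>.tif
    local txt="${tif}.txt"   # tesseract appends .txt to its output base
    printf '%s\n%s\n' "$tif" "$txt"
}

ocr_output_names ./report.pdf
```

So a file named `report.pdf` ends up with its text in `report.pdf.tif.txt`, next to the intermediate `report.pdf.tif`.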
What exactly did you do to run each job on a separate core?
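On the question above: the script does not pin jobs to cores itself. `parallel -j N` keeps up to N jobs running at once and leaves their placement to the operating system's scheduler, which normally spreads busy processes across the available cores. A minimal sketch of the same fan-out, using `xargs -P` as a stand-in for GNU parallel so it runs without parallel installed; `run_jobs` is a hypothetical helper and `sleep` stands in for the real convert/tesseract work:

```shell
#!/bin/bash
# Stand-in demo: xargs -P plays the role of `parallel -j` here.
# run_jobs is a hypothetical helper, not part of the original script.
run_jobs() {
    local jobs=$1
    # Each input line becomes one `sleep 1` job; at most $jobs run at once.
    printf '%s\n' 1 1 1 1 | xargs -P "$jobs" -n 1 sleep
}

start=$SECONDS
run_jobs 4
echo "elapsed with 4 slots: $((SECONDS - start))s"
```

With four slots the four one-second jobs overlap, so the total should be close to one second instead of four. Calling `run_jobs 1` runs them one after another, which is the behavior the script would have with `num_cores` set to 1.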