Last active
May 11, 2022 16:57
gtfierro/8324883
Quick shell script for parallel OCR on PDFs using ghostscript and tesseract
#!/bin/bash
# requires ghostscript (http://www.ghostscript.com/)
# requires ImageMagick
# requires tesseract (https://code.google.com/p/tesseract-ocr/)
# requires GNU parallel (https://www.gnu.org/software/parallel/)
# all of these are typically available through yum/apt/brew/etc.
# number of cores over which the process will be parallelized
num_cores=$1
# converts the first 20 pages ([0-19]) of each PDF into a TIFF image so that tesseract can read it
find . -name '*.pdf' | parallel --gnu -j "$num_cores" convert -depth 8 -density 200 {}[0-19] {}.tif
# runs OCR on the resulting TIFF files and converts them to text. Assumes English, but you can supply
# extra arguments to tesseract
find . -name '*.tif' | parallel -j "$num_cores" tesseract -l eng {} {}
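The script expects the core count as its first argument. As a minimal sketch (an assumption, not part of the original script), a small guard could fall back to a sensible default when that argument is missing or not a positive integer. `resolve_jobs` is a hypothetical helper name; it uses `getconf _NPROCESSORS_ONLN` to ask for the number of online processors and falls back to 1 if that is unavailable.

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): choose a job count.
# Prints the argument if it is a positive integer without leading zeros;
# otherwise prints the number of online processors, or 1 if unknown.
resolve_jobs() {
    case ${1:-} in
        ''|*[!0-9]*|0*)
            # empty, non-numeric, or zero/leading zero: use the fallback
            getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
            ;;
        *)
            echo "$1"
            ;;
    esac
}

num_cores=$(resolve_jobs "${1:-}")
echo "using $num_cores parallel jobs"
```

With a guard like this, the value passed to `parallel -j` is always a number, whether or not the script was called with an argument.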
If you just want to run 1 proc per CPU core:
find . -name '*.pdf' | parallel convert -depth 8 -density 200 {} {.}.tif
find . -name '*.tif' | parallel tesseract -l eng {} {}
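Without `-j`, GNU parallel defaults to one job per CPU core. Note also that this variant names its output differently from the script: in GNU parallel, `{}` is the whole input line and `{.}` is the input with its last extension removed. The sketch below imitates that naming in plain shell so it can be run without parallel installed. `keep_ext` and `strip_ext` are hypothetical helper names, and `./docs/scan.pdf` is a made-up path.

```shell
#!/bin/sh
# Imitates GNU parallel's output naming in plain shell (no parallel needed).
# Assumes each path ends in an extension, as the find patterns guarantee.

# {}.tif : the whole input path with ".tif" appended
keep_ext() { printf '%s.tif\n' "$1"; }

# {.}.tif : the input path with its last extension removed, then ".tif"
strip_ext() { printf '%s.tif\n' "${1%.*}"; }

keep_ext ./docs/scan.pdf     # prints ./docs/scan.pdf.tif
strip_ext ./docs/scan.pdf    # prints ./docs/scan.tif
```

Because tesseract appends `.txt` to the output base it is given, passing `{} {}` for a file named `scan.tif` produces `scan.tif.txt`.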
What exactly did you do to run each job on a separate core?
Hey, I tried to run your script and got an error. Is there any way you can help me resolve it?
I have a PDF and this script in the same folder and ran $ bash run_ocr.sh 1
and got the following error:
"parallel: Error: Parsing of --jobs/-j/--max-procs/-P failed."
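That message is what GNU parallel prints when the value after `-j` is not something it can parse as a job count. One likely cause, offered as an assumption rather than a confirmed diagnosis, is that the variable placed after `-j` expands to nothing, so the next word (the command name) is taken as the value. The sketch below shows the shell side of this in isolation. `show_args` is a hypothetical helper, and `empty_var` and `job_count` are illustrative variable names.

```shell
#!/bin/sh
# Prints each argument it receives on its own line, in brackets.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Unquoted and empty: the variable vanishes, so "convert" directly follows -j.
empty_var=""
show_args -j $empty_var convert

# With a value, -j is followed by a number as intended.
job_count=2
show_args -j $job_count convert
```

The first call receives only two arguments (`-j` and `convert`), which is the situation in which a program reading `-j` would see a command name where it expects a number.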