@kristianrl
Last active July 10, 2024 09:38
Extract OCR contents in PDF-documents as plain text (.txt)
# Extract OCR contents in PDF-documents as plain text (.txt)
# Kristian Risager Larsen, 2024-07
#
# Setup:
# You need to install Ghostscript and Tesseract, e.g. with Homebrew:
# brew install tesseract tesseract-lang ghostscript
#
# Notes:
# The "-l dan" parameter tells Tesseract to expect Danish text
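for filename in *.pdf; do
    # Render every page of the PDF as a 300 dpi PNG: <name>001.png, <name>002.png, ...
    gs -dNOPAUSE -sDEVICE=pngalpha -r300 -sOutputFile="${filename%.pdf}%03d.png" "$filename" -c quit
done
for filename in *.png; do
    # Tesseract appends ".txt" to the output base itself, so pass the name without an extension
    tesseract "$filename" "${filename%.png}" -l dan
done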
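
The loops above leave one text file per page. If a single text file per PDF is more convenient, the page files can be concatenated afterwards. The following is only a sketch, not part of the original script: `merge_pages` is a hypothetical helper name, and it assumes the page files are named `<name>001.txt`, `<name>002.txt`, ... (three digits, matching the `%03d` pattern in the Ghostscript output name).

```shell
# Hypothetical helper: merge per-page OCR text files into one file per PDF.
# Assumption: pages are named <base>001.txt, <base>002.txt, ... (three digits).
merge_pages() {
    local base="$1"          # PDF file name without the .pdf extension
    local out="${base}.txt"  # combined output file
    local page
    : > "$out"               # create or truncate the combined file
    # The glob expands in sorted order, so pages are appended in sequence.
    for page in "${base}"[0-9][0-9][0-9].txt; do
        [ -e "$page" ] || continue   # no pages found: the glob stayed literal
        cat "$page" >> "$out"
    done
}

for filename in *.pdf; do
    [ -e "$filename" ] || continue   # skip when the directory has no PDFs
    merge_pages "${filename%.pdf}"
done
```

The combined file is written next to the page files as `<name>.txt`. PDFs with more than 999 pages would produce four-digit page numbers that this glob does not match.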