Kristian Risager Larsen kristianrl

kristianrl / pdf-ocr-to-txt.sh

Last active July 10, 2024 09:38

Extract OCR contents in PDF-documents as plain text (.txt)

	# Extract OCR contents in PDF-documents as plain text (.txt)
	# Kristian Risager Larsen, 2024-07
	#
	# Setup:
	# You need to install Ghostscript and Tesseract, e.g. with Homebrew:
	# brew install tesseract tesseract-lang ghostscript
	#
	# Notes:
	# The "-l dan" parameter tells Tesseract to expect Danish text;
	# use another language code (e.g. "-l eng" for English) for other languages
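	#
	# Optional sanity check for the setup above (a minimal sketch, not part
	# of the original script). missing_tools is a hypothetical helper that
	# prints every command from its arguments that is not found on PATH.
	# It assumes the Ghostscript binary is named "gs" and the Tesseract
	# binary is named "tesseract".
	missing_tools() {
		for tool in "$@"; do
			# command -v succeeds only if the command can be found
			command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
		done
	}
	# Example use:
	# [ -z "$(missing_tools gs tesseract)" ] || echo "Install the missing tools first" >&2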