@rwcitek
Last active November 17, 2024 22:33
OCR of a PDF with scanned text

Using Docker to OCR text from a scanned PDF

# Start a background Ubuntu container with the current directory mounted at /tmp/zfoo
docker container run --rm -d -v "${PWD}":/tmp/zfoo --name ocr ubuntu sleep inf

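# Install Tesseract, Ghostscript, and OCRmyPDF inside the container
docker container exec -i ocr /bin/bash << 'eof'
  export DEBIAN_FRONTEND=noninteractive   # suppress interactive apt prompts
  apt-get update
  apt-get install -y python3-pip vim less tree tesseract-ocr ghostscript
  # --break-system-packages lets pip install into the distro-managed Python
  pip3 install --break-system-packages ocrmypdf
eof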
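
Before running the OCR step, it can help to confirm that the three tools actually installed. A minimal sketch, assuming the container is still named ocr as above; the function name check_ocr_tools is hypothetical and not part of the original gist.

```shell
# Hypothetical helper: print the version of each tool inside the "ocr" container.
# Stops at the first tool that is missing or fails to run.
check_ocr_tools() {
  docker container exec ocr tesseract --version &&
  docker container exec ocr gs --version &&
  docker container exec ocr ocrmypdf --version
}
```

Calling check_ocr_tools returns a non-zero status if any of the three commands fails, which usually means the install step above did not finish.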

# OCR input.pdf from the mounted directory and write output.pdf alongside it
docker container exec -w /tmp/zfoo ocr \
  ocrmypdf input.pdf output.pdf
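
The single-file command above extends to a whole directory. A minimal sketch, assuming the ocr container is still running and the PDFs sit in the mounted current directory; the function name ocr_all, the RUNNER variable, and the .ocr.pdf output suffix are illustrative choices, not part of the original gist.

```shell
# Hypothetical helper: OCR every *.pdf in the current directory using the
# running "ocr" container, writing NAME.ocr.pdf next to NAME.pdf.
# RUNNER is an assumption added for illustration: set RUNNER=echo for a dry run.
ocr_all() {
  local runner="${RUNNER:-}"
  local f
  for f in *.pdf; do
    [ -e "$f" ] || continue                    # no PDFs: the glob stayed literal
    case "$f" in *.ocr.pdf) continue ;; esac   # skip outputs from earlier runs
    $runner docker container exec -w /tmp/zfoo ocr \
      ocrmypdf "$f" "${f%.pdf}.ocr.pdf"
  done
}
```

Running RUNNER=echo ocr_all first prints the commands without executing them. Because the container was started with --rm, docker container stop ocr also removes it once the work is done.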