@shentonfreude
Last active June 15, 2018 21:55

Tesseract notes

Output to stdout:

tesseract doc.png stdout

Output to stdout, restricting recognition to digits (digits is one of the config files shipped with Tesseract):

tesseract doc.png stdout digits

Output extracted text, as doc.txt:

tesseract doc.tif doc txt

Output PDF with searchable text, as doc.pdf:

tesseract doc.tif doc pdf

Output TSV with bounding boxes, confidence and text, as doc.tsv:

tesseract doc.tif doc tsv
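
The TSV is easy to filter with standard tools, for example to see which words Tesseract was unsure about. This is a minimal sketch, assuming the usual TSV layout: a header row, then 12 tab-separated columns with word rows at level 5, conf in column 11 and text in column 12. The function name low_conf_words and the default threshold of 60 are illustrative choices, not anything defined by Tesseract.

```shell
# List words whose confidence is below a threshold (default 60).
# Assumes the Tesseract TSV layout: level in column 1, conf in 11, text in 12.
# Usage: low_conf_words doc.tsv [threshold]
low_conf_words() {
    awk -F '\t' -v min="${2:-60}" '
        # Skip the header row; keep only word-level rows (level 5).
        NR > 1 && $1 == 5 && $11 + 0 < min + 0 { print $11 "\t" $12 }
    ' "$1"
}
```

Running low_conf_words doc.tsv 50 prints one line per suspect word, with its confidence first and the text second.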

Output an hOCR (HTML) file of hierarchical bounding boxes, with per-word confidence values, as doc.hocr:

tesseract doc.tif doc hocr

Output the image that Tesseract "sees" after its preprocessing and before OCR, as tessinput.tif. This is very useful for understanding why text is not being extracted well. The output format can be txt or any of the others:

tesseract -c tessedit_write_images=1 doc.tif doc txt

All of these outputs can be produced in a single run:

tesseract -c tessedit_write_images=1 doc.tif doc txt pdf tsv hocr
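
The same combined command extends naturally to a whole directory of scans. The sketch below is one way to do it, assuming the images end in .tif and that the outputs should sit next to each image. The function name ocr_dir is hypothetical. The tessedit_write_images option is left out here on the assumption that tessinput.tif is a fixed file name, so each run would overwrite the previous one.

```shell
# Run the combined txt/pdf/tsv/hocr command on every .tif in a directory.
# Usage: ocr_dir scans/
ocr_dir() {
    dir="${1:-.}"
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue      # the glob matched nothing
        base="${img%.tif}"             # scans/a.tif -> scans/a
        tesseract "$img" "$base" txt pdf tsv hocr
    done
}
```

Calling ocr_dir scans would leave scans/a.txt, scans/a.pdf, scans/a.tsv and scans/a.hocr beside scans/a.tif, and likewise for each other image.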