Tesseract notes

Output to stdout:

tesseract doc.png stdout

Output to stdout, recognizing digits only (digits is a config file that ships with Tesseract):

tesseract doc.png stdout digits

Output extracted text, as doc.txt:

tesseract doc.tif doc txt

Output PDF with searchable text, as doc.pdf:

tesseract doc.tif doc pdf

Output TSV with bounding boxes, confidence and text, as doc.tsv:

tesseract doc.tif doc tsv
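
The TSV is easy to post-process with standard tools. A minimal sketch for listing the words Tesseract was unsure about, assuming the usual column layout (a header row, confidence in column 11, text in column 12, and a confidence of -1 on rows that are not words). The function name low_conf_words and the default threshold of 60 are made up for illustration:

```shell
# List words whose confidence is below a threshold, one per line as "conf<TAB>text".
# Usage: low_conf_words [threshold] < doc.tsv
low_conf_words() {
    awk -F '\t' -v max="${1:-60}" '
        NR == 1      { next }                  # skip the header row
        $11 + 0 < 0  { next }                  # conf -1: page/block/line rows, not words
        $11 + 0 < max { print $11 "\t" $12 }   # low-confidence word
    '
}
```

For example, `low_conf_words 50 < doc.tsv` prints only the words scored below 50.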

Output an hOCR (HTML) file of hierarchical bounding boxes, with per-word confidence stored in the title attributes, as doc.hocr:

tesseract doc.tif doc hocr
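
A quick way to get an overall quality figure from the hOCR file is to pull out the word confidences. This is a rough sketch, assuming each word carries an `x_wconf NN` entry in its title attribute (which is how Tesseract's hOCR output normally looks) and that grep supports -o and -E. The function name hocr_mean_conf is made up for illustration:

```shell
# Summarize word confidences found in an hOCR file.
# Usage: hocr_mean_conf < doc.hocr
hocr_mean_conf() {
    grep -oE 'x_wconf [0-9.]+' |
        awk '{ sum += $2; n++ }
             END { if (n) printf "%d words, mean confidence %.1f\n", n, sum / n }'
}
```

It prints nothing when the input contains no words.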

Output the image that Tesseract "sees" after its preprocessing and before OCR, as tessinput.tif. This is very useful for understanding why text is not being extracted well. The output can be txt or any of the other formats:

tesseract -c tessedit_write_images=1 doc.tif doc txt

We can do all these at once:

tesseract -c tessedit_write_images=1 doc.tif doc txt pdf tsv hocr
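
To run the combined command over a whole directory of scans, a small loop is enough. A minimal sketch: the function name ocr_all and the TESSERACT override variable are made up for illustration, and the tessedit_write_images option is left out because every run writes the same tessinput.tif file:

```shell
# Run tesseract on every .tif in a directory (default: the current one),
# writing doc.txt, doc.pdf, doc.tsv and doc.hocr next to each doc.tif.
# TESSERACT is a hypothetical override; set TESSERACT=echo for a dry run.
ocr_all() {
    dir="${1:-.}"
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue     # the glob matched nothing
        base="${img%.tif}"            # scans/doc.tif -> scans/doc
        "${TESSERACT:-tesseract}" "$img" "$base" txt pdf tsv hocr
    done
}
```

For example, `TESSERACT=echo ocr_all scans` prints the commands that would run for each file in scans/ without running them.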
