Output to stdout:
tesseract doc.png stdout
Output to stdout, and assume only digits:
tesseract doc.png stdout digits
Output extracted text, as doc.txt:
tesseract doc.tif doc txt
Output PDF with searchable text, as doc.pdf:
tesseract doc.tif doc pdf
Output TSV with bounding boxes, confidence and text, as doc.tsv:
tesseract doc.tif doc tsv
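The TSV is easy to filter with standard tools. Here is a minimal sketch that lists the words Tesseract was least sure about. It assumes the usual 12-column layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text), where word rows have level 5 and non-word rows carry a conf of -1. The function name low_conf_words and the default threshold of 60 are made up for illustration, not part of Tesseract:

```shell
# Hypothetical helper: print "conf<TAB>text" for words below a confidence threshold.
# Usage: low_conf_words doc.tsv [threshold]
low_conf_words() {
    tsv_file=$1
    threshold=${2:-60}   # arbitrary default, tune to taste
    # Skip the header row (NR > 1), keep word-level rows (level 5),
    # ignore the -1 placeholder conf, and compare column 11 (conf).
    awk -F '\t' -v t="$threshold" \
        'NR > 1 && $1 == 5 && $11 >= 0 && $11 < t { print $11 "\t" $12 }' \
        "$tsv_file"
}
```

For example, low_conf_words doc.tsv 50 shows every word with confidence under 50, which is often a quick way to spot where the scan is causing trouble.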
Output hOCR, an HTML file of hierarchical bounding boxes annotated
with confidence, as doc.hocr:
tesseract doc.tif doc hocr
Output the image that Tesseract "sees" after its preprocessing, before
OCR, as tessinput.tif; this is very useful for understanding why it is
not extracting text well (the output can be txt or any of the other
formats):
tesseract -c tessedit_write_images=1 doc.tif doc txt
We can do all these at once:
tesseract -c tessedit_write_images=1 doc.tif doc txt pdf tsv hocr
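The same multi-format call extends naturally to several images. A rough sketch, assuming each output should sit next to its input with the extension stripped; ocr_all is a hypothetical helper name, and the debugging flag is left out to keep the loop simple:

```shell
# Hypothetical helper: run Tesseract on each image, writing all four formats.
# Usage: ocr_all scan1.tif scan2.tif ...
ocr_all() {
    for img in "$@"; do
        base=${img%.*}   # strip the extension: scan1.tif -> scan1
        # Report failures but keep going with the remaining images.
        tesseract "$img" "$base" txt pdf tsv hocr || echo "failed: $img" >&2
    done
}
```

Calling ocr_all *.tif would then produce a .txt, .pdf, .tsv and .hocr file for every TIFF in the current directory.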