@vehrka
Created January 28, 2015 07:24
Document OCR with ImageMagick and Tesseract
#!/bin/sh
# OCR a page range of a scanned PDF with ImageMagick (convert) and Tesseract.
STARTPAGE=1            # number of the first PDF page to convert
ENDPAGE=11             # number of the last PDF page to convert
SOURCE=source.pdf      # file name of the input PDF
OUTPUT=destination.txt # final output file; recognized text is appended to it
RESOLUTION=75          # density in DPI for rendering each page (the scanner's resolution; 75 can suffice for clean B/W scans)
touch "$OUTPUT"
for i in $(seq "$STARTPAGE" "$ENDPAGE"); do
    echo "extracting page $i"
    # ImageMagick numbers PDF pages from 0, hence the "- 1"
    convert -density "$RESOLUTION" "${SOURCE}[$((i - 1))]" -monochrome /tmp/page.tif
    echo "processing page $i"
    # Spanish language data, automatic page segmentation (Tesseract 4 and later spell this flag --psm)
    tesseract /tmp/page.tif /tmp/tempoutput -l spa -psm 3
    cat /tmp/tempoutput.txt >> "$OUTPUT"
done
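
The loop passes the page number minus one to convert because ImageMagick selects PDF pages with a zero-based index in square brackets. A minimal sketch of that mapping as a standalone helper follows; the function name page_spec is hypothetical and is not part of the original script.

```shell
#!/bin/sh
# page_spec FILE PAGE
# Print FILE followed by ImageMagick's zero-based page selector for the
# one-based PAGE number. Hypothetical helper, shown only for illustration.
page_spec() {
    printf '%s[%d]\n' "$1" "$(( $2 - 1 ))"
}

page_spec source.pdf 1    # prints source.pdf[0]
page_spec source.pdf 11   # prints source.pdf[10]
```

Quoting the result, as in convert -density 75 "$(page_spec source.pdf 1)" /tmp/page.tif, keeps the shell from treating the square brackets as a glob pattern.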