Created May 29, 2017 19:44
Convert PDF to text file using tesseract and imagemagick in cygwin
Required cygwin packages:
* tesseract-ocr
* ghostscript
* imagemagick
/usr/bin/convert.exe -density 400 input.pdf -depth 8 output.tiff
tesseract -l eng -psm 1 output.tiff output_textfile
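
The two commands above can be wrapped in a small function so the intermediate TIFF and the text file are named after the input PDF. This is only a sketch under a few assumptions: the function name pdf_to_text and the CONVERT and TESSERACT override variables are hypothetical and not part of the original note, and the tesseract build is assumed to accept the -psm spelling used above (Tesseract 4 and later spell it --psm).

```shell
# Sketch: run the rasterize and OCR steps for one PDF.
# pdf_to_text, CONVERT and TESSERACT are hypothetical names for illustration.
pdf_to_text() {
    pdf="$1"
    base="${pdf%.pdf}"    # strip the .pdf suffix: input.pdf -> input

    # Rasterize every page at 400 DPI, 8 bits per channel.
    # ImageMagick relies on ghostscript to read the PDF.
    "${CONVERT:-/usr/bin/convert.exe}" -density 400 "$pdf" -depth 8 "$base.tiff" || return 1

    # OCR the TIFF in English with automatic page segmentation (mode 1).
    # tesseract appends .txt to the output base name on its own.
    "${TESSERACT:-tesseract}" -l eng -psm 1 "$base.tiff" "$base" || return 1
}
```

Calling pdf_to_text input.pdf should leave input.tiff and input.txt next to the PDF. The full path to convert.exe is kept as the default so that ImageMagick is picked up rather than the unrelated Windows convert.exe utility.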