Created August 30, 2016 22:29
Dockerfile for doc2text
FROM ubuntu:16.04
WORKDIR /my/
# Update the package index and install in one layer, so the install never runs against a stale cached index.
RUN apt-get -qq -y update && apt-get -qq -y install \
    python python-pip tesseract-ocr python-pythonmagick libopencv-dev python-opencv
RUN pip install doc2text
# COPY is sufficient for plain local files; ADD is only needed for URLs or archive extraction.
COPY dtt.py /my/
COPY image.png /my/
CMD ["/usr/bin/python", "/my/dtt.py"]
dtt.py
import doc2text
# Initialize the class.
doc = doc2text.Document()
# Read the file in. Currently accepts pdf, png, jpg, bmp, tiff.
# If reading a PDF, doc2text will split the PDF into its component pages.
doc.read('/my/image.png')
# Crop the pages down to estimated text regions, deskew, and optimize for OCR.
doc.process()
# Extract text from the pages.
doc.extract_text()
text = doc.get_text()
# The parenthesized form works under both Python 2 (the interpreter in this image) and Python 3.
print(text)
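
The same four calls can be wrapped in a function so that the input path is not hard-coded. This is only a minimal sketch: ocr_file and its document_factory parameter are hypothetical names introduced here for illustration and are not part of doc2text. The only doc2text calls used are the ones already shown above (Document, read, process, extract_text, get_text).

```python
def ocr_file(path, document_factory=None):
    """Run read -> process -> extract_text -> get_text on a single file.

    document_factory is a hypothetical hook (not a doc2text feature) that
    lets the function be exercised without the OCR stack installed.
    """
    if document_factory is None:
        # Imported lazily so this module still loads where doc2text is absent.
        import doc2text
        document_factory = doc2text.Document
    doc = document_factory()
    doc.read(path)        # for a PDF, doc2text splits it into pages
    doc.process()         # crop to text regions, deskew, optimize for OCR
    doc.extract_text()
    return doc.get_text()
```

Inside the container, dtt.py could then end with print(ocr_file('/my/image.png')), or take the path from sys.argv so that a different image can be mounted and passed at run time.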