@fiorentinogiuseppe
Created January 17, 2020 18:33
Iterates over the images of a PDF, reading each one with pytesseract and converting the image to a `string`
import pytesseract


def get_ocr_documents(images):
    """
    Iterate over the PDF page images, reading each one with pytesseract
    and converting the image to a string.

    Parameters
    ----------
    images : iterable of PIL.Image.Image
        Images produced by converting the PDF.

    Returns
    -------
    str
        A single text containing the OCR output of all pages.
    """
    pages_text = []
    for image in images:
        # https://stackoverflow.com/questions/44619077/pytesseract-ocr-multiple-config-options
        # '--psm 4' tells Tesseract to assume a single column of text of variable sizes.
        pages_text.append(pytesseract.image_to_string(image, config='--psm 4', lang='eng'))
    return ''.join(pages_text)
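
The gist does not show where `images` comes from. The sketch below is one possible way to produce them and to keep the per-page text separate, assuming the third-party `pdf2image` package (and its Poppler dependency) is installed. The names `ocr_pages`, `ocr_pdf`, and the `reader` parameter are hypothetical and not part of the original gist; `reader` exists only so the OCR step can be swapped out.

```python
from typing import Callable, Iterable, List, Optional


def ocr_pages(images: Iterable, reader: Optional[Callable] = None,
              config: str = "--psm 4", lang: str = "eng") -> List[str]:
    """Return one OCR string per page image (hypothetical helper)."""
    if reader is None:
        # Imported lazily so the helper can be called with a custom reader
        # on machines where Tesseract is not installed.
        import pytesseract
        reader = pytesseract.image_to_string
    return [reader(image, config=config, lang=lang) for image in images]


def ocr_pdf(pdf_path: str) -> str:
    """Convert a PDF to page images and OCR them (assumes pdf2image is available)."""
    from pdf2image import convert_from_path  # assumption: not used in the gist
    images = convert_from_path(pdf_path)  # list of PIL.Image.Image, one per page
    return "".join(ocr_pages(images))
```

Calling `ocr_pdf("document.pdf")` (with a placeholder path) would return the same kind of concatenated string as `get_ocr_documents`, while `ocr_pages` keeps the pages as a list in case the caller needs page boundaries.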