# pdfTextMiner.py
# Python 2.7.6
# For Python 3.x use pdfminer3k module
# These links have useful information on components of the program
# https://euske.github.io/pdfminer/programming.html
# http://denis.papathanasiou.org/posts/2010.08.04.post.html
''' Important classes to remember
PDFParser - fetches data from pdf file
PDFDocument - stores data parsed by PDFParser
PDFPageInterpreter - processes page contents from PDFDocument
PDFDevice - translates processed information from PDFPageInterpreter to whatever you need
PDFResourceManager - Stores shared resources such as fonts or images used by both PDFPageInterpreter and PDFDevice
LAParams - Parameters for the layout analyzer, which returns an LTPage object for each page in the PDF document
PDFPageAggregator - A PDFDevice that collects each page's layout so the LT object elements can be read
'''
import os
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
# From PDFInterpreter import both PDFResourceManager and PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
# Import this to raise exception whenever text extraction from PDF is not allowed
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.converter import PDFPageAggregator
''' This is what we are trying to do:
1) Transfer information from PDF file to PDF document object. This is done using parser
2) Open the PDF file
3) Parse the file using PDFParser object
4) Assign the parsed content to PDFDocument object
5) Now the information in this PDFDocument object has to be processed. For this we need
   PDFPageInterpreter, PDFDevice and PDFResourceManager
6) Finally process the file page by page
'''
base_path = "C://some_folder"
my_file = os.path.join(base_path, "test_pdf.pdf")
log_file = os.path.join(base_path, "pdf_log.txt")
password = ""
extracted_text = ""
# Open and read the pdf file in binary mode
fp = open(my_file, "rb")
# Create parser object to parse the pdf content
parser = PDFParser(fp)
# Store the parsed content in PDFDocument object
document = PDFDocument(parser, password)
# Check if document is extractable, if not abort
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create PDFResourceManager object that stores shared resources such as fonts or images
rsrcmgr = PDFResourceManager()
# set parameters for analysis
laparams = LAParams()
# Create a PDFDevice object which translates interpreted information into desired format
# Device needs to be connected to resource manager to store shared resources
# device = PDFDevice(rsrcmgr)
# Use a page aggregator as the device to get LT object elements
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create interpreter object to process page content from PDFDocument
# Interpreter needs to be connected to resource manager for shared resources and device
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Ok now that we have everything to process a pdf document, lets process it page by page
for page in PDFPage.create_pages(document):
    # As the interpreter processes the page stored in PDFDocument object
    interpreter.process_page(page)
    # The device renders the layout from interpreter
    layout = device.get_result()
    # Out of the many LT objects within layout, we are interested in LTTextBox and LTTextLine
    for lt_obj in layout:
        if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine):
            extracted_text += lt_obj.get_text()
# close the pdf file
fp.close()
# print (extracted_text.encode("utf-8"))
# Python 2 accepts the encoded bytes in "w" mode; on Python 3 open the log with "wb" instead
with open(log_file, "w") as my_log:
    my_log.write(extracted_text.encode("utf-8"))
print("Done !!")
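For readers on Python 3, the same pipeline can be wrapped in a function. This is only a sketch, assuming the pdfminer.six package is installed (it keeps the `pdfminer.*` module names used above); the function name `extract_pdf_text` is made up for illustration and is not part of the original script.

```python
def extract_pdf_text(pdf_path, password=""):
    """Return the text of every LTTextBox/LTTextLine in the PDF at pdf_path."""
    # Imported inside the function so a missing pdfminer install
    # only fails when the function is actually called.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine
    from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    pieces = []
    # The context manager closes the file even if parsing raises.
    with open(pdf_path, "rb") as fp:
        document = PDFDocument(PDFParser(fp), password)
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed(pdf_path)
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            for lt_obj in device.get_result():
                if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                    pieces.append(lt_obj.get_text())
    # Joining once avoids repeated string concatenation in the loop.
    return "".join(pieces)
```

Calling `extract_pdf_text("test_pdf.pdf")` returns a `str`, which can then be written to a log file opened in text mode with `encoding="utf-8"`.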
Thanks - adding the W also worked for me. Neat code. thanks vinovator
Thanks, but when I tried to run it in Python it says "ImportError: No module named pdfparser". What's wrong with it?
@jeanpancho you might try that stackoverflow post
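One way to narrow down an ImportError like this is to check where Python actually finds `pdfminer` and its submodules. The snippet below is a rough diagnostic sketch, assuming Python 3.4 or later; the helper name `locate_module` is made up for illustration. A `None` result means the module cannot be found, and an unexpected path (for example a local script that happens to be called `pdfminer.py`) would point to a name clash rather than a missing install.

```python
import importlib.util


def locate_module(name):
    """Return the file a module would be loaded from, or None if it is not found."""
    try:
        spec = importlib.util.find_spec(name)
    except ImportError:
        # Raised when the parent package is missing or is not a package.
        return None
    return spec.origin if spec is not None else None


# Note: looking up a submodule imports its parent package as a side effect.
for name in ("pdfminer", "pdfminer.pdfparser", "pdfminer.pdfdocument"):
    print(name, "->", locate_module(name))
```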
Hey,
what does LTPage stand for?
Thanks!
Anyone else getting this error when trying to write to my_log?
TypeError: write() argument must be str, not bytes
Edit: As mentioned above, changing "w" for "wb" solved my problem.
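The TypeError comes from Python 3 keeping text and bytes apart: a file opened with "w" expects `str`, while `encode("utf-8")` returns `bytes`. Here is a small sketch of the two consistent ways to write the extracted text; the function names are made up for illustration.

```python
def save_as_text(path, text):
    # Text mode: pass a str and let open() do the UTF-8 encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def save_as_bytes(path, text):
    # Binary mode: encode first, then write the bytes (the "wb" fix).
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))
```

Either approach works. Mixing them (text mode with encoded bytes) is what triggers the error. Note that text mode may translate newlines on Windows, so the two files can differ in line endings.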
it is too old, now this can not work.
The same code works with pdfminer.six.
Thank you so much for this simple implementation. Do you know how we can get more features (location, font, size, etc.) of the text?
The LTChar class contains the location, font size and (internal) font name for each character. You may find the LTChar objects by iterating through the children of each container recursively. At least that's what I found to work best so far. See https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/layout.py#L228
Sample code for finding all characters with their locations and font information:
from pdfminer.layout import LTChar, LTContainer

def find_characters(container):
    """Returns list of dicts containing (char,box,fontname,fontsize)"""
    chars = []
    for child in container:
        if isinstance(child, LTChar):
            char = {
                'char': child.get_text(),
                'box': child.bbox,
                'fontname': child.fontname,
                'fontsize': child.size
            }
            chars.append(char)
        elif isinstance(child, LTContainer):
            # Recurse into text boxes, text lines and figures
            chars += find_characters(child)
    return chars
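Once you have that list of dicts, a quick way to see which fonts a page uses is to count characters per (font name, size) pair. This is only a sketch building on the dict keys above; `summarize_fonts` is a made-up helper name, and the sample values in the usage lines are invented for illustration.

```python
from collections import Counter


def summarize_fonts(chars):
    """Count non-whitespace characters per (fontname, rounded fontsize)."""
    counts = Counter()
    for ch in chars:
        if ch["char"].strip():  # skip spaces and newlines
            counts[(ch["fontname"], round(ch["fontsize"], 1))] += 1
    # Most frequent font/size combination first.
    return counts.most_common()


# Invented sample data in the same shape find_characters() returns.
sample = [
    {"char": "A", "box": (0, 0, 6, 12), "fontname": "Helvetica", "fontsize": 12.0},
    {"char": "b", "box": (6, 0, 12, 12), "fontname": "Helvetica", "fontsize": 12.0},
    {"char": " ", "box": (12, 0, 15, 12), "fontname": "Helvetica", "fontsize": 12.0},
    {"char": "c", "box": (15, 0, 20, 10), "fontname": "Times", "fontsize": 10.04},
]
print(summarize_fonts(sample))
```

The most common pair is usually the body text, so larger or rarer sizes can hint at headings.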
It worked on my Python 3.7.3, Windows 10 computer (no pdfminer3k needed), and it can even handle Chinese directly without any further extraction setup.
many thanks.
Still works as advertised for Win10 & python 3.9.7...as noted above, 'with open' needs to be changed from "w" to "wb"
Thanks a bunch, Vinovator
JT
Thanks. Small correction worked for me "wb" and not "w":
with open(log_file, "wb") as my_log:
In any case, the best short example I found. This PDFminer3k is parsing and reading PDF text that PyPDF2 was not able to read.