@cmin764
Created October 6, 2021 07:43
#!/usr/bin/env python3

import sys
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.utils import open_filename


def iter_text_per_page(pdf_file, password='', page_numbers=None, maxpages=0,
                       caching=True, codec='utf-8', laparams=None):
    """Parse a PDF file and yield its text one page at a time.

    :param pdf_file: Either a file path or a file-like object for the PDF file
        to be worked on.
    :param password: For encrypted PDFs, the password to decrypt.
    :param page_numbers: List of zero-indexed page numbers to extract.
    :param maxpages: The maximum number of pages to parse (0 means no limit).
    :param caching: If resources should be cached.
    :param codec: Text decoding codec.
    :param laparams: An LAParams object from pdfminer.layout. If None, uses
        some default settings that often work well.
    :return: a generator of (index, text) tuples, where index is a 1-based
        counter of the pages yielded and text is the text extracted from
        that page.
    """
    if laparams is None:
        laparams = LAParams()

    with open_filename(pdf_file, "rb") as fp:
        rsrcmgr = PDFResourceManager(caching=caching)
        idx = 1
        for page in PDFPage.get_pages(
            fp,
            page_numbers,
            maxpages=maxpages,
            password=password,
            caching=caching,
        ):
            # Use a fresh buffer per page so each yield holds one page only.
            with StringIO() as output_string:
                device = TextConverter(rsrcmgr, output_string, codec=codec,
                                       laparams=laparams)
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                interpreter.process_page(page)
                yield idx, output_string.getvalue()
            idx += 1

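
def main():
    if len(sys.argv) < 3:
        sys.exit(f"Usage: {sys.argv[0]} <pdf_file> <search_text>")
    pdf_file, search = sys.argv[1:3]
    for count, page_text in iter_text_per_page(pdf_file):
        # Only the first occurrence on each page is reported.
        idx = page_text.find(search)
        if idx != -1:
            # Show up to 128 characters of context on each side of the match.
            lo = max(0, idx - 128)
            hi = min(len(page_text), idx + 128)
            content = page_text[lo:hi]
            print(f"Found text at page {count}:\n{content}")
            #break


if __name__ == "__main__":
    main()
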
cmin764 commented Oct 6, 2021

Call it with: % ./pdf2text-pages.py NASDAQ_TSLA_2019.pdf "Form 10-K"

Output:

Found text at page 1:
Statement for the 2020 Annual Meeting of Stockholders are incorporated herein by reference in Part III of this Annual Report on Form 10-K to the extent stated herein. Suchproxy statement will be filed with the Securities and Exchange Commission within 120
Found text at page 3:
 Forward-Looking StatementsThe discussions in this Annual Report on Form 10-K contain forward-looking statements reflecting our current expectations that involve risks anduncertainties. These forw
Found text at page 17:
ees to be good.Available InformationWe file or furnish periodic reports and amendments thereto, including our Annual Reports on Form 10-K, our Quarterly Reports on Form 10-Q andCurrent Reports on Form 8-K, proxy statements and other information with the Se
Found text at page 32:
 amount of indebtedness (see Note 12, Debt, to theconsolidated financial statements included elsewhere in this Annual Report on Form 10-K). Our substantial consolidated indebtedness may increase ourvulnerability to any generally adverse economic and indust
Found text at page 34:
 in Note 16, Commitments and Contingencies, to the consolidated financialstatements included elsewhere in this Annual Report on Form 10-K. To our knowledge, no government agency in any such ongoing investigation has concludedthat any wrongdoing occurred. H
...
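
The script reports only the first match on each page, since `str.find` stops at the first hit. If every occurrence is wanted, the windowing logic from `main()` could be pulled into a small generator along these lines. This is a minimal sketch, not part of the original gist: the helper name `iter_snippets` and its `context` parameter are made up for illustration, and it works on plain strings, so it can be tried without pdfminer installed.

```python
def iter_snippets(page_text, search, context=128):
    """Yield (position, snippet) for every occurrence of search in page_text.

    Hypothetical helper: mirrors the lo/hi window used in main(), but keeps
    scanning after each match instead of stopping at the first one.
    """
    if not search:
        return  # an empty search string would match at every position
    start = 0
    while True:
        idx = page_text.find(search, start)
        if idx == -1:
            break
        lo = max(0, idx - context)
        hi = min(len(page_text), idx + context)
        yield idx, page_text[lo:hi]
        start = idx + len(search)  # resume after this match (non-overlapping)


if __name__ == "__main__":
    sample = "Annual Report on Form 10-K. See also Form 10-K, Part III."
    for pos, snippet in iter_snippets(sample, "Form 10-K", context=10):
        print(pos, repr(snippet))
```

Inside `main()`, the body of the loop would then presumably become a nested `for _, content in iter_snippets(page_text, search):` that prints each snippet with the page number.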