@rjurney
Created January 31, 2025 06:48
Extracting the text from the GraphFrames paper with PyPDF
from pypdf import PdfReader
# Load the PDF. The GraphFrames paper normally resides at
# https://people.eecs.berkeley.edu/~matei/papers/2016/grades_graphframes.pdf
reader = PdfReader("data/grades_graphframes.pdf")
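# Extract text from all pages, calling extract_text() once per page
# and skipping pages that yield no text
page_texts = (page.extract_text() for page in reader.pages)
text = "\n".join(page_text for page_text in page_texts if page_text)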
# Write it to a text file
with open("data/grades_graphframes.txt", "w", encoding="utf-8") as f:
    f.write(text)
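
The same steps can be wrapped in a reusable function. This is a minimal sketch, not part of the original gist: the names `join_page_texts` and `pdf_to_text` are hypothetical, and it assumes `pypdf` is installed and that writing UTF-8 output is acceptable. The joining logic is split out so it can be checked without a PDF on hand.

```python
from pathlib import Path
from typing import Iterable, Optional


def join_page_texts(page_texts: Iterable[Optional[str]], separator: str = "\n") -> str:
    """Join per-page text, skipping pages that yielded no text (hypothetical helper)."""
    return separator.join(t for t in page_texts if t)


def pdf_to_text(pdf_path: str, txt_path: str) -> int:
    """Extract text from pdf_path, write it to txt_path, and return the character count.

    Hypothetical wrapper around the gist's steps; assumes pypdf is installed.
    """
    # Imported here so join_page_texts stays usable without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() is called once per page; empty results are dropped.
    text = join_page_texts(page.extract_text() for page in reader.pages)

    out = Path(txt_path)
    # Create the output directory if it does not exist yet.
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(text, encoding="utf-8")
    return len(text)
```

With the paper saved locally, a call such as `pdf_to_text("data/grades_graphframes.pdf", "data/grades_graphframes.txt")` would reproduce the gist's output file.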