Created
January 31, 2025 06:48
Extracting the text from the GraphFrames paper with PyPDF
from pypdf import PdfReader

# Load the PDF. The GraphFrames paper is normally available at
# https://people.eecs.berkeley.edu/~matei/papers/2016/grades_graphframes.pdf
reader = PdfReader("data/grades_graphframes.pdf")

# Extract text from each page once, skipping pages that yield no text
page_texts = (page.extract_text() for page in reader.pages)
text = "\n".join(page_text for page_text in page_texts if page_text)

# Write it to a text file
with open("data/grades_graphframes.txt", "w", encoding="utf-8") as f:
    f.write(text)
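
The same extraction can be wrapped in reusable functions. This is a minimal sketch, not part of the original gist: the names `join_page_texts` and `extract_pdf_text` are hypothetical, and the only pypdf calls used are `PdfReader`, `reader.pages`, and `page.extract_text()`, as in the script above.

```python
from typing import Iterable, Optional


def join_page_texts(page_texts: Iterable[Optional[str]], separator: str = "\n") -> str:
    """Join per-page text, skipping pages that produced no text."""
    return separator.join(t for t in page_texts if t)


def extract_pdf_text(pdf_path: str) -> str:
    """Extract the text of every page of a PDF using pypdf."""
    # Imported here so join_page_texts can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return join_page_texts(page.extract_text() for page in reader.pages)
```

With this in place, `text = extract_pdf_text("data/grades_graphframes.pdf")` yields the same string as the script, and the joining logic can be checked on plain strings without opening a PDF.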