Last active
January 20, 2023 03:59
How to parse a .fdf file in Python. The intention was to export comments I'd made in a .pdf (using Adobe Reader DC's commenting capability) and make them available elsewhere to use as a basis for further notes.
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
# I exported the comments I had made in an Adobe Reader DC document to F:/temp/stn.fdf.
# I then wanted to access those comments outside of Adobe Reader. Here is how I extracted them.
fdf_file = open("F:/temp/stn.fdf", 'rb')
parser = PDFParser(fdf_file)
doc = PDFDocument(parser)
# Note on the next line of code:
# You might need to set a breakpoint here and inspect 'doc.catalog' to see which keys your file uses,
# then replace the ones below with yours. In my case they were 'FDF' and 'Annots', but other example
# code online for parsing FDF files uses different catalog keys. I don't know whether the keys vary
# between PDF readers, or even between versions of the same reader.
# The .fdf file used in this example came from exporting comments I made in a PDF with
# Adobe Reader DC version 2020.006.20042.
fields = resolve1(doc.catalog['FDF'])['Annots']
for i in fields:
    field = resolve1(i)
    # Note on the next line of code:
    # You might need to set a breakpoint here and inspect 'field' to see the names of the fields
    # you want to extract. In my case they were 'Page' and 'Contents', but other example code online
    # for parsing FDF files uses different field names. I don't know whether the names vary between
    # PDF readers, or even between versions of the same reader.
    # The .fdf file used in this example came from exporting comments I made in a PDF with
    # Adobe Reader DC version 2020.006.20042.
    page, b_contents = field.get('Page'), field.get('Contents')
    if b_contents is not None:
        try:
            contents = b_contents.decode()
        except UnicodeDecodeError:
            # Fall back to Latin-1, which accepts any byte, so 'contents' is always defined
            # and never silently carries over the value from the previous comment.
            contents = b_contents.decode('latin-1')
        print("The page number where the comment was made is " + str(page))
        print("The contents of your comment is " + contents)
fdf_file.close()
# I then wrote these values to a .xlsx file so that I could use them elsewhere.
# (Writing to .xlsx will follow in another gist in due course.)
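The decoding step above assumes every comment is valid UTF-8. PDF text strings can also be stored as UTF-16BE with a leading byte order mark, so a slightly more tolerant decoder may be useful. The following is only a sketch using the standard library: the names `decode_pdf_text` and `comments_to_csv` are hypothetical helpers (not part of pdfminer), the Latin-1 fallback is an approximation of PDFDocEncoding rather than an exact mapping, and CSV is used here as a stand-in for the .xlsx export, which is not shown in this gist.

```python
import csv
import io


def decode_pdf_text(raw):
    """Decode the raw bytes of a PDF text string into str (None stays None)."""
    if raw is None:
        return None
    if isinstance(raw, str):
        return raw
    if raw.startswith(b"\xfe\xff"):
        # UTF-16BE with a byte order mark: drop the two BOM bytes, then decode.
        return raw[2:].decode("utf-16-be", errors="replace")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: Latin-1 as a rough substitute for PDFDocEncoding.
        # It accepts every byte value, so this branch cannot fail.
        return raw.decode("latin-1")


def comments_to_csv(rows, out):
    """Write (page, raw_contents) pairs to a text stream as CSV.

    Pairs whose contents are None are skipped. Returns the number of
    comment rows written (the header row is not counted).
    """
    writer = csv.writer(out)
    writer.writerow(["page", "comment"])
    written = 0
    for page, raw in rows:
        text = decode_pdf_text(raw)
        if text is None:
            continue
        writer.writerow([page, text])
        written += 1
    return written


if __name__ == "__main__":
    # Made-up sample data standing in for the (page, Contents) pairs
    # collected in the loop above.
    sample = [(0, b"First note"), (2, b"\xfe\xff\x00H\x00i"), (3, None)]
    buffer = io.StringIO()
    comments_to_csv(sample, buffer)
    print(buffer.getvalue())
```

To use it with the script above, one option would be to collect `(page, b_contents)` tuples in a list inside the loop instead of printing them, then pass that list to `comments_to_csv` together with a file opened via `open("comments.csv", "w", newline="", encoding="utf-8")`.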