Wraps the command line utility pdfinfo to extract PDF meta information, which is returned in a dictionary.
- LICENSE and AUTHOR files added.
- This has now been modified to work with either Python 2 or Python 3.
- An example has been added, see example.py.
python ./example.py
{'Tagged': 'no', 'Producer': 'Mac OS X 10.12.1 Quartz PDFContext', 'Creator': 'Word', 'Encrypted': 'no', 'Author': 'Shekhar Vemuri', 'File size': '6264512 bytes', 'Optimized': 'no', 'PDF version': '1.3', 'ModDate': 'Thu Dec 8 11:42:16 2016 MST', 'Title': 'Guide to Apache Airflow', 'Page size': '612 x 792 pts (letter)', 'CreationDate': 'Thu Dec 8 11:42:16 2016 MST', 'Pages': '6'}
This script assumes that the pdfinfo command is available at /usr/bin/pdfinfo.
On Debian-like Linux systems, you can install it like this:
sudo apt-get install poppler-utils
The poppler package appears to be available on macOS via brew, so this script could be adapted to work on macOS as well, though there's almost certainly a better way of getting this info with a native Python PDF package.
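One way the hard-coded path could be relaxed is to look pdfinfo up on PATH first and only fall back to /usr/bin/pdfinfo. The sketch below is an assumption about how that might look, not code from this script: find_pdfinfo and DEFAULT_PDFINFO are hypothetical names, and shutil.which requires Python 3.3 or later, so a Python 2 setup would need a different lookup.

```python
import shutil

# Hypothetical names, not part of the original script.
DEFAULT_PDFINFO = "/usr/bin/pdfinfo"  # the location the script currently assumes


def find_pdfinfo(default=DEFAULT_PDFINFO):
    """Return a path to pdfinfo, preferring whatever is found on PATH."""
    # shutil.which returns None when the command is not on PATH,
    # in which case we fall back to the assumed default location.
    return shutil.which("pdfinfo") or default


print(find_pdfinfo())
```

The returned path is not checked for existence, so a caller would still need to handle the case where pdfinfo is not installed at all.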
This function parses the text output that looks like this:
Title: PUBLIC MEETING AGENDA
Author: Customer Support
Creator: Microsoft Word 2010
Producer: Microsoft Word 2010
CreationDate: Thu Dec 20 14:44:56 2012
ModDate: Thu Dec 20 14:44:56 2012
Tagged: yes
Pages: 2
Encrypted: no
Page size: 612 x 792 pts (letter)
File size: 104739 bytes
Optimized: no
PDF version: 1.5
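A minimal sketch of parsing output of this shape into a dictionary is shown here. It is an illustration rather than the script's actual implementation: parse_pdfinfo_output is a hypothetical name. The key point is splitting each line on the first colon only, so that values containing colons, such as the times in CreationDate and ModDate, are kept whole.

```python
def parse_pdfinfo_output(text):
    """Turn 'Key: value' lines, as printed by pdfinfo, into a dictionary."""
    info = {}
    for line in text.splitlines():
        # partition splits on the first colon only, so a value such as
        # "Thu Dec 20 14:44:56 2012" is not broken apart at its time field.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that do not have a "Key: value" shape
        info[key.strip()] = value.strip()
    return info


sample = """Title: PUBLIC MEETING AGENDA
CreationDate: Thu Dec 20 14:44:56 2012
Page size: 612 x 792 pts (letter)
Pages: 2
"""

info = parse_pdfinfo_output(sample)
print(info["CreationDate"])
```

All values stay as strings in this sketch, matching the dictionary printed by example.py above; converting Pages to an integer or the dates to datetime objects would be a separate step.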
_extract updated to handle the CreationDate and ModDate fields.