(this writeup supplements a computational journalism assignment)
Jeremy Singer-Vine has been working on pdfplumber, a new Python-native library for extracting text and spreadsheet-style tables from PDFs.
The California WARN Act PDFs can be found at:
http://www.edd.ca.gov/jobs_and_training/Layoff_Services_WARN.htm
These examples require the pdfplumber library, which you can install at your system's command-line like so:
pip install pdfplumber
import requests
url = 'http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf'
pdf_fname = 'CAWARN-eddwarncn12.pdf'
resp = requests.get(url)
with open(pdf_fname, 'wb') as f:
    f.write(resp.content)
Assuming that a file is saved at CAWARN-eddwarncn12.pdf, open it, play around with it, use type()
to figure out what each object is and how it corresponds to what you see in the PDF.
import pdfplumber
pdf_fname = 'CAWARN-eddwarncn12.pdf'
# open the pdf
pdf = pdfplumber.open(pdf_fname)
type(pdf)
# => pdfplumber.pdf.PDF
# get the first page
page = pdf.pages[0]
type(page)
# => pdfplumber.page.Page
The pdfplumber documentation has more info on the Page object's methods; the especially convenient one is extract_table(), which attempts to preserve the structure of tabular data.
For reference's sake, here's what the first page of the PDF we downloaded looks like:
Using extract_table() will return a list of lists:
table = page.extract_table()
table[0][0]
# 'Company Name'
table[1][0]
# 'AAR MOBILITY SYSTEMS'
table[0][2]
# 'Employees\nAffected'
table[1][2]
# '48'
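One thing to keep in mind: extract_table() returns every cell as a string, so numeric and date columns need converting before you can do arithmetic or sorting on them. Here's a small sketch using the standard library's datetime (the row is copied from the output above):

```python
from datetime import datetime

# a sample row as returned by extract_table() -- every cell is a string
row = ['AAR MOBILITY SYSTEMS', 'MCCLELLAN AFB', '48', '6/15/12']

employees = int(row[2])
layoff_date = datetime.strptime(row[3], '%m/%d/%y')
print(employees, layoff_date.year)  # 48 2012
```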
Here's the contents of the table variable (it's a list of lists):
[['Company Name', 'Location', 'Employees\nAffected', 'Layoff\nDate'],
['AAR MOBILITY SYSTEMS', 'MCCLELLAN AFB', '48', '6/15/12'],
['ABBOTT VASCULAR', 'MURRIETA', '45', '1/25/12'],
['ABBOTT VASCULAR', 'MURRIETA', '38', '10/17/12'],
['ABBOTT VASCULAR', 'TEMECULA', '247', '1/25/12'],
['ABBOTT VASCULAR', 'TEMECULA', '7', '1/25/12'],
['ABBOTT VASCULAR', 'TEMECULA', '139', '10/17/12'],
['ABBOTT VASCULAR', 'TEMECULA', '16', '10/17/12'],
['ABEO MANAGEMENT CORPORATION', 'LOS ANGELES', '42', '11/28/12'],
['ABERCROMBIE & FITCH', 'ANAHEIM', '51', '1/14/12'],
['ABERCROMBIE & FITCH', 'CAPITOLA', '51', '1/21/12'],
['ABERCROMBIE & FITCH', 'RIVERSIDE', '64', '1/14/12'],
['ABERCROMBIE & FITCH', 'SAN DIEGO', '66', '12/29/12'],
['ABERCROMBIE & FITCH', 'SIMI VALLEY', '70', '3/24/12'],
['ABERCROMBIE & FITCH', 'SIMI VALLEY', '47', '3/24/12'],
['ADAMS RITE MANUFACTURING \nCOMPANY', 'PONOMA', '110', '5/25/12'],
['ADOBE SYSTEMS INCORPORATED', 'SAN FRANCISCO', '121', '1/31/12'],
['ADOBE SYSTEMS INCORPORATED', 'SAN JOSE', '103', '1/31/12'],
['ADVANCED MICRO DEVICES, INC', 'SUNNYVALE', '107', '10/25/12']]
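Since the first inner list is the header row, it can be convenient to zip each data row against it to get dicts keyed by column name; a small sketch using two of the rows shown above:

```python
table = [
    ['Company Name', 'Location', 'Employees\nAffected', 'Layoff\nDate'],
    ['AAR MOBILITY SYSTEMS', 'MCCLELLAN AFB', '48', '6/15/12'],
    ['ABBOTT VASCULAR', 'MURRIETA', '45', '1/25/12'],
]

# pair each data row's cells with the header row's labels
header = table[0]
records = [dict(zip(header, row)) for row in table[1:]]
print(records[0]['Company Name'])  # AAR MOBILITY SYSTEMS
```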
To sum the Employees Affected column (the third column, index 2) while skipping the header row:
employees_affected = 0
for row in table[1:]:
    employees_affected += int(row[2])
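The same sum can be written as a one-liner with sum() and a generator expression; to keep this runnable on its own, the table here is just the first two data rows from the output above:

```python
table = [
    ['Company Name', 'Location', 'Employees\nAffected', 'Layoff\nDate'],
    ['AAR MOBILITY SYSTEMS', 'MCCLELLAN AFB', '48', '6/15/12'],
    ['ABBOTT VASCULAR', 'MURRIETA', '45', '1/25/12'],
]

# skip the header row, convert the count column, and add it all up
employees_affected = sum(int(row[2]) for row in table[1:])
print(employees_affected)  # 93
```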
Warning: the following snippet downloads the PDFs straight into whatever directory you're currently in:
import requests
from os.path import basename
urls = [
'http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf',
'http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn13.pdf',
'http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn14.pdf',
'http://www.edd.ca.gov/jobs_and_training/warn/WARN_Interim_041614_to_063014.pdf',
'http://www.edd.ca.gov/jobs_and_training/warn/WARNReportfor7-1-2014to06-30-2015.pdf',
'http://www.edd.ca.gov/jobs_and_training/warn/WARN-Report-for-7-1-2015-to-03-25-2016.pdf'
]
for url in urls:
    pdf_fname = 'CAWARN-' + basename(url)
    print("Downloading", url, 'into', pdf_fname)
    resp = requests.get(url)
    with open(pdf_fname, 'wb') as f:
        f.write(resp.content)
Use the glob() function (from the glob module) to get a wildcard-matched list of filenames:
from glob import glob
pdf_filenames = glob('CAWARN-*.pdf')
for pdf_fname in pdf_filenames:
    print("This is a filename of a pdf:", pdf_fname)
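One caveat: glob() doesn't guarantee any particular ordering, so if you want the files processed in a predictable (here, chronological) order, wrap the call in sorted(). A self-contained demonstration against a throwaway directory of fake filenames:

```python
import os
import tempfile
from glob import glob

# make a throwaway directory with a few fake filenames to glob against
tmpdir = tempfile.mkdtemp()
for name in ['CAWARN-eddwarncn13.pdf', 'CAWARN-eddwarncn12.pdf', 'notes.txt']:
    open(os.path.join(tmpdir, name), 'w').close()

# glob() makes no ordering promise, so wrap it in sorted() for a stable order
pdf_filenames = sorted(glob(os.path.join(tmpdir, 'CAWARN-*.pdf')))
print([os.path.basename(p) for p in pdf_filenames])
# ['CAWARN-eddwarncn12.pdf', 'CAWARN-eddwarncn13.pdf']
```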
If you wanted to sum up the Employees Affected column on the first page of this particular PDF, here's one way to do it (though this approach won't work for all of the PDFs, unfortunately):
import pdfplumber
pdf_fname = 'CAWARN-eddwarncn12.pdf'
pdf = pdfplumber.open(pdf_fname)
page = pdf.pages[0]
table = page.extract_table()
employees_affected = 0
for row in table[1:]:
    employees_affected += int(row[2])
print(employees_affected)
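In practice some extracted cells can be blank or otherwise non-numeric (merged cells, stray footnote text), and int() raises a ValueError on them, which is one reason the loop above won't work on every PDF. A defensive sketch; the blank cell here is made up to illustrate:

```python
table = [
    ['Company Name', 'Location', 'Employees\nAffected', 'Layoff\nDate'],
    ['AAR MOBILITY SYSTEMS', 'MCCLELLAN AFB', '48', '6/15/12'],
    ['SOME COMPANY', 'SOMEWHERE', '', '1/1/12'],  # hypothetical blank cell
]

employees_affected = 0
for row in table[1:]:
    cell = (row[2] or '').strip()
    if cell.isdigit():  # skip blanks and anything non-numeric
        employees_affected += int(cell)
print(employees_affected)  # 48
```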
To write that first page's table out to a CSV file:
import csv
import pdfplumber
pdf_fname = 'CAWARN-eddwarncn12.pdf'
outfile = open('CAWARN-one-page.csv', 'w', newline='')
outcsv = csv.writer(outfile)
pdf = pdfplumber.open(pdf_fname)
page = pdf.pages[0]
table = page.extract_table()
for row in table[1:]:  # note how I'm still skipping the header
    outcsv.writerow(row)
outfile.close()
To do the same for every page of the PDF:
import csv
import pdfplumber
pdf_fname = 'CAWARN-eddwarncn12.pdf'
outfile = open('CAWARN-all-pages.csv', 'w', newline='')
outcsv = csv.writer(outfile)
pdf = pdfplumber.open(pdf_fname)
for page in pdf.pages:
    table = page.extract_table()
    for row in table[1:]:  # note how I'm still skipping the header
        outcsv.writerow(row)
outfile.close()
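If you want the header row in the output CSV, write it once from the first page and keep skipping it on every page after that. A self-contained sketch where made-up nested lists stand in for the per-page extract_table() results:

```python
import csv

# stand-ins for what page.extract_table() returns on two consecutive pages
page_tables = [
    [['Company Name', 'Location', 'Employees\nAffected', 'Layoff\nDate'],
     ['AAR MOBILITY SYSTEMS', 'MCCLELLAN AFB', '48', '6/15/12']],
    [['Company Name', 'Location', 'Employees\nAffected', 'Layoff\nDate'],
     ['ABBOTT VASCULAR', 'MURRIETA', '45', '1/25/12']],
]

with open('CAWARN-with-header.csv', 'w', newline='') as outfile:
    outcsv = csv.writer(outfile)
    outcsv.writerow(page_tables[0][0])  # the header, written exactly once
    for table in page_tables:
        for row in table[1:]:           # skip each page's own header copy
            outcsv.writerow(row)
```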