(this writeup supplements a computational journalism assignment)
Jeremy Singer-Vine has been working on pdfplumber, a new Python-native library for extracting text and spreadsheet-style tables from PDFs.
The California WARN Act PDFs can be found at:
http://www.edd.ca.gov/jobs_and_training/Layoff_Services_WARN.htm
These examples require the pdfplumber library, which you can install at your system's command-line like so:
pip install pdfplumber
import requests
url = 'http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf'
pdf_fname = 'CAWARN-eddwarncn12.pdf'
resp = requests.get(url)
with open(pdf_fname, 'wb') as f:
    f.write(resp.content)
Assuming that a file is saved at CAWARN-eddwarncn12.pdf, open it, play around with it, use type()
to figure out what each object is and how it corresponds to what you see in the PDF.
import pdfplumber
pdf_fname = 'CAWARN-eddwarncn12.pdf'
# open the pdf
pdf = pdfplumber.open(pdf_fname)
type(pdf)
# => pdfplumber.pdf.PDF
# get the first page
page = pdf.pages[0]
type(page)
# => pdfplumber.page.Page
The pdfplumber documentation has more info on the Page object's methods; the especially convenient one is extract_table(), which attempts to preserve the structure of tabular data.
For reference's sake, here's what the first page of the PDF we downloaded looks like:
Using extract_table() will return a list of lists:
table = page.extract_table()
table[0][0]
# 'Company Name'
table[1][0]
# 'AAR MOBILITY SYSTEMS'
table[0][2]
# 'Employees\nAffected'
table[1][2]
# '48'
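One thing to keep in mind: extract_table() returns every cell as a string, so numeric and date columns need converting before you can do arithmetic or sorting on them. Here's a small sketch using the standard library's datetime (the row is copied from the output above):

```python
from datetime import datetime

# a sample row as returned by extract_table() -- every cell is a string
row = ['AAR MOBILITY SYSTEMS', 'MCCLELLAN AFB', '48', '6/15/12']

employees = int(row[2])
layoff_date = datetime.strptime(row[3], '%m/%d/%y')
print(employees, layoff_date.year)  # 48 2012
```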
Here's the contents of the table variable (it's a list of lists):
[['Company Name', 'Location', 'Employees\nAffected', 'Layoff\nDate'],
['AAR MOBILITY SYSTEMS', 'MCCLELLAN AFB', '48', '6/15/12'],
['ABBOTT VASCULAR', 'MURRIETA', '45', '1/25/12'],
['ABBOTT VASCULAR', 'MURRIETA', '38', '10/17/12'],
['ABBOTT VASCULAR', 'TEMECULA', '247', '1/25/12'],
['ABBOTT VASCULAR', 'TEMECULA', '7', '1/25/12'],
['ABBOTT VASCULAR', 'TEMECULA', '139', '10/17/12'],
['ABBOTT VASCULAR', 'TEMECULA', '16', '10/17/12'],
['ABEO MANAGEMENT CORPORATION', 'LOS ANGELES', '42', '11/28/12'],
['ABERCROMBIE & FITCH', 'ANAHEIM', '51', '1/14/12'],
['ABERCROMBIE & FITCH', 'CAPITOLA', '51', '1/21/12'],
['ABERCROMBIE & FITCH', 'RIVERSIDE', '64', '1/14/12'],
['ABERCROMBIE & FITCH', 'SAN DIEGO', '66', '12/29/12'],
['ABERCROMBIE & FITCH', 'SIMI VALLEY', '70', '3/24/12'],
['ABERCROMBIE & FITCH', 'SIMI VALLEY', '47', '3/24/12'],
['ADAMS RITE MANUFACTURING \nCOMPANY', 'PONOMA', '110', '5/25/12'],
['ADOBE SYSTEMS INCORPORATED', 'SAN FRANCISCO', '121', '1/31/12'],
['ADOBE SYSTEMS INCORPORATED', 'SAN JOSE', '103', '1/31/12'],
['ADVANCED MICRO DEVICES, INC', 'SUNNYVALE', '107', '10/25/12']]
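Since the first inner list is the header row, it can be convenient to zip each data row against it to get dicts keyed by column name; a small sketch using two of the rows shown above:

```python
table = [
    ['Company Name', 'Location', 'Employees\nAffected', 'Layoff\nDate'],
    ['AAR MOBILITY SYSTEMS', 'MCCLELLAN AFB', '48', '6/15/12'],
    ['ABBOTT VASCULAR', 'MURRIETA', '45', '1/25/12'],
]

# pair each data row's cells with the header row's labels
header = table[0]
records = [dict(zip(header, row)) for row in table[1:]]
print(records[0]['Company Name'])  # AAR MOBILITY SYSTEMS
```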
To sum the Employees Affected column (the third column, index 2) while skipping the header row:
employees_affected = 0
for row in table[1:]:
    employees_affected += int(row[2])
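The same sum can be written as a one-liner with sum() and a generator expression; to keep this runnable on its own, the table here is just the first two data rows from the output above:

```python
table = [
    ['Company Name', 'Location', 'Employees\nAffected', 'Layoff\nDate'],
    ['AAR MOBILITY SYSTEMS', 'MCCLELLAN AFB', '48', '6/15/12'],
    ['ABBOTT VASCULAR', 'MURRIETA', '45', '1/25/12'],
]

# skip the header row, convert the count column, and add it all up
employees_affected = sum(int(row[2]) for row in table[1:])
print(employees_affected)  # 93
```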
Warning: the following snippet downloads the PDFs straight into whatever directory you're currently in:
import requests
from os.path import basename
urls = [
'http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf',
'http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn13.pdf',
'http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn14.pdf',
'http://www.edd.ca.gov/jobs_and_training/warn/WARN_Interim_041614_to_063014.pdf',
'http://www.edd.ca.gov/jobs_and_training/warn/WARNReportfor7-1-2014to06-30-2015.pdf',
'http://www.edd.ca.gov/jobs_and_training/warn/WARN-Report-for-7-1-2015-to-03-25-2016.pdf'
]
for url in urls:
    pdf_fname = 'CAWARN-' + basename(url)
    print("Downloading", url, 'into', pdf_fname)
    resp = requests.get(url)
    with open(pdf_fname, 'wb') as f:
        f.write(resp.content)
Use the glob() function (from the glob module) to get a wildcard-matched list of filenames:
from glob import glob
pdf_filenames = glob('CAWARN-*.pdf')
for pdf_fname in pdf_filenames:
    print("This is a filename of a pdf:", pdf_fname)
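One caveat: glob() doesn't guarantee any particular ordering, so if you want the files processed in a predictable (here, chronological) order, wrap the call in sorted(). A self-contained demonstration against a throwaway directory of fake filenames:

```python
import os
import tempfile
from glob import glob

# make a throwaway directory with a few fake filenames to glob against
tmpdir = tempfile.mkdtemp()
for name in ['CAWARN-eddwarncn13.pdf', 'CAWARN-eddwarncn12.pdf', 'notes.txt']:
    open(os.path.join(tmpdir, name), 'w').close()

# glob() makes no ordering promise, so wrap it in sorted() for a stable order
pdf_filenames = sorted(glob(os.path.join(tmpdir, 'CAWARN-*.pdf')))
print([os.path.basename(p) for p in pdf_filenames])
# ['CAWARN-eddwarncn12.pdf', 'CAWARN-eddwarncn13.pdf']
```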
If you wanted to sum up the Employees Affected column on the first page of this particular PDF, here's one way to do it (though this approach won't work for all of the PDFs, unfortunately):
import pdfplumber
pdf_fname = 'CAWARN-eddwarncn12.pdf'
pdf = pdfplumber.open(pdf_fname)
page = pdf.pages[0]
table = page.extract_table()
employees_affected = 0
for row in table[1:]:
    employees_affected += int(row[2])
print(employees_affected)
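In practice some extracted cells can be blank or otherwise non-numeric (merged cells, stray footnote text), and int() raises a ValueError on them, which is one reason the loop above won't work on every PDF. A defensive sketch; the blank cell here is made up to illustrate:

```python
table = [
    ['Company Name', 'Location', 'Employees\nAffected', 'Layoff\nDate'],
    ['AAR MOBILITY SYSTEMS', 'MCCLELLAN AFB', '48', '6/15/12'],
    ['SOME COMPANY', 'SOMEWHERE', '', '1/1/12'],  # hypothetical blank cell
]

employees_affected = 0
for row in table[1:]:
    cell = (row[2] or '').strip()
    if cell.isdigit():  # skip blanks and anything non-numeric
        employees_affected += int(cell)
print(employees_affected)  # 48
```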
To write that first page's table out to a CSV file:
import csv
import pdfplumber
pdf_fname = 'CAWARN-eddwarncn12.pdf'
outfile = open('CAWARN-one-page.csv', 'w', newline='')
outcsv = csv.writer(outfile)
pdf = pdfplumber.open(pdf_fname)
page = pdf.pages[0]
table = page.extract_table()
for row in table[1:]:  # note how I'm still skipping the header
    outcsv.writerow(row)
outfile.close()
To do the same for every page of the PDF:
import csv
import pdfplumber
pdf_fname = 'CAWARN-eddwarncn12.pdf'
outfile = open('CAWARN-all-pages.csv', 'w', newline='')
outcsv = csv.writer(outfile)
pdf = pdfplumber.open(pdf_fname)
for page in pdf.pages:
    table = page.extract_table()
    for row in table[1:]:  # note how I'm still skipping the header
        outcsv.writerow(row)
outfile.close()
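If you want the header row in the output CSV, write it once from the first page and keep skipping it on every page after that. A self-contained sketch where made-up nested lists stand in for the per-page extract_table() results:

```python
import csv

# stand-ins for what page.extract_table() returns on two consecutive pages
page_tables = [
    [['Company Name', 'Location', 'Employees\nAffected', 'Layoff\nDate'],
     ['AAR MOBILITY SYSTEMS', 'MCCLELLAN AFB', '48', '6/15/12']],
    [['Company Name', 'Location', 'Employees\nAffected', 'Layoff\nDate'],
     ['ABBOTT VASCULAR', 'MURRIETA', '45', '1/25/12']],
]

with open('CAWARN-with-header.csv', 'w', newline='') as outfile:
    outcsv = csv.writer(outfile)
    outcsv.writerow(page_tables[0][0])  # the header, written exactly once
    for table in page_tables:
        for row in table[1:]:           # skip each page's own header copy
            outcsv.writerow(row)
```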