Who

Who is working together?

Name: Paul Schreiber
Email: [email protected]
Twitter: @paulschreiber
Organization: self

Challenge

Which challenge are you working on?

  • Amnesty International Annual Reports – Torture Incident Database
  • Comprehensive Annual Financial Reports
  • Federal Communications Commission Daily Releases
  • House of Representatives Financial Disclosures (OpenSecrets.org)
  • IRS Form 990 – Not-for-Profit Organization Reports
  • New York City Council and Community Board Documents
  • New York City Economic Development Commission Monthly Snapshot
  • New York City Environmental Impact Statements
  • US Foreign Aid Reports (USAID)
  • Other: List/Describe here

PDF Samples

How would you categorize the PDFs?

Sample documents

PDF URL: http://www.domain.org/docs/docurl.pdf
Document Title: Report of Economic Data 2012

Content category

  • Disclosure (filing, forms, report, ...)
  • Legislative doc (laws, analysis, ...)
  • Financial (statements, reports)
  • Government statistical data
  • Non-Government statistical data
  • Press (press releases, statements, ...)
  • Government reports
  • Non-Government reports
  • Directory
  • Other:

Number of pages

  • 1 page
  • 2 to 9 pages
  • 10+ pages
  • 100+ pages

Other observations

  • Collection includes PDFs made from scanned documents
  • PDFs include hand-written text

PDF Generation

  • Human authored
  • Machine generated
  • God only knows

Type of data embedded in PDF

  • Simple table of data
  • Complex table of data
  • Multiple tables of data from document
  • Table data greater than one page in length
  • Highly-structured form data
  • Loosely-structured form data
  • Has human-written text
  • Structured text of a report (e.g., headings, subheadings, ...)
  • Other:

Desired output of data

  • CSV
  • JSON
  • text version (e.g., markdown)

Tool

What tool(s) are you using to extract the data?

  • Automator: we used it to extract raw text from the PDFs
  • Python: we used it to extract the torture-related data from the full text
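For reference, a minimal scripted stand-in for the extraction step, assuming PyObjC's PDFKit bindings are installed; the actual workflow used Automator's built-in Extract PDF Text action, not this code:

```python
# Sketch of the text-extraction step using PyObjC's PDFKit bindings
# (pip install pyobjc). A stand-in for the Automator workflow, not
# the workflow itself.
import sys
from Foundation import NSURL
from Quartz import PDFDocument

def pdf_to_text(path):
    url = NSURL.fileURLWithPath_(path)
    doc = PDFDocument.alloc().initWithURL_(url)
    if doc is None:
        raise ValueError("could not open %s as a PDF" % path)
    return doc.string()  # plain text of the whole document

if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path + ".txt", "w") as out:
            out.write(pdf_to_text(path))
```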

Notes

How

We used an OS X Automator workflow to extract the full text (the PDFs were several hundred pages each) and then used Python to extract the country names and the sections about torture.
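A rough sketch of the Python step, assuming country names appear on their own line as section headings; the country list and matching rule here are illustrative, not the actual script:

```python
# Illustrative sketch: pull out per-country passages mentioning torture
# from the extracted plain text. COUNTRIES is truncated for brevity.
import re

COUNTRIES = {"Afghanistan", "Albania", "Algeria"}  # ...full list in practice

def torture_sections(full_text):
    sections = {}
    country = None
    for line in full_text.splitlines():
        stripped = line.strip()
        if stripped in COUNTRIES:      # country names act as section headings
            country = stripped
        elif country and re.search(r"\btortur", stripped, re.IGNORECASE):
            sections.setdefault(country, []).append(stripped)
    return sections
```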

Improvements

I've only tried this with the machine-generated PDFs, not with the scanned-image (but OCR'd) PDFs; it will likely need tweaking for those. For the really old reports, automated parsing isn't feasible, and a human will need to read over and interpret the data.

Extracting rich text (RTF?) instead of plain text would make it easier to find headings and subheadings, rather than relying on heuristics around length and capitalization.
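The kind of heuristic meant here might look like this (thresholds are made up for illustration):

```python
# Illustrative heading heuristic over plain text; with RTF, style
# information would replace these guesses. Thresholds are arbitrary.
def looks_like_heading(line):
    line = line.strip()
    if not line or len(line) > 60:           # headings tend to be short
        return False
    if line[-1] in ".:;,":                   # body sentences end in punctuation
        return False
    return line.isupper() or line.istitle()  # ALL CAPS or Title Case
```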

Results quality

  • 99%+
  • 90%+
  • 80%+
  • 50% to 75%
  • less than 50%
  • utter crap

Speed

How fast is the data extracted?

  • < 10 seconds
  • < 30 seconds
  • < 1 minute
  • < 5 minutes
  • < 10 minutes
  • < 20 minutes
  • Other:

Notes

Less than 30 seconds per document, but there are many documents.

Code

  • Find non-scanned PDFs using Google
  • Select PDFs in the Finder
  • Run the Automator workflow
  • Run the Python script (it doesn't yet take a parameter for the input file, so I've been editing the script on each run; a sketch of a fix follows below)
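A sketch of the obvious fix, reading the input path from the command line instead of editing the script; the script name and the extraction hook are hypothetical stand-ins:

```python
# Hypothetical fix: take the input text file as a command-line argument
# instead of editing the script on each run.
import sys

def main():
    if len(sys.argv) != 2:
        sys.exit("usage: extract.py input.txt")
    with open(sys.argv[1]) as f:
        text = f.read()
    # ...run the existing extraction logic on `text` here...
    print(len(text), "characters read from", sys.argv[1])

if __name__ == "__main__":
    main()
```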