Challenge

Which challenge are you working on?

  • Amnesty International Annual Reports – Torture Incident Database
  • Comprehensive Annual Financial Reports
  • Federal Communications Commission Daily Releases
  • House of Representatives Financial Disclosures (OpenSecrets.org)
  • IRS Form 990 – Not-for-Profit Organization Reports
  • New York City Council and Community Board Documents
  • New York City Economic Development Commission Monthly Snapshot
  • New York City Environmental Impact Statements
  • US Foreign Aid Reports (USAID)
  • Other: List/Describe here

PDF Samples

How would you categorize the PDFs?

Sample documents

PDF URL | Document Title
http://www.domain.org/docs/docurl.pdf | Report of Economic Data 2012

Content category

  • Disclosure (filing, forms, report, ...)
  • Legislative doc (laws, analysis, ...)
  • Financial (statements, reports)
  • Government statistical data
  • Non-Government statistical data
  • Press (press releases, statements, ...)
  • Government reports
  • Non-Government reports
  • Directory
  • Other:

Number of pages

  • 1 page
  • 2 to 9 pages
  • 10+ pages
  • 100+ pages

Other observations

  • Collection includes PDFs made from scanned documents
  • PDFs include hand-written text

PDF Generation

  • Human authored
  • Machine generated
  • God only knows

Type of data embedded in PDF

  • Simple table of data
  • Complex table of data
  • Multiple tables of data from document
  • Table data greater than one page in length
  • Highly-structured form data
  • Loosely-structured form data
  • Has human-written text
  • Structured text of a report (e.g., headings, subheadings, ...)
  • Other:

Desired output of data

  • CSV
  • JSON
  • text version (e.g., markdown)

Tool

What tool(s) are you using to extract the data?

Tool | How we used it
DocHive | We forked it and used it as glue for a PDF processing toolchain

Notes

How

How did you extract the desired data that produced the best results?

Improvements

What would have to be changed/added to the tool or process to achieve success?

Results quality

  • 99%+
  • 90%+
  • 80%+
  • 50% to 75%
  • less than 50%
  • utter crap

Speed

How fast is the data extracted?

  • < 10 seconds
  • < 30 seconds
  • < 1 minute
  • < 5 minutes
  • < 10 minutes
  • < 20 minutes
  • Other:

Notes

Code

Please list the code, tips, and how-tos for your processing pipeline.
