1. Log into GitHub
2. Fork this Gist
3. Edit your version to share your team's activity
PDF Liberation Hackpad
IRC: https://webchat.freenode.net/ Channel: #sunlightlabs
GitHub Markdown-Cheatsheet
Who is working together?
Name | Email | Twitter | Organization
---|---|---|---
Paul Schreiber | [email protected] | @paulschreiber | self
Which challenge are you working on?
How would you categorize the PDFs?
PDF URL | Document Title |
---|---|
http://www.domain.org/docs/docurl.pdf | Report of Economic Data 2012 |
We used an OS X Automator workflow to extract the full text (the PDFs were several hundred pages long), then used Python to extract the country names and the sections about torture.
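As a rough illustration of the Python step (not the actual script), a sketch like the following could pull country names and torture-related sections out of the extracted plain text. It assumes country names appear as all-caps heading lines, which is a guess about the report layout:

```python
import re

# Sample extracted text in the assumed layout: all-caps country headings
# followed by body paragraphs.
text = """AFGHANISTAN
Overview of human rights conditions.
Reports of torture in detention facilities.
ALBANIA
General summary of the year."""

# Split on all-caps heading lines; the capture group keeps the headings.
chunks = re.split(r"^([A-Z][A-Z ]+)$", text, flags=re.MULTILINE)

# Pair each heading with the body text that follows it.
countries = {
    name.strip(): body.strip()
    for name, body in zip(chunks[1::2], chunks[2::2])
}

# Keep only the countries whose section mentions torture.
torture_sections = {
    country: body
    for country, body in countries.items()
    if "torture" in body.lower()
}

print(sorted(torture_sections))  # ['AFGHANISTAN']
```

A real report would need a tighter heading pattern (and handling for multi-word names like "SOUTH AFRICA", which this regex happens to allow via the space in the character class).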
I've only tried this with the machine-generated PDFs, not with the scanned-image (but OCRed) ones; it will likely need tweaking for those. For the really old reports, automated parsing isn't feasible, and a human will need to read over and interpret the data.
Extracting rich text (RTF?) instead of plain text would make it easier to find headings and subheadings, rather than relying on heuristics around line length and capitalization.
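The length-and-capitalization heuristic mentioned above might look something like this sketch (the thresholds are arbitrary guesses, not values from the actual workflow):

```python
def looks_like_heading(line: str) -> bool:
    """Guess whether a plain-text line is a heading.

    Heuristic sketch: headings are short, don't end in sentence
    punctuation, and have most words capitalized.
    """
    line = line.strip()
    if not line or len(line) > 60:
        return False
    if line.endswith((".", ",", ";", ":")):
        return False
    words = line.split()
    capitalized = sum(1 for w in words if w[0].isupper())
    return capitalized / len(words) >= 0.5

print(looks_like_heading("AFGHANISTAN"))                    # True
print(looks_like_heading("Torture and Other Abuses"))       # True
print(looks_like_heading("Reports of abuse were common."))  # False
```

With RTF (or any styled output), you could key off font size and bold/italic runs instead, which is far less fragile than guessing from the text alone.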