1. Log into GitHub
2. Fork this Gist
3. Edit your version to share your team's activity
PDF Liberation Hackpad
IRC: https://webchat.freenode.net/ Channel: #sunlightlabs
GitHub Markdown-Cheatsheet
Who is working together?
Name | Email | Twitter | Organization
---|---|---|---
Paul Schreiber | [email protected] | @paulschreiber | self
Which challenge are you working on?
How would you categorize the PDFs?
PDF URL | Document Title |
---|---|
http://www.domain.org/docs/docurl.pdf | Report of Economic Data 2012 |
We used an OS X Automator workflow to extract the full text (the PDFs were several hundred pages long), then used Python to extract the country names and the sections about torture.
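As a rough illustration of the Python step (not the actual script), a sketch like the following could pull country names and torture-related sections out of the extracted plain text. It assumes country names appear as all-caps heading lines, which is a guess about the report layout:

```python
import re

# Sample extracted text in the assumed layout: all-caps country headings
# followed by body paragraphs.
text = """AFGHANISTAN
Overview of human rights conditions.
Reports of torture in detention facilities.
ALBANIA
General summary of the year."""

# Split on all-caps heading lines; the capture group keeps the headings.
chunks = re.split(r"^([A-Z][A-Z ]+)$", text, flags=re.MULTILINE)

# Pair each heading with the body text that follows it.
countries = {
    name.strip(): body.strip()
    for name, body in zip(chunks[1::2], chunks[2::2])
}

# Keep only the countries whose section mentions torture.
torture_sections = {
    country: body
    for country, body in countries.items()
    if "torture" in body.lower()
}

print(sorted(torture_sections))  # ['AFGHANISTAN']
```

A real report would need a tighter heading pattern (and handling for multi-word names like "SOUTH AFRICA", which this regex happens to allow via the space in the character class).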
I've only tried this with the machine-generated PDFs, not with the scanned-image (but OCRed) ones; it will likely need tweaking for those. For the really old reports, automated parsing isn't feasible, and a human will need to read over and interpret the data.
Extracting rich text (RTF?) instead of plain text would make it easier to find headings and subheadings, rather than relying on heuristics around line length and capitalization.
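The length-and-capitalization heuristic mentioned above might look something like this sketch (the thresholds are arbitrary guesses, not values from the actual workflow):

```python
def looks_like_heading(line: str) -> bool:
    """Guess whether a plain-text line is a heading.

    Heuristic sketch: headings are short, don't end in sentence
    punctuation, and have most words capitalized.
    """
    line = line.strip()
    if not line or len(line) > 60:
        return False
    if line.endswith((".", ",", ";", ":")):
        return False
    words = line.split()
    capitalized = sum(1 for w in words if w[0].isupper())
    return capitalized / len(words) >= 0.5

print(looks_like_heading("AFGHANISTAN"))                    # True
print(looks_like_heading("Torture and Other Abuses"))       # True
print(looks_like_heading("Reports of abuse were common."))  # False
```

With RTF (or any styled output), you could key off font size and bold/italic runs instead, which is far less fragile than guessing from the text alone.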