1_pdfliberation_hackathon_activity.md

Three simple steps to using this gist

Fork this Gist
Edit your version to share your team's activity
There's no third step

Markdown-Cheatsheet

2_who.md

Who

Who is working together?

Name	Email	Twitter	Organization
Seamus Kraft	[email protected]	@seamuskraft	OpenGov Foundation
Ross Tsiomenko	[email protected]	n/a	OpenGov Foundation

4_pdfs.md

PDF Samples

How would you categorize the PDFs?

Sample documents

PDF URL	Document Title
http://www.domain.org/docs/docurl.pdf	Report of Economic Data 2012

Content category

Number of pages

1 page
2 to 9 pages
10+ pages
100+ pages

Other observations

Collection includes PDFs made from scanned documents
PDFs include hand-written text

PDF Generation

Human authored
Machine generated
God only knows

6_tools.md

Tool

What tool(s) are you using to extract the data?

Tool	How we used it
Tabula	We used it to manually select and extract a table of data

Notes

7_how.md

How

How did you extract the desired data to produce the best results?

Improvements

What would have to be changed/added to the tool or process to achieve success?

9_code.md

Code

Please list sample code, tips, and howtos of your processing pipeline.
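
As an illustration of the kind of entry that could go here, the following is a minimal sketch of a post-processing step. It assumes the extraction tool (for example Tabula, listed under Tool) has already exported the table as CSV text. The function name `clean_table` and the sample data are hypothetical placeholders, not part of any team's actual pipeline.

```python
import csv
import io


def clean_table(csv_text):
    """Parse CSV text, strip whitespace from every cell, and drop blank rows."""
    reader = csv.reader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        cells = [cell.strip() for cell in row]
        # Keep the row only if at least one cell has content.
        if any(cells):
            rows.append(cells)
    return rows


# Hypothetical sample standing in for a table exported from a PDF.
sample = "Year, GDP \n2011, 100 \n\n2012, 104 \n"
table = clean_table(sample)
```

A real pipeline would read the exported file from disk instead of an inline string, and would likely add checks specific to the documents in the collection.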
