1. Log into GitHub
2. Fork this Gist
3. Edit your version to share your team's activity
PDF Liberation Hackpad
IRC: https://webchat.freenode.net/ Channel: #sunlightlabs
GitHub Markdown-Cheatsheet
Who is working together?
Name | Email | Twitter | Organization |
---|---|---|---|
Anna Lukasiak | [email protected] | @adlukasiak | Open JC |
Which challenge are you working on?
How would you categorize the PDFs?
File Name | Text or Image | No. of Pages | Category |
---|---|---|---|
CY2011AnnualAudit.pdf | image | 304 | Audit |
CY2011AnnualDebtStatement.pdf | image | 17 | Annual Debt Statement |
CY2011AnnualFinancialStatement.pdf | text | 94 | Financial Statements |
CY2011Budget(Introduced).pdf | image | 66 | Budget |
CY2011BudgetProjections.pdf | image | 2 | Budget Summaries and Projections |
CY2011SurplusProjections.pdf | image | 1 | Budget Summaries and Projections |
CY2012AnnualAudit.pdf | image | 333 | Audit |
CY2012AnnualDebtStatement.pdf | image | 18 | Annual Debt Statement |
CY2012Budget(Adopted).pdf | image | 73 | Budget |
CY2012Budget(Introduced).pdf | image | 72 | Budget |
CY2012BudgetAmendmentIntroduced.pdf | image | 7 | Budget |
CY2013Budget(Adopted).pdf | image | 72 | Budget |
CY2013Budget(Introduced).pdf | image | 147 | Budget |
FY2006AnnualAudit.pdf | text | 223 | Audit |
FY2007AnnualAudit.pdf | text | 213 | Audit |
FY2007AnnualFinancialStatement.pdf | image | 82 | Financial Statements |
FY2007Budget(Adopted).pdf | image | 91 | Budget |
FY2008AnnualAudit.pdf | text | 250 | Audit |
FY2008Budget(Adopted).pdf | image | 75 | Budget |
FY2008Budget(Introduced).pdf | image | 73 | Budget |
FY2009AnnualAudit.pdf | image | 240 | Audit |
FY2009Budget(Adopted).pdf | image | 77 | Budget |
FY2010AnnualAudit.pdf | image | 236 | Audit |
TY2010AnnualAudit.pdf | text | 268 | Audit |
TY2010AnnualFinancialStatement.pdf | image | 91 | Financial Statements |
TY2010CorrectiveActionPlan.pdf | image | 14 | Audit |
FY2010AnnualFinancialStatement.pdf | image | 88 | Financial Statements |
FY2010Budget(Adopted).pdf | image | 75 | Budget |
FY2010Budget(Introduced).pdf | image | 75 | Budget |
FY2010TransitionYearBudget(Adopted).pdf | image | 65 | Budget |
AnnualFinancialStatement2012.pdf | image | 102 | Financial Statements |
AnnualFinancialStatement2009.pdf | image | 87 | Financial Statements |
AnnualFinancialStatement2008.pdf | image | 89 | Financial Statements |
correctiveactionplan2008.pdf | image | 24 | Audit |
correctiveactionplan2007.pdf | image | 29 | Audit |
correctiveactionplan2006.pdf | image | 32 | Audit |
CY2011BudgetIntroduced.pdf | image | 66 | Budget |
Text or Image | No. of Files | No. of Pages |
---|---|---|
text | 22 | 2379 |
image | 15 | 1492 |
total | 37 | 3871 |
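A tally like the one above can be produced from the per-file table with a few lines of Python. This is a minimal sketch, assuming the rows are already available as `(file name, type, pages)` tuples; the `summarize` helper is hypothetical, and only four rows from the table are included to keep the example short.

```python
from collections import defaultdict

# A few rows copied from the per-file table: (file name, text or image, pages).
ROWS = [
    ("CY2011AnnualAudit.pdf", "image", 304),
    ("CY2011AnnualFinancialStatement.pdf", "text", 94),
    ("CY2011BudgetProjections.pdf", "image", 2),
    ("FY2006AnnualAudit.pdf", "text", 223),
]


def summarize(rows):
    """Return {type: [file_count, page_count]}, plus a 'total' entry."""
    summary = defaultdict(lambda: [0, 0])
    for _name, kind, pages in rows:
        # Count each file once under its own type and once under 'total'.
        for key in (kind, "total"):
            summary[key][0] += 1
            summary[key][1] += pages
    return dict(summary)


if __name__ == "__main__":
    for kind, (files, pages) in summarize(ROWS).items():
        print(f"{kind} | {files} | {pages} |")
```

Running the same function over all 37 rows would regenerate the summary whenever a file's classification changes.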
What tool(s) are you using to extract the data?
Tool | How we used it |
---|---|
ABBYY Cloud OCR SDK | The Python script calls the ABBYY API for files that are not searchable |
Tabula | As a test, we used it to manually select and extract a table of data. With over 30 files, we are looking to automate this step. |
Python script | We are building a Python script to automate all steps for the 37 files |
The Python script will grab file names from the official Jersey City website, call the ABBYY and Tabula APIs, and scrape the results. To make the budget data useful for interactive visualization, it is ideal to create hierarchical JSON files. To do this, we need to extract the revenue and spending data, ignore the subtotals, and link each spending number to an account, program, division, and department.
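The hierarchical JSON step could look roughly like the sketch below. It is not the team's script. It assumes the extraction already yields flat rows with `department`, `division`, `program`, `account`, and `amount` fields, that subtotal lines can be recognized by their label, and that a `name`/`children` nesting suits the visualization. All field names, labels, and figures here are invented placeholders.

```python
import json

# Hypothetical flat rows, as they might look after extraction.
# Labels and amounts are placeholders, not real budget figures.
ROWS = [
    {"department": "Dept A", "division": "Div 1", "program": "Prog X",
     "account": "Salaries", "amount": 100.0},
    {"department": "Dept A", "division": "Div 1", "program": "Prog X",
     "account": "Materials", "amount": 50.0},
    {"department": "Dept A", "division": "Div 1", "program": "Prog X",
     "account": "Subtotal", "amount": 150.0},
    {"department": "Dept B", "division": "Div 2", "program": "Prog Y",
     "account": "Salaries", "amount": 200.0},
]

# Nesting order, from the top of the hierarchy down to the program.
LEVELS = ("department", "division", "program")


def is_subtotal(row):
    # Assumption: subtotal and total lines carry "total" in their label.
    return "total" in row["account"].lower()


def build_tree(rows, root_name="Budget"):
    """Nest flat rows as department > division > program > account."""
    root = {"name": root_name, "children": []}
    for row in rows:
        if is_subtotal(row):
            continue  # subtotals would double count in a visualization
        node = root
        for level in LEVELS:
            label = row[level]
            child = next(
                (c for c in node["children"] if c["name"] == label), None
            )
            if child is None:
                child = {"name": label, "children": []}
                node["children"].append(child)
            node = child
        # Leaf: the account with its amount.
        node["children"].append(
            {"name": row["account"], "amount": row["amount"]}
        )
    return root


if __name__ == "__main__":
    print(json.dumps(build_tree(ROWS), indent=2))
```

Dropping subtotals before nesting keeps every amount at a leaf, so a visualization can sum children to get program, division, and department totals.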
How did you extract the desired data that produced the best results?
`./downloads`
Beware, the process takes 30 mins!

`strings filename | grep Font`
to test if the file is searchable

What would have to be changed/added to the tool or process to achieve success?
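One possible addition, offered as a sketch rather than as what the team built: the `strings filename | grep Font` check could run inside the Python script, so that only files without font resources are sent to OCR. The function name `looks_searchable` is hypothetical. Like the shell command, this is only a heuristic: it looks for the byte string `Font` anywhere in the file and does not parse the PDF.

```python
from pathlib import Path


def looks_searchable(pdf_path):
    """Rough equivalent of `strings filename | grep Font`.

    Returns True if the raw bytes of the file mention 'Font', which
    suggests embedded text. Heuristic only: the PDF is not parsed.
    """
    data = Path(pdf_path).read_bytes()
    return b"Font" in data


if __name__ == "__main__":
    import sys

    # Usage: python check_searchable.py file1.pdf file2.pdf ...
    for name in sys.argv[1:]:
        kind = "text" if looks_searchable(name) else "image"
        print(f"{name} | {kind}")
```

Files reported as `image` would be the candidates for the OCR step.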
How fast is the data extracted?