1. Log into GitHub
2. Fork this Gist
3. Edit your version to share your team's activity
PDF Liberation Hackpad
IRC: https://webchat.freenode.net/ Channel: #sunlightlabs
GitHub Markdown-Cheatsheet
Which challenge are you working on?
How would you categorize the PDFs?
PDF URL | Document Title |
---|---|
http://www.domain.org/docs/docurl.pdf | Report of Economic Data 2012 |
How did you extract the desired data that produced the best results?
ABBYY provides the best results for tabular data. Tesseract and similar tools unfortunately do not come close (note that this applies to OCR, not text PDFs).
What would have to be changed/added to the tool or process to achieve success?
The process needs to account for the general document structure and the different Schedule tables (III, IV, V, etc.). Many more improvements remain; this is the result of only 5 hours of research into the PDFs.
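One possible first step toward handling the different Schedule tables is to split the OCR text output into one chunk per schedule before parsing each table. The snippet below is a minimal sketch of that idea, not the pipeline used here: the function name `split_by_schedule`, the heading pattern, and the sample text are all assumptions made for illustration, and real OCR output would likely need a more forgiving pattern.

```python
import re

# Assumed heading format: the word "Schedule" followed by a Roman numeral,
# at the start of a line (e.g. "SCHEDULE III - ..."). This is hypothetical;
# adjust it to whatever the OCR output actually contains.
SCHEDULE_HEADING = re.compile(
    r"^\s*SCHEDULE\s+([IVX]+)\b.*$", re.IGNORECASE | re.MULTILINE
)

def split_by_schedule(ocr_text):
    """Map each schedule's Roman numeral to the text found under its heading.

    If the same numeral appears more than once, the later section
    overwrites the earlier one.
    """
    sections = {}
    matches = list(SCHEDULE_HEADING.finditer(ocr_text))
    for i, match in enumerate(matches):
        start = match.end()
        # A section runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(ocr_text)
        sections[match.group(1).upper()] = ocr_text[start:end].strip()
    return sections

# Made-up sample text, for illustration only.
sample = """SCHEDULE III - ASSETS
Bank account  $1,001 - $15,000
SCHEDULE IV - TRANSACTIONS
Stock sale  $15,001 - $50,000
"""

sections = split_by_schedule(sample)
for numeral, body in sections.items():
    print(numeral, "->", body)
```

Each chunk could then be handed to a table parser written for that particular schedule's column layout.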
Please list code, tips and howto's of your processing pipeline.
https://github.com/rtsio/financial_disclosure_scraping/tree/master/ABBYY-working-example