[Ignore this gist, check out the GitHub repo] Testing AWS Textract's ability to correctly extract data tables from a difficult FBI stats report PDF
Update: I've since realized that this writeup would be far easier to do as its own GitHub repo, given the number of files involved. Please ignore this gist, which I'm keeping here as a backup, and instead visit: https://github.com/dannguyen/aws-textract-pdf-to-csv-demo
tl;dr: pretty good table structure overall, given the issues with the original PDF. However, there were inexplicable and critical data errors, as if Textract converted the PDF to an image, OCRed it, and then attempted to extract the data tables.
Amazon Textract was announced about 6 months ago but was made public today (May 29). If you have an AWS account, you can check out Textract's point-and-click demo, which allows you to upload an image or PDF for Textract to process with optical character recognition (OCR) and data table extraction. Obviously, the OCR ability is going to be a huge feature for many folks trying to extract data from scanned documents. However, Textract's capability in data table extraction is the focus of this gist.
The demo is at this URL: https://console.aws.amazon.com/textract/home?region=us-east-1#/demo
The Textract demo page conveniently features a sample document with a data table that – surprise, surprise – Textract manages to extract perfectly; here's a screenshot of what the interface looks like, including the preview of the tabular extraction:

(Note: I can't tell if the sample doc was processed as an image file (and thus required actual OCR functionality), or as a regular text PDF. And I was too lazy to poke around the Textract API docs to figure it out.)
Again, fair warning: for this gist, I'm mostly/only interested in seeing how good Textract is when it comes to extracting data tables from non-image PDFs. In other words, I don't really care about Textract's OCR capability for now, though you can read Mozilla Source's roundup of OCR options (which didn't include Textract at time of publication) to get a taste of how complex and painful the OCR problem is on its own. In other other words, we can assume that if data table extraction is hard, then data table extraction on the results of an OCRed document is an additional level of clusterfuck complexity; if Textract can just get data table extraction right, that will be a huge victory for data folks on its own.
PDF is a great format for when you need a digital document that – unlike the vast majority of webpages and spreadsheets – will essentially look the same to anyone else who opens it on any computer, and/or wants to print it out on paper. PDFs, like Word documents, can contain not just standard prose, but data tables – such as copy-pasting from an Excel spreadsheet into Word, and then saving as PDF.
But for various technical reasons, extracting the data table from a PDF is often not as easy as copy-pasting from PDF into Excel. In fact, the metadata of the data's layout and structure is usually destroyed and irrecoverable when data tables are saved as PDF documents. It's a hard enough problem that it's one of the only situations in which I've given up on trying to hack it myself and settled for a commercial software package that is not scriptable (i.e. I have to use it by manual point-and-click): $99 for ABBYY FineReader – which is still far from perfect but good enough, all things considered.
Note: If you are interested in more technical details about why extracting data tables from PDFs is so complicated, I highly recommend the following resources:
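To make the "destroyed layout metadata" point concrete, here is a toy sketch in Python. It assumes a PDF parser hands back nothing but text fragments with page coordinates, which is roughly the situation table extractors start from. The fragments, the `group_into_rows` helper, and the `y_tolerance` value are all made up for illustration; this is not how Textract or any particular library works.

```python
# Made-up text fragments as a PDF parser might report them: (x, y, text).
# Nothing here says which fragments share a row, a column, or a cell.
fragments = [
    (200.0, 700.2, "Count"),
    (50.0, 700.0, "Name"),
    (50.0, 685.1, "Alpha"),
    (200.0, 685.0, "10"),
    (200.0, 670.3, "20"),
    (50.0, 670.0, "Beta"),
]


def group_into_rows(fragments, y_tolerance=2.0):
    """Guess table rows by clustering fragments with nearly equal y values."""
    rows = []  # each entry: (anchor_y, [(x, text), ...])
    # PDF coordinates grow upward, so sort by descending y to read top-down.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each guessed row, order the cells left to right by x.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


for row in group_into_rows(fragments):
    print(row)
```

Even this tiny case depends on a hand-picked tolerance. Real reports add wrapped cell text, merged headers, and uneven spacing, which is where the guessing starts to break down.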
- Heart of Nerd Darkness: Why Updating Dollars for Docs Was So Difficult, by ProPublica's Jeremy Merrill, which is an excellent and detailed overview of the PDF problem and how it applies to a real-world data investigation.
- Introducing Tabula, by the authors of the Knight Mozilla-supported open source Tabula project, because PDF-to-CSV is really that huge of a data problem for journalists.
- Introduction to The Camelot Project, excellent documentation and writeup by another open source PDF-to-CSV extraction library and tool, because PDF-to-CSV is a massive problem for everyone who works with documents and data
- One of the many, many Hacker News discussions about how awesome it would be if someone could create a good PDF-to-CSV tool
In any case, rather than try Textract on a real-world-but-simple PDF, I decided to upload one of the most annoying government data-as-PDF examples I've seen: a report from the FBI's National Instant Criminal Background Check System, which Jeremy Singer-Vine uses as an example (complete with Jupyter notebook) to demonstrate the pdfplumber library he wrote (yes, yet another open source PDF-to-CSV project, because it really is such a painful and critical problem). You can read Singer-Vine's helpful Python notebook showing the code and the PDF. Or better yet, you can read Singer-Vine's separate writeup of the FBI data (one of the best web-scraping-for-journalism examples I've seen). Or even even better, read a story that Singer-Vine's BuzzFeed News colleague, Peter Aldhous, wrote based on the data: Under Trump, Gun Sales Did Not Spike After The Las Vegas Shooting.
Suffice to say, the background-checks.pdf supplied by pdfplumber is a very nice example of data stuck in a PDF – it has real-world importance and is actually, at first glance, very readable, but contains enough technical and epistemological issues to be a hard challenge for both humans and automated software when it comes to extracting the data.
So I uploaded the single-page example background-checks.pdf to Textract. After about a minute, I was able to download a zip which contained 4 files, 3 of which I've attached to this gist in their raw forms:
- apiResponse.json (3.6MB): contains the data behind Textract's decision at the granular per-word/character/line level, including the exact bounding boxes and confidence values for each text extraction.
- keyValues.csv (empty): not sure what this is supposed to be, but it was empty
- rawText.txt (5 KB): just the extracted results as a stream of unformatted plaintext
- tables.csv (9 KB): the tabular data that Textract found, in spreadsheet-ready format
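For anyone who wants to dig into apiResponse.json rather than trust tables.csv, here is a minimal sketch of rebuilding a table grid from the response. It assumes the downloaded file has the usual Textract shape: a top-level `Blocks` list in which `TABLE` blocks point to `CELL` blocks, and `CELL` blocks point to `WORD` blocks, via `CHILD` relationships. I haven't verified that the demo download matches this exactly. The `tables_from_blocks` helper and the `sample` data are made up for illustration.

```python
def tables_from_blocks(response):
    """Rebuild each TABLE block as rows of (text, lowest word confidence)."""
    blocks = {b["Id"]: b for b in response["Blocks"]}

    def child_ids(block):
        # Relationships may be missing (or null) on blocks without children.
        for rel in block.get("Relationships") or []:
            if rel["Type"] == "CHILD":
                yield from rel["Ids"]

    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = blocks[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [blocks[i] for i in child_ids(cell)
                     if blocks[i]["BlockType"] == "WORD"]
            text = " ".join(w["Text"] for w in words)
            conf = min((w["Confidence"] for w in words), default=None)
            # RowIndex and ColumnIndex are 1-based.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = (text, conf)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append([[cells.get((r, c), ("", None))
                        for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables


# Tiny made-up response, only to show the shape the function expects.
sample = {"Blocks": [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2},
    {"Id": "w1", "BlockType": "WORD", "Text": "State", "Confidence": 99.5},
]}

print(tables_from_blocks(sample))
```

With the real file, you would pass in the result of `json.load()` on apiResponse.json. Keeping the lowest word confidence per cell is just one way to flag cells worth checking by eye against the PDF.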
tables.csv is the thing we obviously care about. You can download it and open it in Excel yourself, but for your convenience, I've also uploaded it to Google Sheets. Or you can just read my quick summations and screenshots below:
Here's a screenshot of the top half of the table, and all of its columns, to show that Textract did quite well in not only getting the right number of columns, but also dealing with the confusing grouped headers.
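If you'd rather check the column count programmatically than eyeball a screenshot, here is a small sketch using only Python's standard `csv` module. The `ragged_rows` helper is a made-up name, and the example runs on an inline string rather than the real tables.csv, so treat it as a starting point, not a verdict on Textract's output.

```python
import csv
import io
from collections import Counter


def ragged_rows(csv_text):
    """Return (most common column count, [(row number, column count), ...])
    listing the non-blank rows whose width differs from the most common one."""
    rows = [r for r in csv.reader(io.StringIO(csv_text))
            if any(cell.strip() for cell in r)]  # ignore blank lines
    if not rows:
        return 0, []
    width, _ = Counter(len(r) for r in rows).most_common(1)[0]
    odd = [(i, len(r)) for i, r in enumerate(rows, start=1) if len(r) != width]
    return width, odd


# Made-up CSV text; quoted numbers with thousands separators stay one cell.
example = 'State,Handgun,Total\n"Alpha","1,234","5,678"\n\nBeta,10\n'
print(ragged_rows(example))
```

To run it on the real file, read tables.csv into a string first (for example with `open("tables.csv", newline="").read()`). Row numbers count only non-blank rows.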
Pasting images (via screenshot and clipboard) here so I can get URLs to embed in the actual gist:
Textract demo on AWS console:
https://console.aws.amazon.com/textract/home?region=us-east-1#/demo
Screenshot of the NICS Firearm Background Checks featured on pdfplumber examples: https://github.com/jsvine/pdfplumber/blob/master/examples/pdfs/background-checks.pdf
Screenshot of the tables.csv file that Textract returns when processing the background-checks.pdf:
