AWS Textract is now out of closed beta. You can read the features page here, and you can also read about its limits here (e.g. no handwriting). Basically, if you've ever had to deal with the hell of getting structured data out of a PDF (scanned image or not), Textract is aiming for your business:
This short gist contains some of my brief observations about Textract and its demo, as well as direct links to the most relevant and important files, such as the Textract demo sample image and the resulting data files from Textract's API. If you have an AWS account, I highly recommend trying the interactive demo for yourself, because it's easy and because you can upload and test a document/image of your own choosing. Here's the URL for the Textract demo:
https://console.aws.amazon.com/textract/home?region=us-east-1#/demo
Note: I did try Textract on a real-world PDF and it...did not do very well. I might write up some details if I have time to do more testing. The PDF was a page from the FBI's National Instant Criminal Background Check System (NICS). I got the idea and the pdf sample from Jeremy Singer-Vine's excellent pdfplumber repo, which includes -- among many useful resources -- a Jupyter notebook showing how to use his pdfplumber library to extract tabular data from the FBI PDF.
However, if you're only casually curious about Textract and prefer to read a short writeup/screenshots of the demo and its data files, then read on:
The Textract demo page conveniently provides a preloaded sample file; here's a screenshot of what the demo page looks like on first visit:
On the left side is an image of the sample file for Textract to process. On the right side is a preview of Textract's analysis (not sure if the results are canned, given that the sample image is canned). In the screenshot above, the preview shows the "Raw text" -- i.e. the exact text strings extracted by Textract's OCR from the sample image.
Textract accepts files in JPEG, PNG, or PDF format. The sample file (seen in the left panel of the demo) is not a PDF file, but an image in PNG format. Basically a screenshot of a Word doc/PDF. This document happens to contain a simple data table and a filled-out form. And here's the actual image as processed by the Textract demo:
https://s3.us-east-1.amazonaws.com/textract-public-assets-us-east-1/default_document.png
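If you'd rather script this than click through the demo, here is a minimal sketch of sending the same image to Textract with boto3's `analyze_document` call. It assumes boto3 is installed, AWS credentials are already configured, and the sample PNG has been saved locally as `default_document.png`; the helper names `build_request` and `analyze` are mine, not part of any AWS library.

```python
from pathlib import Path


def build_request(image_path):
    """Build the keyword arguments for a synchronous AnalyzeDocument call."""
    return {
        "Document": {"Bytes": Path(image_path).read_bytes()},
        # TABLES and FORMS request table cells and key-value pairs
        # on top of the plain OCR lines and words.
        "FeatureTypes": ["TABLES", "FORMS"],
    }


def analyze(image_path, region="us-east-1"):
    """Send a local image to Textract and return the raw response dict."""
    # Imported lazily so build_request() can be used without boto3 installed.
    import boto3

    client = boto3.client("textract", region_name=region)
    return client.analyze_document(**build_request(image_path))


if __name__ == "__main__":
    # Assumption: the sample image was downloaded from the URL above.
    response = analyze("default_document.png")
    print(len(response["Blocks"]), "blocks returned")
```

The response is a plain dict, so it can be dumped with `json.dump` and inspected offline.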
It goes without saying that Textract perfectly processes this sample file, including the OCR of the raw text, extracting the data table and its structure, and even the key-value pairs of the document's "form". The right-pane of the demo lets you quickly preview Textract's analysis. For example, clicking the Tables tab shows a friendly preview of the extracted tabular data:
So the Textract demo's perfect results aren't interesting. However, the data format and details of Textract's resulting analysis are interesting, so I've attached them to this gist. If you click the Download Results button on the Textract demo site, it sends you a zipped file of the 4 raw data files; I've uploaded a copy of that results zip file to GitHub and you can download it via this url.
But to save you the time of downloading and opening that (admittedly boring) zipfile, I've also attached each of its 4 data files to this gist, so you can view the contents on this same webpage:
(click the filename to jump to the respective file as attached to the gist)
- apiResponse.json: the granular JSON data from Textract's API, which includes a confidence value and bounding-box coordinates for every detected object, including words, lines, and cells.
- keyValues.csv: the list of detected form fields and values -- e.g. "Full Name: " and "Jane Doe " -- as a simple CSV.
- rawText.txt: a simple stream of every text string detected by Textract's OCR.
- tables.csv: the tabular row/column data as detected by Textract's algorithm.
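To give a feel for how the granular JSON relates to tables.csv, here is a rough sketch that rebuilds table rows from Textract-style blocks. It relies on the documented block fields (`BlockType`, `Id`, `Relationships`, `RowIndex`, `ColumnIndex`, `Text`); the `SAMPLE_BLOCKS` data is a tiny hand-made stand-in, not taken from the real apiResponse.json, and `table_rows` is a hypothetical helper name.

```python
def table_rows(blocks):
    """Return one list of rows (each a list of cell strings) per TABLE block."""
    by_id = {b["Id"]: b for b in blocks}

    def children(block):
        # Follow CHILD relationships to the referenced blocks.
        ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return [by_id[i] for i in ids if i in by_id]

    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = {}
        for cell in children(table):
            if cell["BlockType"] != "CELL":
                continue
            words = [w["Text"] for w in children(cell) if w["BlockType"] == "WORD"]
            # RowIndex and ColumnIndex are 1-based.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return tables


# Hypothetical sample: a 2x2 table whose last cell is empty.
SAMPLE_BLOCKS = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Start"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Date"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Jane"},
]

if __name__ == "__main__":
    for row in table_rows(SAMPLE_BLOCKS)[0]:
        print(row)
```

To try it on the real download, something like `table_rows(json.load(open("apiResponse.json"))["Blocks"])` should work, assuming the file mirrors the API response and has a top-level `Blocks` list.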
For the folks hoping Textract is the solution to their pdf-to-csv problems, I suspect the tables.csv will be the most exciting thing. However, like I mentioned at the top of this gist, the only other document I tried with Textract was an annoying and complicated example courtesy of the FBI NICS PDF. I regret to say that Textract failed, and it failed in unexpected and hard-to-mitigate ways. And it's worth noting that ABBYY FineReader for macOS seems to do a near-perfect job of extraction.
This is not a death sentence for Textract -- I think I can imagine what the problem is, and that it's a very fixable problem for the AWS team weeks/months down the road as they iterate on the product. If I have time later I'll do some more testing and write up the details. But if you want to know about Textract now, then all I can say is that Textract, as of a few hours ago, does not seem reliable enough for the specific use case of the FBI NICS PDF.
But I also saw plenty of promising things in my brief test of Textract. For a service that requires little bespoke programming (i.e. what you'd use jsvine's pdfplumber for) and, even with the FBI PDF, produces admirable OCR results and tabular structure (90% of the time, for now), Textract looks very promising. I haven't used the Google Cloud Vision API in a while, so I'm not sure how it compares to that.
And I suppose the ultimate real-world test for Textract (and Google Cloud Vision) is something like the Senate financial disclosure forms, which include PDFs consisting of scanned paper forms:
I have an incomplete repo -- dannguyen/abbyy-finereader-ocr-senate -- where I have a writeup of the issue and examples, and the results from using ABBYY FineReader (for Mac OS X), which, like Textract, offers both OCR text extraction and tabular data extraction. I haven't looked at it in a while, but I believe the results were far from reliable -- which was to be expected given how hard those 2 extraction problems are. And I assume Textract will have largely the same kinds of problems...
But for document PDFs like the FBI NICS sample page -- i.e. "simple" PDFs that aren't images and don't need OCR -- I, and I think most data wrangling professionals, will be more than ecstatic if Textract and its rivals offer improved results when it comes to the seemingly mundane need for good tabular data extraction. Right now I'm OK with ABBYY FineReader and its mostly reliable PDF-to-CSV results. But a scriptable API like Textract that provides granular data for diagnosis and additional hacking/tweaking (i.e. apiResponse.json) is going to be much more useful for data journalists/analysts in the long run.
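As one example of the kind of diagnosis the granular data allows, here is a small sketch that lists the words Textract was least sure about, along with where they sit on the page. It assumes each block carries a `Confidence` score on a 0-100 scale and a `Geometry.BoundingBox` with `Left`/`Top` ratios, as in the Textract API response; the 90.0 threshold and the sample blocks are arbitrary, made-up values for illustration.

```python
def low_confidence(blocks, threshold=90.0, block_type="WORD"):
    """Return (text, confidence, left, top) tuples below threshold, worst first."""
    flagged = []
    for block in blocks:
        if block.get("BlockType") != block_type:
            continue
        confidence = block.get("Confidence")
        if confidence is None or confidence >= threshold:
            continue
        box = block.get("Geometry", {}).get("BoundingBox", {})
        flagged.append(
            (block.get("Text", ""), confidence, box.get("Left"), box.get("Top"))
        )
    return sorted(flagged, key=lambda item: item[1])


# Hypothetical blocks, shaped like Textract output but not from a real response.
SAMPLE_WORDS = [
    {"BlockType": "WORD", "Text": "Jane", "Confidence": 99.5,
     "Geometry": {"BoundingBox": {"Left": 0.10, "Top": 0.20,
                                  "Width": 0.05, "Height": 0.02}}},
    {"BlockType": "WORD", "Text": "D0e", "Confidence": 61.2,
     "Geometry": {"BoundingBox": {"Left": 0.16, "Top": 0.20,
                                  "Width": 0.04, "Height": 0.02}}},
    {"BlockType": "LINE", "Text": "Jane D0e", "Confidence": 80.0},
]

if __name__ == "__main__":
    for text, confidence, left, top in low_confidence(SAMPLE_WORDS):
        print(f"{text!r}: {confidence:.1f} at left={left}, top={top}")
```

A list like this is a cheap way to decide which cells need a human look before trusting the CSV output.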
Big note: This gist is already too long, so I don't have time to adequately mention the open source libraries like Tabula and Camelot, which also provide powerful and scriptable workflows for pdf-to-csv, and unlike Amazon/Google cloud, provide them for free. But later!
another image to embed: the Textract demo's Tables preview: