Using wget + grep to explore inconveniently organized federal data (FAA Section 333 Exemptions)

if !database: wget + grep

The Federal Aviation Administration is posting PDFs of the Section 333 exemptions that it grants, i.e. the exemptions for operators who want to fly drones commercially before the FAA finishes its rulemaking. A journalist wanted to look for exemptions granted to operators in a given U.S. state. But the FAA doesn't appear to have an easy-to-read data file to use and doesn't otherwise list exemptions by location of operator.

However, since its exemptions page is just one giant HTML table listing the PDFs, we can use wget to fetch all the PDFs, run pdftotext on each file, and then grep to our hearts' content.

But you should try doing this yourself. Maybe you don't need total control over how the documents are collected and filtered (though you should want it if you plan to do any in-depth research while keeping an easy-to-update mirror of the documents), but I find that being able to run a fast, interactive full-text search across a document set usually spawns new ideas and discoveries beyond what you had intended to find. The steps below are repeatable with free software on a *nix system (I'm on OS X 10.10 and used brew to install wget, xpdf, and ack), and the FAA's site is as robust and good as any to practice data-mining on.

1. Wget the page

You could write a web scraper and carefully parse the links. Or you could just notice that everything you need is linked from one HTML page and ends in a .pdf extension, and use wget to download only those URLs.

The following wget command fetches every linked file with a .pdf extension and stores it in a relative directory named faa_333_pdfs. I've used the long versions of the flags in the example below, with --accept, --recursive, and --level being the most important options; check out the wget manual for more information.

wget --accept=pdf --recursive --level=1 \
    --no-directories \
    --directory-prefix=faa_333_pdfs \
    https://www.faa.gov/uas/legislative_programs/section_333/333_authorizations/

Warning: you will end up downloading 1,800+ PDFs, weighing in at a total of 1.6+ gigabytes.
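If you're worried about hammering the FAA's server (or saturating your own connection), wget can throttle itself. This is just a politer variant of the command above; --wait and --limit-rate are standard wget options, and the particular values are arbitrary:

wget --accept=pdf --recursive --level=1 \
    --no-directories \
    --directory-prefix=faa_333_pdfs \
    --wait=1 \
    --limit-rate=500k \
    https://www.faa.gov/uas/legislative_programs/section_333/333_authorizations/

--wait=1 pauses a second between requests and --limit-rate=500k caps the download speed, so the mirror takes longer but is easier on everyone involved.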

2. Extract texts from PDFs

If you have pdftotext (via xpdf) installed on your system, then you can batch-convert everything to text. Here's the command in bash (which is a bit sloppy; you could do it in Python, and I didn't check whether any of the filenames were funky). The following loop runs pdftotext on each filename, then creates a corresponding text file with .txt appended to the name:

find . -name '*.pdf' -print0 | while read -d '' -r fname; do
  echo "$fname"   # print the name for reference
  pdftotext -layout "$fname" "$fname.txt"  
done
A few notes:
  1. Using for fname in *.pdf; do works in this case, but it's better to be safe with the more verbose find command (I used the StackOverflow answer here). I'm honestly terrible at being mindful of the edge cases. Government IT systems seem to be even pickier and more limited in their filename conventions, so it's rare to download a file with non-alphanumeric characters, never mind one with newlines or null characters. But better safe than sorry.
  2. pdftotext threw up some errors, e.g. "Syntax Error: Expected the optional content group list, but wasn't able to find it, or it isn't an Array", and I didn't look into what the deal was. You can (and should) re-run pdftotext yourself once you have the PDFs downloaded; it only takes a few minutes on a modern laptop (a sketch for retrying just the failures follows the screenshot below).
  3. The -layout option outputs the text in a form similar to how it is physically laid out in the PDF (screenshot below). This may or may not be what you really need when doing a full-text, multi-line search:

[Screenshot: text extracted with pdftotext -layout]
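As mentioned in note 2 above, it's worth re-running pdftotext on anything that errored out. Here's a minimal sketch, assuming you're still inside faa_333_pdfs and treating a missing or empty .txt file as a sign that the conversion needs another try:

find . -name '*.pdf' -print0 | while read -d '' -r fname; do
  # only redo PDFs whose text output is missing or empty
  if [ ! -s "$fname.txt" ]; then
    pdftotext -layout "$fname" "$fname.txt"
  fi
done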

3. Text search

Then you can just grep, or whatever. It's probably easiest to use a text/project editor like Sublime or Atom and do a project-folder search over the text files for more interactivity, or even just use the OS X Finder's normal search. I'm using ack in the example below, which is basically grep with more colors:

[Screenshot: searching the extracted text files with ack]
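If you'd rather stay on the command line, a search along these lines works; the state name here is just a stand-in for whatever term you're actually after:

# grep: list every extracted text file that mentions the term
grep -ril --include='*.txt' 'nevada' .

# ack: same idea, and it skips the binary PDFs by default
ack -il 'nevada'

Both commands assume you're running them from inside the directory of downloaded-and-converted files.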

Obviously, turning this into an accessible data table is not something I recommend doing from the command line. But maybe you just need to do a quick look-see, in which case grep is absolutely the way to go, though I prefer ag, which supports multi-line PCRE regex searching (as does ack).

And of course, check out the cleaned and sorted data from The Verge and the Drone Center at Bard College, if only as a reference point.

4. Zip and upload to S3 or what have you

You don't need to care about this, but I do, because I frequently forget how to zip up files and send them to my cloud storage (via the AWS CLI):

cd .. # assuming you were in the downloaded-files subdirectory
zip -r faa_333_pdfs.zip faa_333_pdfs
aws s3 cp faa_333_pdfs.zip s3://mah-s3-bucket/faa_333_pdfs.zip --acl public-read
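Because of the --acl public-read flag, anyone can then pull the archive straight from S3. The bucket name above is obviously a placeholder, and the exact URL style depends on your bucket's region, but it looks something like this:

curl -O https://mah-s3-bucket.s3.amazonaws.com/faa_333_pdfs.zip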

Nice practices

I'm not particularly interested in the Section 333 issue, but when I heard about the problem of getting the data and saw the FAA's webpage, I was initially annoyed because I thought the table pagination was rendered with AJAX calls. That's not a terrible trend, just one that I don't think government sites should follow, for various reasons of accessibility and needless complexity in publishing public info. Also, it's highly irritating when they fail at JavaScript-heavy sites (Santa Clara County's restaurant inspection site comes to mind).

But the FAA site is not too bad. You can't link directly to "Page 12 of 100 search results", but that's probably less important to most people than the highly responsive search box that quickly filters the list with each keystroke. And if JavaScript is unavailable or disabled, you can still Ctrl-F (if you know how) across the entire HTML table.

And of course, one big static HTML page is just perfect for wget's --accept option, which I think is a better tool for this situation than the (admittedly useful) DownThemAll plugin. Running wget on a public government site to mirror pages or grab documents it has published is generally permissible, though I've run into a few federal websites that will block wget unless you change the default user agent. This list of interesting datasets for computational journalists contains a few examples of government one-page file lists suitable for wgetting.

However, most of these multi-part databases require a significant amount of post-download scripting to assemble (here's my Bash and R gist for an older version of the NYPD stop-and-frisk dataset), so using wget to avoid writing a scraper probably won't save you much time if you want to do anything besides grep for strings in delimited text files.
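For reference, the user-agent workaround mentioned above is just a matter of handing wget a different identifying string. The URL and directory below are placeholders; --user-agent is the relevant flag:

wget --user-agent="Mozilla/5.0" \
    --accept=pdf --recursive --level=1 \
    --no-directories \
    --directory-prefix=agency_pdfs \
    https://example.gov/some-report-listing/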
