These instructions will help you better analyze the IRS 990 public dataset. The first thing you'll want to do is to read through the documentation over at Amazon. There's a ~108MB index file called index.json.gz that contains metadata describing the entire corpus.
To download the index.json.gz metadata file, run the following command (the -O flag saves the file to disk instead of dumping the compressed bytes to your terminal):

curl -O https://s3.amazonaws.com/irs-form-990/index.json.gz

Once you've downloaded the index.json.gz file, extract its contents with:

gunzip index.json.gz

To take a peek at the extracted contents, use:

head index.json
Looking at the index.json file, you'll notice that it contains a single JSON structure: an array of objects that look like the following:
{"EIN": "721221647", "SubmittedOn": "2016-02-05", "TaxPeriod": "201412", "DLN": "93493309001115", "LastUpdated": "2016-03-21T17:23:53", "URL": "https://s3.amazonaws.com/irs-form-990/201513099349300111_public.xml", "FormType": "990", "ObjectId": "201513099349300111", "OrganizationName": "FERBER FAMILY OF HOUMA FNDTN CO JEWISH ENDOWMENT FOUNDATION", "IsElectronic": true, "IsAvailable": true}
Each of these records indicates whether an electronic 990 filing is available for an organization. If an organization's electronic 990 filing is not available, "IsAvailable" will be set to false and the record will have no "URL" entry.
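Assuming the index parses as the flat array of records shown above, jq can slice it directly. A minimal sketch, using field names taken from the sample record, that lists just the EIN, organization name, and URL of retrievable Form 990 filings:

```shell
# Keep only records that are actually retrievable ("IsAvailable": true)
# and that are full Form 990s (not 990EZ/990PF), then project a few fields.
# Field names come from the sample index record above.
jq -c '.[]
       | select(.IsAvailable == true and .FormType == "990")
       | {EIN, OrganizationName, URL}' index.json
```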
Once you've figured out which entries you'd like to look at, follow the instructions below to install tools that can help you analyze the data further.
As a note, you don't need to download the entire corpus to perform your analysis nor do you need an Amazon account.
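Since every index record carries its own URL, a single return can be fetched over plain HTTP. A small sketch, using the ObjectId from the sample index record shown earlier:

```shell
# Fetch one filing's XML without an AWS account. Each filing's S3 key
# is its ObjectId plus "_public.xml"; this ObjectId comes from the
# sample index record above.
object_id="201513099349300111"
url="https://s3.amazonaws.com/irs-form-990/${object_id}_public.xml"
curl -s -O "$url"
```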
Pull the business name, person name, title, hours_per_week, and total comp from the Form990PartVIISectionAGrp with a single curl call
curl -s https://s3.amazonaws.com/irs-form-990/201541349349307794_public.xml | xml2json | jq -c '
.Return.ReturnHeader.Filer.BusinessName.BusinessNameLine1Txt."$t" as
$companyName | .Return.ReturnData.IRS990.Form990PartVIISectionAGrp | .[] |
{company: $companyName, name: .PersonNm."$t", title: .TitleTxt."$t", hours_per_week: (.AverageHoursPerWeekRt."$t"|tonumber),
total_comp: ((.ReportableCompFromOrgAmt."$t"|tonumber) + (.ReportableCompFromRltdOrgAmt."$t"|tonumber) +
(.OtherCompensationAmt."$t"|tonumber))}'
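Not every Part VII entry carries all three compensation elements, and jq's tonumber aborts on a missing field. A defensive variant of the same pipeline, using jq's alternative operator (//) to default absent values to "0", keeps going on sparse entries:

```shell
# Same extraction as above, but absent compensation/hours elements fall
# back to "0" via // so tonumber doesn't abort on sparse entries.
curl -s https://s3.amazonaws.com/irs-form-990/201541349349307794_public.xml | xml2json | jq -c '
.Return.ReturnHeader.Filer.BusinessName.BusinessNameLine1Txt."$t" as
$companyName | .Return.ReturnData.IRS990.Form990PartVIISectionAGrp | .[] |
{company: $companyName, name: .PersonNm."$t", title: .TitleTxt."$t",
 hours_per_week: ((.AverageHoursPerWeekRt."$t" // "0")|tonumber),
 total_comp: (((.ReportableCompFromOrgAmt."$t" // "0")|tonumber) +
              ((.ReportableCompFromRltdOrgAmt."$t" // "0")|tonumber) +
              ((.OtherCompensationAmt."$t" // "0")|tonumber))}'
```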
Create csv friendly output including the business name, person name, title, hours_per_week, and total comp from the Form990PartVIISectionAGrp
for f in *.xml; do xml2json < "$f" | jq -r '
.Return.ReturnHeader.Filer.BusinessName.BusinessNameLine1Txt."$t" as
$companyName | .Return.ReturnData.IRS990.Form990PartVIISectionAGrp | map([$companyName, .PersonNm."$t", .TitleTxt."$t",
(.AverageHoursPerWeekRt."$t"|tonumber), ((.ReportableCompFromOrgAmt."$t"|tonumber) +
(.ReportableCompFromRltdOrgAmt."$t"|tonumber) + (.OtherCompensationAmt."$t"|tonumber))]) | .[] | @csv'; done
Create a csv file called "990.csv" that includes the business name, person name, title, hours_per_week, and total comp from the Form990PartVIISectionAGrp
for f in *.xml; do xml2json < "$f" | jq -r '
.Return.ReturnHeader.Filer.BusinessName.BusinessNameLine1Txt."$t" as
$companyName | .Return.ReturnData.IRS990.Form990PartVIISectionAGrp | map([$companyName, .PersonNm."$t", .TitleTxt."$t",
(.AverageHoursPerWeekRt."$t"|tonumber), ((.ReportableCompFromOrgAmt."$t"|tonumber) +
(.ReportableCompFromRltdOrgAmt."$t"|tonumber) + (.OtherCompensationAmt."$t"|tonumber))]) | .[] | @csv'; done > 990.csv
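If the CSV is headed for a spreadsheet or pandas, a header row makes it self-describing. The same loop, writing the header first (the column names are my own labels, not schema element names):

```shell
# Write a header row, then append the per-file rows produced by the
# same jq pipeline as above.
echo '"company","name","title","hours_per_week","total_comp"' > 990.csv
for f in *.xml; do xml2json < "$f" | jq -r '
.Return.ReturnHeader.Filer.BusinessName.BusinessNameLine1Txt."$t" as
$companyName | .Return.ReturnData.IRS990.Form990PartVIISectionAGrp | map([$companyName, .PersonNm."$t", .TitleTxt."$t",
(.AverageHoursPerWeekRt."$t"|tonumber), ((.ReportableCompFromOrgAmt."$t"|tonumber) +
(.ReportableCompFromRltdOrgAmt."$t"|tonumber) + (.OtherCompensationAmt."$t"|tonumber))]) | .[] | @csv'; done >> 990.csv
```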
Pull the business name, person name, title, hours_per_week, and total comp from the Form990PartVIISectionA (the older schema's element names) with a single curl call
curl -s https://s3.amazonaws.com/irs-form-990/201302899349300950_public.xml | xml2json | jq -c '
.Return.ReturnHeader.Filer.Name.BusinessNameLine1."$t" as
$companyName | .Return.ReturnData.IRS990.Form990PartVIISectionA | .[] |
{company: $companyName, name: .NamePerson."$t", title: .Title."$t", hours_per_week: (.AverageHoursPerWeek."$t"|tonumber),
total_comp: ((.ReportableCompFromOrganization."$t"|tonumber) + (.ReportableCompFromRelatedOrgs."$t"|tonumber) +
(.OtherCompensation."$t"|tonumber))}'
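If you're iterating over files from both schema vintages, one sketch is to branch on whichever group element is present (return.xml is a placeholder filename; extend the projected fields with the per-schema names shown in the two examples above):

```shell
# Handle both schema vintages in one pass: check for the newer
# Form990PartVIISectionAGrp element first, fall back to the older
# Form990PartVIISectionA names. return.xml is a placeholder.
xml2json < return.xml | jq -c '
if .Return.ReturnData.IRS990.Form990PartVIISectionAGrp then
  .Return.ReturnData.IRS990.Form990PartVIISectionAGrp | .[] |
  {name: .PersonNm."$t", title: .TitleTxt."$t"}
else
  .Return.ReturnData.IRS990.Form990PartVIISectionA | .[] |
  {name: .NamePerson."$t", title: .Title."$t"}
end'
```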
Pull the business name, person name, title, hours_per_week, and compensation from the OfficerDirectorTrusteeEmplGrp with a single curl call
curl -s https://s3.amazonaws.com/irs-form-990/201502469349200225_public.xml | xml2json | jq -c '
.Return.ReturnHeader.Filer.BusinessName.BusinessNameLine1Txt."$t" as
$companyName | .Return.ReturnData.IRS990EZ.OfficerDirectorTrusteeEmplGrp | .[] |
{company: $companyName, name: .PersonNm."$t", title: .TitleTxt."$t", hours_per_week: (.AverageHrsPerWkDevotedToPosRt."$t"|tonumber),
compensation_amount: (.CompensationAmt."$t"|tonumber)}'
Create csv friendly output including the business name, person name, title, hours_per_week, and compensation from the OfficerDirectorTrusteeEmplGrp
for f in *.xml; do xml2json < "$f" | jq -r '
.Return.ReturnHeader.Filer.BusinessName.BusinessNameLine1Txt."$t" as
$companyName | .Return.ReturnData.IRS990EZ.OfficerDirectorTrusteeEmplGrp | map([$companyName, .PersonNm."$t", .TitleTxt."$t", (.AverageHrsPerWkDevotedToPosRt."$t"|tonumber), (.CompensationAmt."$t"|tonumber)]) | .[] | @csv'; done
Create a csv file called "990EZ.csv" that includes the business name, person name, title, hours_per_week, and compensation from the OfficerDirectorTrusteeEmplGrp
for f in *.xml; do xml2json < "$f" | jq -r '
.Return.ReturnHeader.Filer.BusinessName.BusinessNameLine1Txt."$t" as
$companyName | .Return.ReturnData.IRS990EZ.OfficerDirectorTrusteeEmplGrp | map([$companyName, .PersonNm."$t", .TitleTxt."$t", (.AverageHrsPerWkDevotedToPosRt."$t"|tonumber), (.CompensationAmt."$t"|tonumber)]) | .[] | @csv'; done > 990EZ.csv
Ryan:
Interesting. With respect to the rules and background for this data, the IRS publishes a huge amount of detailed instructions, and they are well worth reading (or at least searching) to find some of the reasoning. Be careful, because the rules sometimes change from year to year. For the example return you cited, https://s3.amazonaws.com/irs-form-990/201541349349307794_public.xml, the Tax Year 2014 instructions are at https://www.irs.gov/pub/irs-prior/i990--2014.pdf and the Schedule J instructions are at https://www.irs.gov/pub/irs-prior/i990sj--2014.pdf.
I haven't run across cases where individuals were omitted from Sched. J but appear in Part VII (or I have forgotten them!), though glancing at the main form instructions, that would appear to be pp. 26-27.
My logic in suggesting the aggregated number is that it saves you having to carefully read the rules and understand how compensation is broken down; also, if the set of fields summed to reach the total ever changed (unlikely!), a hand-computed total would silently drift.
On the other hand, there's also a good argument that the total compensation includes "other reportable compensation" which is sometimes goofy. In particular it can include deferred compensation packages associated with retirement and if you quote that number, you run the risk of people saying "That's not my salary! That's not what's on my W2!". So it does pay to understand the component numbers.
Sorry, I was unclear. I just think it's important not to imply there's a barrier to entry of having to download gigabytes of data. There is already confusion over whether the data is publicly accessible (e.g. some people seemed to think you needed an AWS account to access it, and didn't realize it was also available over HTTP). I wouldn't want someone to read this and conclude "Oh, I need to download ALL the data in order to do anything with it?" Of course, there's plenty of interesting stuff that can only be done with the entire corpus of data.
Though if you're running this on the entire corpus of all returns, performance starts to matter. I would think that XSLT would be faster than converting from XML to JSON and then using 'jq', though definitely that's premature optimization :)
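For what it's worth, here is a sketch of that direct-XML route using xmlstarlet (XPath queries, with XSLT under the hood), skipping the JSON conversion entirely. The http://www.irs.gov/efile namespace URI is my assumption about these returns; verify it against a downloaded file before relying on it:

```shell
# Query the return XML directly: select each Part VII group and emit
# name, title, and org compensation as comma-separated values.
# Namespace URI is an assumption; check it against a real return.
xmlstarlet sel -N e="http://www.irs.gov/efile" -t \
  -m "//e:Form990PartVIISectionAGrp" \
  -v "e:PersonNm" -o "," -v "e:TitleTxt" -o "," \
  -v "e:ReportableCompFromOrgAmt" -n \
  201541349349307794_public.xml
```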