Instructions

These instructions will help you better analyze the IRS 990 public dataset. The first thing you'll want to do is to read through the documentation over at Amazon. There's a ~108MB index file called index.json.gz that contains metadata describing the entire corpus.

To download the index.json.gz metadata file, you'll want to issue the following command: curl https://s3.amazonaws.com/irs-form-990/index.json.gz. Once you've downloaded the index.json.gz file, you can extract its contents with the following command: gunzip index.json.gz. To take a peek at the extracted contents, use the following command: head index.json.

Looking at the index.json file, you'll notice that it contains a json structure represented as a string. It contains an array of json objects that look like the following:

{"EIN": "721221647", "SubmittedOn": "2016-02-05", "TaxPeriod": "201412", "DLN": "93493309001115", "LastUpdated": "2016-03-21T17:23:53", "URL": "https://s3.amazonaws.com/irs-form-990/201513099349300111_public.xml", "FormType": "990", "ObjectId": "201513099349300111", "OrganizationName": "FERBER FAMILY OF HOUMA FNDTN CO JEWISH ENDOWMENT FOUNDATION", "IsElectronic": true, "IsAvailable": true}

Each of these records represents if an electronic 990 filing is available for an organization. If an organization's electronic 990 filing is not available, you'll notice that "IsAvailable" will be set to false and it will be missing a "URL" entry.

Once you've figured out what entries you'd like to look at, please follow the below instructions to install tools that can help you further analyze the data.

As a note, you don't need to download the entire corpus to perform your analysis nor do you need an Amazon account.

Install the following

Form 990

Form 990 2013 / 2014

Pull the business name, person name, title, hours_per_week, and total comp from the Form990PartVIISectionAGrp for a single curl call

curl -s https://s3.amazonaws.com/irs-form-990/201541349349307794_public.xml | xml2json | jq -c '
    .Return.ReturnHeader.Filer.BusinessName.BusinessNameLine1Txt."$t" as
    $companyName | .Return.ReturnData.IRS990.Form990PartVIISectionAGrp | .[] |
    {company: $companyName, name:.PersonNm."$t", title: .TitleTxt."$t", hours_per_week: .AverageHoursPerWeekRt."$t",
    total_comp: ((.ReportableCompFromOrgAmt."$t"|tonumber) + (.ReportableCompFromRltdOrgAmt."$t"|tonumber) +
    (.OtherCompensationAmt."$t"|tonumber))}'

Create csv friendly output including the business name, person name, title, hours_per_week, and total comp from the Form990PartVIISectionAGrp

for f in *.xml; do xml2json < $f | jq -r '
    .Return.ReturnHeader.Filer.BusinessName.BusinessNameLine1Txt."$t" as
    $companyName | .Return.ReturnData.IRS990.Form990PartVIISectionAGrp | map([$companyName, .PersonNm."$t", .TitleTxt."$t", 
    (.AverageHoursPerWeekRt."$t"|tonumber), ((.ReportableCompFromOrgAmt."$t"|tonumber) +
    (.ReportableCompFromRltdOrgAmt."$t"|tonumber) + (.OtherCompensationAmt."$t"|tonumber))]) | .[] | @csv'; done

Create a csv file called "990.csv" that includes the business name, person name, title, hours_per_week, and total comp from the Form990PartVIISectionAGrp

for f in *.xml; do xml2json < $f | jq -r '
    .Return.ReturnHeader.Filer.BusinessName.BusinessNameLine1Txt."$t" as
    $companyName | .Return.ReturnData.IRS990.Form990PartVIISectionAGrp | map([$companyName, .PersonNm."$t", .TitleTxt."$t", 
    (.AverageHoursPerWeekRt."$t"|tonumber), ((.ReportableCompFromOrgAmt."$t"|tonumber) +
    (.ReportableCompFromRltdOrgAmt."$t"|tonumber) + (.OtherCompensationAmt."$t"|tonumber))]) | .[] | @csv'; done > 990.csv

Form 990 2012

Pull the business name, person name, title, hours_per_week, and total comp from the Form990PartVIISectionA for a single curl call

curl -s https://s3.amazonaws.com/irs-form-990/201302899349300950_public.xml | xml2json | jq -c '
    .Return.ReturnHeader.Filer.Name.BusinessNameLine1."$t" as
    $companyName | .Return.ReturnData.IRS990.Form990PartVIISectionA | .[] |
    {company: $companyName, name:.NamePerson."$t", title: .Title."$t", hours_per_week: .AverageHoursPerWeek."$t"|tonumber,
    total_comp: ((.ReportableCompFromOrganization."$t"|tonumber) + (.ReportableCompFromRelatedOrgs."$t"|tonumber) +
    (.OtherCompensation."$t"|tonumber))}'

Form 990EZ

Form 990EZ 2013 / 2014

Pull the business name, person name, title, hours_per_week, and total comp from the OfficerDirectorTrusteeEmplGrp for a single curl call

curl -s https://s3.amazonaws.com/irs-form-990/201502469349200225_public.xml | xml2json | jq -c '
    .Return.ReturnHeader.Filer.BusinessName.BusinessNameLine1Txt."$t" as
    $companyName | .Return.ReturnData.IRS990EZ.OfficerDirectorTrusteeEmplGrp | .[] |
    {company: $companyName, name:.PersonNm."$t", title: .TitleTxt."$t", hours_per_week: .AverageHrsPerWkDevotedToPosRt."$t"|tonumber,
    compensation_amount: (.CompensationAmt."$t"|tonumber)}'

Create csv friendly output including the business name, person name, title, hours_per_week, and total comp from the OfficerDirectorTrusteeEmplGrp

for f in *.xml; do xml2json < $f | jq -r '
    .Return.ReturnHeader.Filer.BusinessName.BusinessNameLine1Txt."$t" as
    $companyName | .Return.ReturnData.IRS990EZ.OfficerDirectorTrusteeEmplGrp  | map([$companyName, .PersonNm."$t", .TitleTxt."$t", (.AverageHrsPerWkDevotedToPosRt."$t"|tonumber), (.CompensationAmt."$t"|tonumber)]) | .[] | @csv'; done

Create a csv file called "990EZ.csv" that includes the business name, person name, title, hours_per_week, and total comp from the OfficerDirectorTrusteeEmplGrp

for f in *.xml; do xml2json < $f | jq -r '
    .Return.ReturnHeader.Filer.BusinessName.BusinessNameLine1Txt."$t" as
    $companyName | .Return.ReturnData.IRS990EZ.OfficerDirectorTrusteeEmplGrp  | map([$companyName, .PersonNm."$t", .TitleTxt."$t", (.AverageHrsPerWkDevotedToPosRt."$t"|tonumber), (.CompensationAmt."$t"|tonumber)]) | .[] | @csv'; done > 990EZ.csv

ryankanno/sample_irs990_curl_xml2json_jq.md

Instructions

Install the following

Examples

Form 990

Form 990 2013 / 2014

Pull the business name, person name, title, hours_per_week, and total comp from the Form990PartVIISectionAGrp for a single curl call

Create csv friendly output including the business name, person name, title, hours_per_week, and total comp from the Form990PartVIISectionAGrp

Create a csv file called "990.csv" that includes the business name, person name, title, hours_per_week, and total comp from the Form990PartVIISectionAGrp

Form 990 2012

Pull the business name, person name, title, hours_per_week, and total comp from the Form990PartVIISectionA for a single curl call

Form 990EZ

Form 990EZ 2013 / 2014

Pull the business name, person name, title, hours_per_week, and total comp from the OfficerDirectorTrusteeEmplGrp for a single curl call

Create csv friendly output including the business name, person name, title, hours_per_week, and total comp from the OfficerDirectorTrusteeEmplGrp

Create a csv file called "990EZ.csv" that includes the business name, person name, title, hours_per_week, and total comp from the OfficerDirectorTrusteeEmplGrp

ryankanno commented Jun 21, 2016

Uh oh!

lecy commented Jul 15, 2016

Uh oh!

ryankanno commented Jul 23, 2016

Uh oh!

lecy commented Jul 24, 2016

Uh oh!

johnhawkinson commented Jul 27, 2016

Uh oh!

johnhawkinson commented Jul 28, 2016

Uh oh!

jedsundwall commented Oct 27, 2016

Uh oh!

rubyshoes commented Dec 13, 2016

Uh oh!