{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Overview\n",
"\n",
"This is a modified version of the <a href=\"http://social-metrics.org/irs-990-e-file-data-part-4/\">fourth in a series of tutorials</a> that illustrate how to download the IRS 990 e-file data available at https://aws.amazon.com/public-data-sets/irs-990/\n",
"\n",
"Specifically, in this notebook we will download into a MongoDB table not just the IRS 990 filings for our sample of 5 organizations, but also the associated \"schedules.\" The first half of the tutorial is the same as part 4, so you can skip over it if you're already comfortable with the material."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load Packages"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import sys\n",
"import time\n",
"import json"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/Users/gsaxton/Dropbox/990 e-file data\n"
]
}
],
"source": [
"cd '/Users/gsaxton/Dropbox/990 e-file data'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### MongoDB\n", | |
"Depending on the project, I will store the data in SQLite or MongoDB. This time I'll use MongoDB -- it's great for storing JSON data where each observation could have different variables. Before we get to the interesting part the following code blocks set up the MongoDB environment and the new database we'll be using. \n", | |
"\n", | |
"**_Note:_** In a terminal we'll have to start MongoDB by running the command *mongod* or *sudo mongod*. Then we run the following code block here to access MongoDB." | |
] | |
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pymongo\n",
"from pymongo import MongoClient\n",
"client = MongoClient()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Connect to our database that contains the filing index data we downloaded earlier."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# DEFINE MY mongoDB DATABASE\n",
"db = client['irs_990_db']\n",
"\n",
"# DEFINE MY COLLECTION WHERE I'LL INSERT MY SEARCH\n",
"file_list = db['990_files_2011_2018']"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[u'_id_', u'ObjectId_1']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(db.file_list.index_information())" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<br>Check how many observations in the database table." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"2373310" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"file_list.count()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Loop over List of Filings, grab e-file data, and insert into second database\n", | |
"First we'll write a function to turn an ordered dictionary (which is what is returned by *xmltodict*) into a normal Python dictionary so that we can combine it with the filing data gathered above.\n", | |
"\n", | |
"What I'm doing in the next five or so blocks of code is looping over the *list* of 3 million plus filings that I've previously downloaded, and then downloading the full e-file 990 filings for all orgs that match one of our 8,304 EINs." | |
] | |
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from json import loads, dumps\n",
"from collections import OrderedDict\n",
"\n",
"def to_dict(input_ordered_dict):\n",
"    return loads(dumps(input_ordered_dict))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Let's create a list of five EINs."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"eins = ['010202467', '010211478', '010211513', '010211530', '010211543']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n", | |
"Check how many filings there are for this sample of 5 EINs. Here we are access the <code>file_list</code> collection and counting the number of rows in the database that match the EINs in the above list. There is more than one filing per organization given the multiple years in our data. " | |
] | |
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"33"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"file_list.find({'EIN': { '$in': eins}}).count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>We can loop over each of these rows in our MongoDB collection. Let's show the first filing."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{u'OrganizationName': u'MAINE HISTORICAL SOCIETY', u'ObjectId': u'201100709349300510', u'URL': u'https://s3.amazonaws.com/irs-form-990/201100709349300510_public.xml', u'SubmittedOn': u'2011-03-22', u'DLN': u'93493070005101', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201009', u'FormType': u'990', u'_id': ObjectId('5ad8e83435fd3fae98d6464a'), u'EIN': u'010211530'}\n"
]
}
],
"source": [
"for file in file_list.find({'EIN': { '$in': eins}})[:1]:\n",
"    print file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>To recall, our current dataset contains basic details on all 2,373,310 990 filings. We still don't have the actual 990 data, however. To get that, we will have to pick which filings we want and then access the *URL* column for that filing as seen in our dataset. What we are going to want to do later on is loop over all 33 rows in our database and access each filing by visiting the link shown in the *URL* field. The code block below prints out the URL for the first filing by our list of 5 EINS. " | |
] | |
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://s3.amazonaws.com/irs-form-990/201100709349300510_public.xml\n"
]
}
],
"source": [
"for file in file_list.find({'EIN': { '$in': eins}})[:1]:\n",
"    print file['URL']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>To access the filing we are going to use a Python library called <code>urllib2</code>. In the following code block we access the first matching row in our MongoDB collection, visit the *URL* and assign the page data to a variable called *url_data*, read in the data and assign it to a variable called *f_string*, and then print out the first 1000 characters of data."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://s3.amazonaws.com/irs-form-990/201100709349300510_public.xml\n",
"<?xml version=\"1.0\" encoding=\"utf-8\"?>\n",
"<Return xmlns=\"http://www.irs.gov/efile\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.irs.gov/efile\" returnVersion=\"2009v1.7\">\n",
" <ReturnHeader binaryAttachmentCount=\"0\">\n",
" <Timestamp>2011-03-11T11:03:33-06:00</Timestamp>\n",
" <TaxPeriodEndDate>2010-09-30</TaxPeriodEndDate>\n",
" <PreparerFirm>\n",
" <PreparerFirmBusinessName>\n",
" <BusinessNameLine1>MARSHALL & LIBBY LLC</BusinessNameLine1>\n",
" </PreparerFirmBusinessName>\n",
" <PreparerFirmUSAddress>\n",
" <AddressLine1>2367 CONGRESS ST</AddressLine1>\n",
" <City>PORTLAND</City>\n",
" <State>ME</State>\n",
" <ZIPCode>041021932</ZIPCode>\n",
" </PreparerFirmUSAddress>\n",
" </PreparerFirm>\n",
" <ReturnType>990</ReturnType>\n",
" <TaxPeriodBeginDate>2009-10-01</TaxPeriodBeginDate>\n",
" <Filer>\n",
" <EIN>010211530</EIN>\n",
" <Name>\n",
" <BusinessNameLine1>MAINE HISTORICAL SOCIETY</BusinessNameLine1>\n",
" </Name>\n",
" <\n"
]
}
],
"source": [
"import xmltodict\n",
"import urllib2\n",
"for file in file_list.find({'EIN': { '$in': eins}})[:1]:\n",
"    print file['URL']\n",
"    url_data = urllib2.urlopen(file['URL'])\n",
"    f_string = url_data.read()\n",
"    print f_string[:1000]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>The above data are in XML format. We will need to convert them to JSON format in order to insert into MongoDB. For that we will leverage the Python module <code>xmltodict</code>."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://s3.amazonaws.com/irs-form-990/201100709349300510_public.xml\n"
]
}
],
"source": [
"import xmltodict\n",
"import urllib2\n",
"for file in file_list.find({'EIN': { '$in': eins}})[:1]:\n",
"    print file['URL']\n",
"    url_data = urllib2.urlopen(file['URL'])\n",
"    f_string = url_data.read()\n",
"    data = xmltodict.parse(f_string)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Now the data are in a dictionary format. Let's check which keys are present."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[u'Return']"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>We see there is nothing relevant there, so let's drop down a level." | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[u'@xmlns',\n",
" u'@xmlns:xsi',\n",
" u'@xsi:schemaLocation',\n",
" u'@returnVersion',\n",
" u'ReturnHeader',\n",
" u'ReturnData']"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Return'].keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>The first four keys do no contain much useful information." | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"u'http://www.irs.gov/efile'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Return']['@xmlns']"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"u'http://www.w3.org/2001/XMLSchema-instance'"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Return']['@xmlns:xsi']"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"u'http://www.irs.gov/efile'"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Return']['@xsi:schemaLocation']"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"u'2009v1.7'"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Return']['@returnVersion']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>The <code>ReturnHeader</code> key contains 12 sub-keys. Here we have information that could be useful for certain research projects. For most purposes, however, we can skip this." | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[u'@binaryAttachmentCount',\n",
" u'Timestamp',\n",
" u'TaxPeriodEndDate',\n",
" u'PreparerFirm',\n",
" u'ReturnType',\n",
" u'TaxPeriodBeginDate',\n",
" u'Filer',\n",
" u'Officer',\n",
" u'Preparer',\n",
" u'TaxYear',\n",
" u'BuildTS']"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Return']['ReturnHeader'].keys()"
]
},
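{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>As a minimal sketch (not part of the original tutorial), here is how we might pull a few individual header fields; the paths below are taken from the raw XML snippet printed earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# GRAB A FEW FIELDS FROM THE RETURN HEADER (PATHS AS SEEN IN THE RAW XML ABOVE)\n",
"header = data['Return']['ReturnHeader']\n",
"print header['ReturnType'], header['TaxPeriodEndDate']\n",
"print header['Filer']['EIN']"
]
},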
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Lastly, we inspect the <code>ReturnData</code> key. The first key simply provides a count of the number of documents (in this case, 7); the remaining keys hold the 990 return data and any of the associated \"schedules.\""
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[u'@documentCount',\n",
" u'IRS990',\n",
" u'IRS990ScheduleA',\n",
" u'IRS990ScheduleB',\n",
" u'IRS990ScheduleD',\n",
" u'IRS990ScheduleM',\n",
" u'IRS990ScheduleO',\n",
" u'IRS990ScheduleR']"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Return']['ReturnData'].keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>In almost all cases, what we want is the 990 data. As we can see here, there are 207 keys nested under the <code>IRS990</code> key."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# of keys in 990 data: 207 \n",
"\n",
"First 5 keys: [u'@documentId', u'@referenceDocumentId', u'NameOfPrincipalOfficerPerson', u'AddressPrincipalOfficerUS', u'GrossReceipts']\n"
]
}
],
"source": [
"print \"# of keys in 990 data:\", len(data['Return']['ReturnData']['IRS990'].keys()), '\\n'\n",
"print \"First 5 keys:\", data['Return']['ReturnData']['IRS990'].keys()[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>In many cases, such as the fourth key above, <code>AddressPrincipalOfficerUS</code>, there are multiple keys nested under it. Depending on your data needs you can \"flatten\" these data as desired." | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"OrderedDict([(u'AddressLine1', u'485 CONGRESS STREET'),\n",
" (u'City', u'PORTLAND'),\n",
" (u'State', u'ME'),\n",
" (u'ZIPCode', u'04101')])"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Return']['ReturnData']['IRS990']['AddressPrincipalOfficerUS']"
]
},
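{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Here is a minimal sketch of one way to do that flattening. The helper <code>flatten_dict</code> and its underscore separator are our own illustrative choices, not part of the tutorial's pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def flatten_dict(d, parent_key='', sep='_'):\n",
"    # RECURSIVELY FLATTEN A NESTED DICTIONARY, JOINING KEY PATHS WITH sep\n",
"    items = {}\n",
"    for k, v in d.iteritems():\n",
"        new_key = parent_key + sep + k if parent_key else k\n",
"        if isinstance(v, dict):\n",
"            items.update(flatten_dict(v, new_key, sep=sep))\n",
"        else:\n",
"            items[new_key] = v\n",
"    return items\n",
"\n",
"# E.G., THE NESTED ADDRESS BECOMES KEYS LIKE 'AddressPrincipalOfficerUS_City'\n",
"print flatten_dict({'AddressPrincipalOfficerUS': data['Return']['ReturnData']['IRS990']['AddressPrincipalOfficerUS']})"
]
},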
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>In brief, what we want to do is write code that will loop over all relevant rows in our MongoDB collection, visit the respective URL where the 990 data are located, grab those data, convert them to a dictionary, and then insert into a new MongoDB collection."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set up new MongoDB collection\n",
"In this second collection we will be inserting the actual 990 data we will be downloading."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# DEFINE MY COLLECTION (DATABASE TABLE) WHERE I'LL INSERT THE FILINGS\n",
"filings = db['filings_test']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Set a unique constraint on the collection based on *URL*. This will avert duplicates from being inserted." | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"u'URL_1'"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db.filings.create_index([('URL', pymongo.ASCENDING)], unique=True)" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[u'_id_', u'URL_1']"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(db.filings.index_information())" | |
] | |
}, | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>There are no filings yet."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"filings.count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download IRS 990 Data and Available Schedules\n",
"Now let's run the loop for all five EINs, grab the 33 IRS 990 filings, and insert them into our new MongoDB table. This block has some additional code that I won't discuss in detail (see the comments below for further details). The short answer is that we are looping over each row in our database, visiting the URL that contains the 990 data, and then grabbing all of the data returned by the <code>IRS990</code> key. For convenience, we then combine these new data with the associated filing index data from our first database, and insert the combined data into our new <code>filings</code> collection.\n",
"\n",
"Note that this code block will only work for organizations that have the <code>IRS990</code> key. The check is found in the following line of code:\n",
"\n",
"    <code>if 'IRS990' in datax['Return']['ReturnData']:</code>\n",
"\n",
"This means that organizations filing *990EZ* or *990PF* will be skipped. However, the code block could easily be modified to grab 990EZ or 990PF filings (see the sketch after the loop).\n",
"\n",
"Compared to the <a href=\"http://social-metrics.org/irs-990-e-file-data-part-4/\">simpler version of this tutorial</a>, we are also adding columns for five keys that are nested under the <code>['Return']</code> key. If you're not familiar with Python, this line of code from the block we'll be using may seem perplexing:\n",
"\n",
"    <code>return_info = {k:v for k,v in datax['Return'].iteritems() if k not in ('ReturnData',)}</code>\n",
"\n",
"What this code does is assign to a new dictionary called *return_info* every key and value nested under <code>datax['Return']</code> except for <code>datax['Return']['ReturnData']</code>. The latter contains our 990 data, so we are going to deal with that separately. So, *return_info* will contain one column for each of the following keys: <code>'@xmlns', '@xmlns:xsi', '@xsi:schemaLocation', '@returnVersion', 'ReturnHeader'</code>. We are not likely to use these data, but it will be good to have them handy in case we need them. (Note the trailing comma in <code>('ReturnData',)</code>: it makes the exclusion list a one-element tuple; without it, the test would run against the bare string instead. A toy example of this key-filtering pattern follows this cell.)\n",
"\n",
"Similarly, with the following line we create a new dictionary that will contain all of the keys nested under <code>datax['Return']['ReturnData']</code> except for <code>datax['Return']['ReturnData']['IRS990']</code>:\n",
"\n",
"    <code>schedules = {k:v for k,v in datax['Return']['ReturnData'].iteritems() if k not in ('IRS990',)}</code>\n",
"\n",
"We then combine the 9 columns from our existing MongoDB collection, the 5 columns from *return_info*, the available columns in the dictionary *schedules*, and the 200+ columns from <code>datax['Return']['ReturnData']['IRS990']</code>. This combined dictionary is then added to our new MongoDB *filings* collection.\n",
"\n",
"What we are doing here is deciding not to \"flatten\" the data available in the schedules. Instead, the focus of this dataset is the actual core 990 data. Our resultant dataset will have a dozen or so \"background\" or filing detail columns, a half-dozen or so columns containing data on any available schedules, and then 200+ columns containing the 990 data. For most research purposes this will suffice and will minimize the need to flatten keys with nested data. But whenever the need arises the data will be there waiting for us."
]
},
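{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>A quick toy illustration of that key-filtering dict comprehension (the dictionary below is invented purely for demonstration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"toy = {'a': 1, 'b': 2, 'c': 3}\n",
"# KEEP EVERY KEY EXCEPT THOSE NAMED IN THE TUPLE; THE TRAILING COMMA\n",
"# MAKES ('c',) A ONE-ELEMENT TUPLE RATHER THAN THE PLAIN STRING 'c'\n",
"kept = {k: v for k, v in toy.iteritems() if k not in ('c',)}\n",
"print kept"
]
},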
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('counter: ', 33, u'BOYS AND GIRLS CLUBS OF SOUTHERN MAINE') \n",
"\n",
"# of minutes: 0.264805483818 \n",
"\n",
"# of filings added to database: 33 \n",
"\n"
]
}
],
"source": [
"import xmltodict\n",
"import urllib2\n",
"import timeit\n",
"\n",
"start_time = timeit.default_timer()\n",
"from IPython.display import display, clear_output   ##### FOR USE WITH STDOUT (DYNAMIC, SINGLE-LINE PRINTING)\n",
"\n",
"start_count = filings.count()\n",
"counter = 0\n",
"for f in file_list.find({'EIN': { '$in': eins}}):\n",
"    counter += 1\n",
"\n",
"    print f\n",
"    if 'URL' in f:\n",
"        url = f['URL']\n",
"        print url, '\\n'\n",
"        url_data = urllib2.urlopen(url)\n",
"        f_string = url_data.read()\n",
"        datax = xmltodict.parse(f_string)\n",
"\n",
"        try:\n",
"            if 'IRS990' in datax['Return']['ReturnData']:   ##### CHECK WHETHER THE 'IRS990' KEY IS PRESENT IN THE DATA\n",
"                ##### CREATE DICTIONARY CONTAINING ALL KEYS NESTED UNDER 'Return' EXCEPT FOR THE 'ReturnData' KEY\n",
"                ##### THERE WILL BE FIVE KEYS IN return_info: '@xmlns', '@xmlns:xsi', '@xsi:schemaLocation', '@returnVersion', 'ReturnHeader'\n",
"                return_info = {k:v for k,v in datax['Return'].iteritems() if k not in ('ReturnData',)}\n",
"                ##### CREATE DICTIONARY CONTAINING ALL KEYS IN ['ReturnData'] EXCEPT FOR THE 'IRS990' KEY\n",
"                ##### THERE WILL BE A KEY CALLED '@documentCount' (OR '@documentCnt') AS WELL AS ONE PER INCLUDED SCHEDULE\n",
"                schedules = {k:v for k,v in datax['Return']['ReturnData'].iteritems() if k not in ('IRS990',)}\n",
"                ##### CREATE DICTIONARY FOR THE 990 DATA\n",
"                data = datax['Return']['ReturnData']['IRS990']\n",
"                data = to_dict(data)\n",
"\n",
"                ##### COMBINE THE DICT OF FILING INFO FROM THE FIRST STEP WITH THE FILING DATA GATHERED HERE\n",
"                c = {key: value for (key, value) in (f.items() + return_info.items() + schedules.items() + data.items())}\n",
"                c.pop('_id', None)   ##### DROP '_id' (OR IT WILL NOT INSERT)\n",
"                t = json.dumps(c)\n",
"                loaded_entry = json.loads(t)\n",
"                try:\n",
"                    filings.insert_one(loaded_entry)\n",
"                except pymongo.errors.DuplicateKeyError, e:\n",
"                    print e, '\\n'\n",
"\n",
"        except KeyError:\n",
"            print 'IRS990 key not in data (likely a 990EZ or 990PF filer)'\n",
"            print datax['Return']['ReturnData'].keys()\n",
"\n",
"    else:\n",
"        print f['IsAvailable']\n",
"\n",
"    clear_output()\n",
"    print ('counter: ', counter, f['OrganizationName']), '\\n'\n",
"    sys.stdout.flush()\n",
"\n",
"elapsed = timeit.default_timer() - start_time\n",
"print '# of minutes: ', elapsed/60, '\\n'\n",
"print \"# of filings added to database:\", filings.count() - start_count, '\\n'"
]
},
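{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>As an aside, here is a rough sketch (our own, untested here) of how the check could be generalized to branch on whichever form key is present -- <code>IRS990</code>, <code>IRS990EZ</code>, or <code>IRS990PF</code>:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_form_data(return_data):\n",
"    # RETURN THE FORM TYPE AND ITS DATA, WHICHEVER OF THE THREE KEYS IS PRESENT\n",
"    for form in ('IRS990', 'IRS990EZ', 'IRS990PF'):\n",
"        if form in return_data:\n",
"            return form, return_data[form]\n",
"    return None, None\n",
"\n",
"# datax STILL HOLDS THE LAST FILING PARSED BY THE LOOP ABOVE\n",
"form_type, form_data = get_form_data(datax['Return']['ReturnData'])\n",
"print form_type"
]
},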
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>What we now have is a new MongoDB table, called *filings*, that contains the 33 IRS 990 filings for our sample of 5 nonprofit organizations. Each row in our database table (or, in MongoDB language, each *document* in our database *collection*) includes both the filing information from our first MongoDB table and the 990 data we have just downloaded.\n",
"\n",
"In addition, unlike the other version of part 4 of the tutorial, we have added in 5 additional keys from the <code>['Return']</code> key as well as all available keys from the <code>['ReturnData']</code> key -- that is, the keys containing the \"schedules.\"\n",
"\n",
"In the last download above, for instance, there are 9 columns from our first 'index' MongoDB collection, 216 columns of <code>IRS990</code> filing data, 5 new columns from the <code>['Return']</code> key, and 7 new columns comprising <code>['@documentCnt']</code> plus 6 schedules under the following keys: <code>['IRS990ScheduleA', 'IRS990ScheduleB', 'IRS990ScheduleD', 'IRS990ScheduleG', 'IRS990ScheduleM', 'IRS990ScheduleO']</code>. That makes a total of 237 columns (or *fields*) in our DB."
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# of columns from first MongoDB collection: 9\n",
"# of columns from ['Return'] key: 5\n",
"# of columns from ['ReturnData'] key: 7\n",
"# of columns from ['IRS990'] key: 216\n",
"Total # of columns inserted into *filings* MongoDB collection: 237\n"
]
}
],
"source": [
"print \"# of columns from first MongoDB collection:\", len(f.keys()) - 1\n",
"print \"# of columns from ['Return'] key:\", len(return_info.keys())\n",
"print \"# of columns from ['ReturnData'] key:\", len(schedules.keys())\n",
"print \"# of columns from ['IRS990'] key:\", len(data.keys())\n",
"print \"Total # of columns inserted into *filings* MongoDB collection:\", len(c.keys())"
]
},
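{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>As a final sanity check -- a minimal sketch of our own -- we can pull one document back out of the new collection with <code>find_one</code>, projecting just a few fields we know are present:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# RETRIEVE ONE OF THE NEWLY INSERTED FILINGS, SHOWING ONLY A FEW FIELDS\n",
"doc = filings.find_one({'EIN': '010211530'},\n",
"                       {'OrganizationName': 1, 'TaxPeriod': 1, 'GrossReceipts': 1, '_id': 0})\n",
"print doc"
]
},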
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>As noted above, our code will only work for organizations that have the <code>IRS990</code> key. We won't get data for 990EZ or 990PF filers. Focusing on one of the three groups of filers would be the typical research design decision. In the next tutorial we will modify our code to download the 990EZ and 990PF data."
]
}
],
"metadata": { | |
"kernelspec": { | |
"display_name": "Python [default]", | |
"language": "python", | |
"name": "python2" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 2 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.13" | |
}, | |
"toc": { | |
"colors": { | |
"hover_highlight": "#DAA520", | |
"navigate_num": "#000000", | |
"navigate_text": "#333333", | |
"running_highlight": "#FF0000", | |
"selected_highlight": "#FFD700", | |
"sidebar_border": "#EEEEEE", | |
"wrapper_background": "#FFFFFF" | |
}, | |
"moveMenuLeft": true, | |
"nav_menu": { | |
"height": "279px", | |
"width": "252px" | |
}, | |
"navigate_menu": true, | |
"number_sections": true, | |
"sideBar": true, | |
"threshold": 4, | |
"toc_cell": false, | |
"toc_section_display": "block", | |
"toc_window_display": false, | |
"widenNotebook": false | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 1 | |
} |