kenorb · October 6, 2020 21:23
diff --git a/IRS 990 e-File Data (2) -- Load Index Data and Insert into MongoDB.ipynb b/IRS 990 e-File Data (2) -- Load Index Data and Insert into MongoDB.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Overview\n",
    "\n",
    "This is the second in a series of tutorials that illustrate how to download the IRS 990 e-file data available at https://aws.amazon.com/public-data-sets/irs-990/\n",
    "\n",
    "In this second notebook we will load the index files we download in <a href=\"http://social-metrics.org/irs-990-e-file-data-part-1/\">part 1</a> and insert them into a MongoDB database."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load packages and set working directory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import time\n",
    "import json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/gsaxton/Dropbox/990 e-file data\n"
     ]
    }
   ],
   "source": [
    "cd '/Users/gsaxton/Dropbox/990 e-file data' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### MongoDB\n",
    "Depending on the project, I will store the data in SQLite or MongoDB. This time I'll use MongoDB -- it's great for storing JSON data where each observation could have different variables. Before we get to the interesting part the following code blocks set up the MongoDB environment and the new database we'll be using. \n",
    "\n",
    "**_Note:_** In a terminal we'll have to start MongoDB by running the command *mongod* or *sudo mongod*. Then we run the following code block here to access MongoDB."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pymongo\n",
    "from pymongo import MongoClient\n",
    "client = MongoClient()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "We first have to define a database and then a table or *collection* where for storing the File listing information we downloaded in part 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# DEFINE THE MONGODB DATABASE\n",
    "db = client['irs_990_db']\n",
    "\n",
    "# DEFINE THE COLLECTION WHERE I'LL INSERT THE DATA\n",
    "file_list = db['990_files_2011_2018']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>We will also set an index on the collection to keep duplicates from being inserted"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "u'ObjectId_1'"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "db.file_list.create_index([('ObjectId', pymongo.ASCENDING)], unique=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'_id_', u'ObjectId_1']"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(db.file_list.index_information())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Check how many observations in the database table. It will be zero until we add data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "file_list.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load data\n",
    "Let's load the dictionary we download in part 1. It is a big file: ~911MB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pickle\n",
    "with open('data.pickle', 'rb') as handle:\n",
    "    data = pickle.load(handle)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Inspecting the data, we see that as expected there are 8 major *keys* in the dictionary -- one per year"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8 \n",
      "\n",
      "['Filings2014', 'Filings2015', 'Filings2016', 'Filings2017', 'Filings2011', 'Filings2012', 'Filings2013', 'Filings2018']\n"
     ]
    }
   ],
   "source": [
    "print len(data), '\\n'\n",
    "print data.keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Let's use again our python *list* containing all index file years."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]\n"
     ]
    }
   ],
   "source": [
    "year_list = []\n",
    "for year in range(2011, 2019, 1):\n",
    "    year_list.append(year)\n",
    "print year_list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>In part 1 we create our python *dictionary* (which we called *data*) that holds each year's worth of filings. In the following block of code we begin a *for loop*, looping over each year of our list to in turn access the respective annual key in our dictionary. \n",
    "\n",
    "If you are unfamiliar with Python, each year has its own *key* in the *data* dictionary; for instance the 2011 data will be nested under the  <code>data['Filings2011']</code> key. We use the *year* values in the python list we have just created to access each dictionary key in turn. The trick is the <code>%s</code> string formatting placeholder. The <code>%s</code> signifies a variable, and the <code>year</code> following the percentage sign tells the code which value to use for this variable. Within the context of a <code>for loop</code>, we thus have code that will create each of our keys in turn. \n",
    "\n",
    "In the final line of code here we access the key in each loop and print out the number of filings for that year. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# of filings in 2011 : 203075\n",
      "# of filings in 2012 : 261622\n",
      "# of filings in 2013 : 261449\n",
      "# of filings in 2014 : 387529\n",
      "# of filings in 2015 : 261034\n",
      "# of filings in 2016 : 378420\n",
      "# of filings in 2017 : 489013\n",
      "# of filings in 2018 : 131168\n"
     ]
    }
   ],
   "source": [
    "for year in year_list:\n",
    "    print '# of filings in', year, ':', len(data['Filings%s' % year])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>To verify the data we could count the number of total entries the long way:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2373310"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "203075+261622+261449+387529+261034+378420+489013+131168"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Or with a shortcut:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2373310"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sum(len(v) for v in data.itervalues())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read JSON file into MongoDB database\n",
    "What we will do now is insert our data into the MongoDB database. Because the data are nested in the 8 annual dictionary keys, we will loop over each of those keys and insert each year individually. The <code>insert_many</code> takes care of the data insertion. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# of filings in database: 0\n",
      "# of filings in 2011 added to database: 203075 \n",
      "\n",
      "# of filings in database: 203075\n",
      "# of filings in 2012 added to database: 261622 \n",
      "\n",
      "# of filings in database: 464697\n",
      "# of filings in 2013 added to database: 261449 \n",
      "\n",
      "# of filings in database: 726146\n",
      "# of filings in 2014 added to database: 387529 \n",
      "\n",
      "# of filings in database: 1113675\n",
      "# of filings in 2015 added to database: 261034 \n",
      "\n",
      "# of filings in database: 1374709\n",
      "# of filings in 2016 added to database: 378420 \n",
      "\n",
      "# of filings in database: 1753129\n",
      "# of filings in 2017 added to database: 489013 \n",
      "\n",
      "# of filings in database: 2242142\n",
      "# of filings in 2018 added to database: 131168 \n",
      "\n",
      "Total # of filings in database: 2373310\n"
     ]
    }
   ],
   "source": [
    "for year in year_list:\n",
    "    print '# of filings in database:', file_list.count()\n",
    "    file_list.insert_many(data['Filings%s' % year])\n",
    "    print '# of filings in', year, 'added to database:', len(data['Filings%s' % year]), '\\n'\n",
    "\n",
    "print 'Total # of filings in database:', file_list.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br><br>We now have a MongoDB database with one collection (or table) that holds 2,373,310 filings. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2373310"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "file_list.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>We can inspect the data by checking the first two filings in the database:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{u'OrganizationName': u'ANGELUS INC', u'ObjectId': u'201103169349300325', u'URL': u'https://s3.amazonaws.com/irs-form-990/201103169349300325_public.xml', u'SubmittedOn': u'2011-11-30', u'DLN': u'93493316003251', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201009', u'FormType': u'990', u'_id': ObjectId('5ad8e83235fd3fae98d5e53b'), u'EIN': u'591971002'} \n",
      "\n",
      "{u'OrganizationName': u'TOUCH-STONE SOLUTIONS INC', u'ObjectId': u'201113139349301231', u'URL': u'https://s3.amazonaws.com/irs-form-990/201113139349301231_public.xml', u'SubmittedOn': u'2011-11-30', u'DLN': u'93493313012311', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201106', u'FormType': u'990', u'_id': ObjectId('5ad8e83235fd3fae98d5e53c'), u'EIN': u'251713602'} \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for user in file_list.find()[:2]:\n",
    "    print user, '\\n'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's also get the frequency counts for the various *FormType*s. We see the frequencies for 990PF, 990EZ, and 990 filings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{u'_id': u'990PF', u'count': 318823},\n",
       " {u'_id': u'990EZ', u'count': 715541},\n",
       " {u'_id': u'990', u'count': 1338946}]"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from bson.son import SON\n",
    "pipeline = [ {\"$group\": {\"_id\": \"$FormType\", \"count\": {\"$sum\": 1}}} ]\n",
    "list(file_list.aggregate(pipeline))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Note that we now only have a database containing an index of the 2,373,310 990 filings. In order to actually use the 990 data, we will first have to download the filings listed in this database as well as create a data dictionary that can map out the variables we're interested in. The next set of tutorials will cover those steps. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  },
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "navigate_num": "#000000",
    "navigate_text": "#333333",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700",
    "sidebar_border": "#EEEEEE",
    "wrapper_background": "#FFFFFF"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "279px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 4,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false,
   "widenNotebook": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
 }