Skip to content

Instantly share code, notes, and snippets.

@pybokeh
Created April 15, 2016 12:04
Show Gist options
  • Select an option

  • Save pybokeh/26e4ebb552f25af06e8a9f21819ed02c to your computer and use it in GitHub Desktop.

Select an option

Save pybokeh/26e4ebb552f25af06e8a9f21819ed02c to your computer and use it in GitHub Desktop.
sqlite csv example
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center><h1>Creating a sqlite table from a large CSV file</h1></center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This example will show you how to create a sqlite table consisting of customer complaints filed with NHTSA (U.S. National Highway Traffic Safety Administration). The metadata on NHTSA's csv file (data about the data), can be viewed [here](http://www-odi.nhtsa.dot.gov/downloads/folders/Complaints/CMPL.txt)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This example assumes that a nhtsa.db sqlite database was already created. At terminal enter: sqlite3 nhtsa.db"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download the zip file in memory, then extract"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false,
"run_control": {
"read_only": true
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"zip download and extraction complete\n"
]
}
],
"source": [
"from io import BytesIO\n",
"from zipfile import ZipFile\n",
"from urllib import request\n",
"\n",
"url = request.urlopen('http://www-odi.nhtsa.dot.gov/downloads/folders/Complaints/FLAT_CMPL.zip')\n",
"zipfile_in_memory = ZipFile(BytesIO(url.read()))\n",
"zipfile_in_memory.extractall(r'D:\\temp')\n",
"zipfile_in_memory.close()\n",
"print(\"zip download and extraction complete\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Normal imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true,
"run_control": {
"read_only": true
}
},
"outputs": [],
"source": [
"import sqlite3\n",
"import pandas as pd\n",
"import datetime as dt\n",
"pd.set_option(\"display.max_rows\",1000)\n",
"pd.set_option(\"display.max_columns\",50)\n",
"pd.set_option('max_colwidth',40)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Since the flat file doesn't contain column headers, had to create one"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true,
"run_control": {
"read_only": true
}
},
"outputs": [],
"source": [
"columns = [\n",
" 'CMPLID',\n",
" 'ODINO',\n",
" 'MFR_NAME',\n",
" 'MAKETXT',\n",
" 'MODELTXT',\n",
" 'YEARTXT',\n",
" 'CRASH',\n",
" 'FAILDATE',\n",
" 'FIRE',\n",
" 'INJURED',\n",
" 'DEATHS',\n",
" 'COMPDESC',\n",
" 'CITY',\n",
" 'STATE',\n",
" 'VIN',\n",
" 'DATEA',\n",
" 'LDATE',\n",
" 'MILES',\n",
" 'OCCURENCES',\n",
" 'CDESCR',\n",
" 'CMPL_TYPE',\n",
" 'POLICE_RPT_YN',\n",
" 'PURCH_DT',\n",
" 'ORIG_OWNER_YN',\n",
" 'ANTI_BRAKES_YN',\n",
" 'CRUISE_CONT_YN',\n",
" 'NUM_CYLS',\n",
" 'DRIVE_TRAIN',\n",
" 'FUEL_SYS',\n",
" 'FUEL_TYPE',\n",
" 'TRANS_TYPE',\n",
" 'VEH_SPEED',\n",
" 'DOT',\n",
" 'TIRE_SIZE',\n",
" 'LOC_OF_TIRE',\n",
" 'TIRE_FAIL_TYPE',\n",
" 'ORIG_EQUIP_YN',\n",
" 'MANUF_DT',\n",
" 'SEAT_TYPE',\n",
" 'RESTRAINT_TYPE',\n",
" 'DEALER_NAME',\n",
" 'DEALER_TEL',\n",
" 'DEALER_CITY',\n",
" 'DEALER_STATE',\n",
" 'DEALER_ZIP',\n",
" 'PROD_TYPE',\n",
" 'REPAIRED_YN',\n",
" 'MEDICAL_ATTN',\n",
" 'VEHICLES_TOWED_YN'\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Connect to the sqlite3 database and read in the flat file in chunks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This will create/populate the 'complaints' table in the nhtsa.db sqlite database. This technique is essentially processing 20K rows at a time and appending them to the complaints table. Since we are processing 20K rows at a time, this technique will work with large csv files (larger than RAM, but fits on hard drive). Adjust the chunk size depending on your computer's hardware."
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false,
"run_control": {
"read_only": true
},
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"395 seconds: completed 1280000 rows"
]
}
],
"source": [
"conn = sqlite3.connect(r'D:\\temp\\nhtsa.db')\n",
"\n",
"start = dt.datetime.now()\n",
"chunksize = 20000\n",
"j = 0\n",
"\n",
"for df in pd.read_csv(r'D:\\temp\\FLAT_CMPL.txt', names=columns, dtype=object, chunksize=chunksize, \n",
" delimiter='\\t', iterator=True, encoding='ISO-8859-1', error_bad_lines=False): \n",
" j+=1\n",
" # To print on same line, use '\\r' and end='' option with the print function\n",
" print('\\r'+'{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j*chunksize),end='')\n",
"\n",
" df.to_sql('complaints', conn, flavor='sqlite', if_exists='append', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Some basic querying against the sqlite3 database"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, I am looking for \"serious\" complaints from Ford vehicles that were filed in 2016."
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"collapsed": false,
"run_control": {
"read_only": true
}
},
"outputs": [],
"source": [
"sample = pd.read_sql_query('SELECT MFR_NAME, MAKETXT, MODELTXT, YEARTXT, CRASH, FIRE, INJURED, DEATHS, '\n",
" 'COMPDESC, MILES, LDATE, OCCURENCES, CDESCR '\n",
" \n",
" 'FROM complaints '\n",
" \n",
" 'WHERE '\n",
" \"LDATE like '2016%' \"\n",
" \"AND MAKETXT IN('FORD') \"\n",
" \"AND (CRASH = 'Y' \"\n",
" \"OR FIRE = 'Y' \"\n",
" \"OR INJURED = 'Y' \"\n",
" \"OR DEATHS = 'Y' \"\n",
" \"OR MEDICAL_ATTN = 'Y' \"\n",
" \"OR VEHICLES_TOWED_YN = 'Y') \"\n",
" 'limit 5', conn)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>MFR_NAME</th>\n",
" <th>MAKETXT</th>\n",
" <th>MODELTXT</th>\n",
" <th>YEARTXT</th>\n",
" <th>CRASH</th>\n",
" <th>FIRE</th>\n",
" <th>INJURED</th>\n",
" <th>DEATHS</th>\n",
" <th>COMPDESC</th>\n",
" <th>MILES</th>\n",
" <th>LDATE</th>\n",
" <th>OCCURENCES</th>\n",
" <th>CDESCR</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Ford Motor Company</td>\n",
" <td>FORD</td>\n",
" <td>FIVE HUNDRED</td>\n",
" <td>2005</td>\n",
" <td>Y</td>\n",
" <td>N</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>AIR BAGS</td>\n",
" <td>55200</td>\n",
" <td>20160104</td>\n",
" <td>None</td>\n",
" <td>THE VEHICLE WAS IN A FRONT END COLLI...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Ford Motor Company</td>\n",
" <td>FORD</td>\n",
" <td>FUSION</td>\n",
" <td>2015</td>\n",
" <td>Y</td>\n",
" <td>N</td>\n",
" <td>1</td>\n",
" <td>None</td>\n",
" <td>SERVICE BRAKES</td>\n",
" <td>30000</td>\n",
" <td>20160104</td>\n",
" <td>None</td>\n",
" <td>TL* THE CONTACT OWNS A 2015 FORD FUS...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Ford Motor Company</td>\n",
" <td>FORD</td>\n",
" <td>FUSION</td>\n",
" <td>2015</td>\n",
" <td>Y</td>\n",
" <td>N</td>\n",
" <td>1</td>\n",
" <td>None</td>\n",
" <td>AIR BAGS</td>\n",
" <td>30000</td>\n",
" <td>20160104</td>\n",
" <td>None</td>\n",
" <td>TL* THE CONTACT OWNS A 2015 FORD FUS...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Ford Motor Company</td>\n",
" <td>FORD</td>\n",
" <td>FOCUS</td>\n",
" <td>2010</td>\n",
" <td>Y</td>\n",
" <td>N</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>SUSPENSION</td>\n",
" <td>87500</td>\n",
" <td>20160104</td>\n",
" <td>None</td>\n",
" <td>I HAVE OWNED THE CAR SINCE IT WAS NE...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Ford Motor Company</td>\n",
" <td>FORD</td>\n",
" <td>WINDSTAR</td>\n",
" <td>2004</td>\n",
" <td>N</td>\n",
" <td>Y</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>ENGINE</td>\n",
" <td>118000</td>\n",
" <td>20160105</td>\n",
" <td>None</td>\n",
" <td>TL* THE CONTACT OWNED A 2004 FORD WI...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" MFR_NAME MAKETXT MODELTXT YEARTXT CRASH FIRE INJURED DEATHS \\\n",
"0 Ford Motor Company FORD FIVE HUNDRED 2005 Y N 2 0 \n",
"1 Ford Motor Company FORD FUSION 2015 Y N 1 None \n",
"2 Ford Motor Company FORD FUSION 2015 Y N 1 None \n",
"3 Ford Motor Company FORD FOCUS 2010 Y N None None \n",
"4 Ford Motor Company FORD WINDSTAR 2004 N Y None None \n",
"\n",
" COMPDESC MILES LDATE OCCURENCES \\\n",
"0 AIR BAGS 55200 20160104 None \n",
"1 SERVICE BRAKES 30000 20160104 None \n",
"2 AIR BAGS 30000 20160104 None \n",
"3 SUSPENSION 87500 20160104 None \n",
"4 ENGINE 118000 20160105 None \n",
"\n",
" CDESCR \n",
"0 THE VEHICLE WAS IN A FRONT END COLLI... \n",
"1 TL* THE CONTACT OWNS A 2015 FORD FUS... \n",
"2 TL* THE CONTACT OWNS A 2015 FORD FUS... \n",
"3 I HAVE OWNED THE CAR SINCE IT WAS NE... \n",
"4 TL* THE CONTACT OWNED A 2004 FORD WI... "
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Since the output from a pandas data frame is limiting, created this simple routine to better view the customer's complaint"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"collapsed": false,
"run_control": {
"read_only": true
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2005 FORD FIVE HUNDRED==> THE VEHICLE WAS IN A FRONT END COLLISION WITH THE FRONT DRIVER SIDE PRIMARILY DAMAGED. THE VEHICLE WAS TRAVELING DOWN A CITY STREET AT ROUGHLY 35 TO 40 MPH UPHILL. THE VEHICLE REAR ENDED A STATIONARY VEHICLE THAT WAS ATTEMPTING TO TURN. THE AIRBAG SENSOR WAS LOCATED ON THE VEHICLE AND WAS HIT, HOWEVER, THE AIRBAGS DID NOT DEPLOY. CAUSING DAMAGE TO THE DRIVER IN THE ACCIDENT. THEY CLEARLY SHOULD HAVE DEPLOYED BY THE APPEARANCE OF THE SENSOR.\n",
"*****************************************************************************************\n",
"2015 FORD FUSION==> TL* THE CONTACT OWNS A 2015 FORD FUSION. THE CONTACT STATED THAT WHILE MAKING A TURN, THE BRAKES PULSATED WHEN THE VEHICLE WAS STOPPED. IN ADDITION, THE CONTACT STATED THAT WHILE ATTEMPTING TO PARK, THE BRAKES VIOLENTLY PULSATED AND RESULTED IN AN UNINTENDED ACCELERATION. AS A RESULT, THE CONTACT CRASHED INTO A BUILDING. THE AIR BAGS FAILED TO DEPLOY AND THE SEAT BELT FAILED TO RETRACT. A POLICE REPORT WAS FILED. THERE WAS ONE UNKNOWN INJURY REPORTED THAT DID NOT REQUIRE MEDICAL ATTENTION. THE VEHICLE WAS NOT DIAGNOSED OR REPAIRED. THE MANUFACTURER WAS NOT MADE AWARE OF THE FAILURE. THE FAILURE MILEAGE WAS 30,000. \n",
"*****************************************************************************************\n",
"2015 FORD FUSION==> TL* THE CONTACT OWNS A 2015 FORD FUSION. THE CONTACT STATED THAT WHILE MAKING A TURN, THE BRAKES PULSATED WHEN THE VEHICLE WAS STOPPED. IN ADDITION, THE CONTACT STATED THAT WHILE ATTEMPTING TO PARK, THE BRAKES VIOLENTLY PULSATED AND RESULTED IN AN UNINTENDED ACCELERATION. AS A RESULT, THE CONTACT CRASHED INTO A BUILDING. THE AIR BAGS FAILED TO DEPLOY AND THE SEAT BELT FAILED TO RETRACT. A POLICE REPORT WAS FILED. THERE WAS ONE UNKNOWN INJURY REPORTED THAT DID NOT REQUIRE MEDICAL ATTENTION. THE VEHICLE WAS NOT DIAGNOSED OR REPAIRED. THE MANUFACTURER WAS NOT MADE AWARE OF THE FAILURE. THE FAILURE MILEAGE WAS 30,000. \n",
"*****************************************************************************************\n",
"2010 FORD FOCUS==> I HAVE OWNED THE CAR SINCE IT WAS NEW NOW HAS 88K MILES I HAVE PUT LITERALLY 4 SET OF TIRES ON THIS AND HAVE HAD A REAR ALIGNMENT KIT DONE TO IT .THIS CAR SWAYS REALLY BAD AT SPEEDS OF OVER 65 MPH AND WHEN THERE IS HARDLY ANY WIND AT ALL . VERY DANGEROUS TO DRIVE IN THE SNOW OR RAIN . THE CAR CAUSED ME TO HAVE A ACCIDENT THAT COST MY INSURANCE COMPANY OVER 2000 TO REPAIR DUE TO THE SWAYING OF SPEEDS OVER 65 MPH . I HAVE A HORRIBLE THUMPING NOISE COMING FROM THE FRONT END SUSPENSION AREA WHICH I THOUGHT IT MIGHT BE I HAD BAD TIRES SO REPLACED TO BAD THUMPING SO REPLACED WHEELS STILL BAD AND WHEEL BEARINGS AND STILL BAD THUMPING IDK WHAT TO DO THIS IS SO ANNOYING I WANT FORD TO PAY FOR ALL OF THIS AND TO RECALL THESE CARS THEY ARE VERY DANGEROUS\n",
"*****************************************************************************************\n",
"2004 FORD WINDSTAR==> TL* THE CONTACT OWNED A 2004 FORD WINDSTAR. THE CONTACT STATED THAT THE ENGINE CAUGHT ON FIRE. THE CONTACT STATED THAT THE FAILURE OCCURRED AFTER THE VEHICLE RECEIVED ROUTINE MAINTENANCE WHERE THE BATTERY, THE WATER PUMP, AND THE ENGINE BELT WERE REPLACED. THE FIRE DEPARTMENT WAS PRESENT. THE MANUFACTURER WAS NOT NOTIFIED OF THE FAILURE. THE VEHICLE WAS DESTROYED. THE FAILURE MILEAGE WAS 118,000. \n",
"*****************************************************************************************\n"
]
}
],
"source": [
"for year, make, model, cdescr in sample[['YEARTXT','MAKETXT','MODELTXT','CDESCR']].values:\n",
" print(year + ' ' + make + ' ' + model + '==> ' + cdescr)\n",
" print(\"*****************************************************************************************\")"
]
}
],
"metadata": {
"hide_input": false,
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.4"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment