How to download, import, and analyze San Francisco restaurant inspection data using SQLite3 and csvkit from the command-line.
A quick example of doing data wrangling from the command-line, as well as getting to know one of San Francisco's data sets: the San Francisco restaurant inspections, courtesy of the SF Department of Public Health. I don't normally do database work from the command-line, but importing bulk data into SQLite is pretty frustrating using the available GUIs or just the shell.
So thank goodness for Christopher Groskopf's csvkit, a suite of Unix-like tools that use Python to robustly handle CSV files. There are a lot of great tools in csvkit, but for this gist, I just use csvsql, which can parse a CSV and turn it into properly-flavored SQL to pass directly into your database app of choice.
This gist is basically one giant Bash script (OS X 10.10). csvsql has a convenient mode that will read in a CSV file and generate a CREATE TABLE statement:
$ csvsql some.csv --dialect sqlite --tables some_table_name
I ran that to generate the initial CREATE TABLE commands, which I then adjusted to my own liking. Since the restaurant inspection data is not that big, we don't have to worry much about optimization; we could probably get by without even indexing it.
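For illustration, the header-driven CREATE TABLE generation that csvsql performs can be sketched with Python's standard library. Note this is my own simplification: the sample CSV is made up, and every column is declared TEXT, whereas csvsql actually sniffs per-column types.

```python
import csv
import io
import sqlite3

# A tiny stand-in for one of the inspection CSVs (hypothetical sample data)
sample = """business_id,name,Score
10,"Tiramisu Kitchen",92
"""

# Read just the header row to get the column names
reader = csv.reader(io.StringIO(sample))
columns = next(reader)

# Naive schema generation: every column becomes TEXT
# (csvsql infers real types; this is just the shape of the idea)
create_stmt = "CREATE TABLE inspections ({});".format(
    ", ".join('"{}" TEXT'.format(c) for c in columns)
)
print(create_stmt)

# Verify the generated SQL is valid by running it in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute(create_stmt)
```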
The SF restaurant inspection data has a landing page here and a fairly user-friendly search page.
This is what you see if you do a search for the inspection records for the In-N-Out Burger:
(Thankfully, it's been doing fine...whew!)
The actual raw data is in an external directory:
That zip file holds 3 files:
- businesses_plus.csv - unique identifiers, names, addresses, geospatial coordinates, and other data related to each individual business
- inspections_plus.csv - contains a row for each inspection, including the business_id of the inspected business, the score it received (if applicable), the date of the inspection, and the reason for the inspection
- violations_plus.csv - contains a row for every violation, including the severity and a short description of the violation. It contains business_id and an inspection date which can presumably be used to join against the inspections data.
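Since I'm only presuming that business_id plus the inspection date is the join key between inspections and violations, here's a quick sanity check of that relationship using Python's stdlib sqlite3 and a couple of made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inspections (business_id INTEGER, Score INTEGER, date DATE, type VARCHAR);
CREATE TABLE violations (business_id VARCHAR, date DATE, risk_category VARCHAR, description VARCHAR);

-- Hypothetical rows: one inspection, with one violation recorded the same day
INSERT INTO inspections VALUES (10, 88, '2014-01-03', 'routine');
INSERT INTO violations VALUES (10, '2014-01-03', 'High Risk', 'Improper cooling methods');
""")

# Join violations to inspections on the presumed composite key
rows = conn.execute("""
    SELECT i.Score, v.description
    FROM inspections i
    JOIN violations v
      ON i.business_id = v.business_id AND i.date = v.date
""").fetchall()
print(rows)
```

(SQLite's type affinity rules quietly handle the INTEGER-vs-VARCHAR mismatch on business_id here, which mirrors the mismatch in the real schemas below.)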
##############
# Setup your workspace
BASE_DIR=/tmp/whatever
DB_PATH=$BASE_DIR/sf_food_program_db.sqlite
ZIP_DIR=/tmp/sffoodzips
mkdir -p $BASE_DIR
mkdir -p $ZIP_DIR
################
# Download the files into $ZIP_DIR
curl https://extxfer.sfdph.org/food/SFFoodProgram_Complete_Data.zip \
-o $ZIP_DIR/SFFoodProgram_Complete_Data.zip
unzip $ZIP_DIR/SFFoodProgram_Complete_Data.zip -d $ZIP_DIR
At this point, we just pipe raw SQL into the sqlite3 program. I previously used csvsql with the --dialect sqlite option to generate the CREATE statements.
#################
# Build the database
# First remove the existing database file, if any
rm -f $DB_PATH
## Create the businesses table
echo "CREATE TABLE businesses (
business_id INTEGER NOT NULL,
name VARCHAR,
address VARCHAR,
city VARCHAR,
postal_code INTEGER,
latitude FLOAT,
longitude FLOAT,
phone_number BIGINT,
TaxCode VARCHAR,
business_certificate INTEGER,
application_date DATE,
owner_name VARCHAR,
owner_address VARCHAR,
owner_city VARCHAR,
owner_state VARCHAR,
owner_zip VARCHAR
);" | sqlite3 $DB_PATH
## Create the inspections table
echo "CREATE TABLE inspections (
business_id INTEGER NOT NULL,
Score INTEGER,
date DATE,
type VARCHAR
);" | sqlite3 $DB_PATH
## Create the violations table
echo "CREATE TABLE violations (
business_id VARCHAR NOT NULL,
date DATE NOT NULL,
ViolationTypeID INTEGER NOT NULL,
risk_category VARCHAR(20),
description VARCHAR(100)
);" | sqlite3 $DB_PATH
This is where csvsql really helps. SQLite3 (as far as I know) does not have a robust CSV parser; it can't handle quoted fields and embedded line breaks. The csvsql command reads a CSV and converts it into SQL INSERT statements.
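To see why a robust parser matters: a quoted CSV field can legally contain commas and even line breaks, which a naive split would mangle. Python's csv module (which csvkit builds on) handles this correctly; the sample record below is made up:

```python
import csv
import io

# One record whose name field contains a comma AND an embedded newline
raw = 'business_id,name,address\n10,"Soup, Salad\n& More",123 Main St\n'

# Naive approach: splitting on newlines breaks the record in two
naive_rows = raw.strip().split("\n")
print(len(naive_rows))  # 3 lines: header plus two fragments of one record

# A real CSV parser sees the header plus exactly one intact record
rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])
```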
#####################
# Insert the data
## The businesses data
## Note that it's encoded as `windows-1252`, hence the `iconv` call to
## convert it to UTF-8
iconv -f windows-1252 -t UTF-8 \
< $ZIP_DIR/businesses_plus.csv |
csvsql --db sqlite:///$DB_PATH --insert --no-create \
--tables businesses
## Insert the inspections data
csvsql $ZIP_DIR/inspections_plus.csv \
--db sqlite:///$DB_PATH --insert --no-create \
--tables inspections
## The dates come as `YYYYMMDD`; this update statement
## changes it to `YYYY-MM-DD`
echo "UPDATE inspections
SET date = SUBSTR(date, 1, 4) || '-' ||
SUBSTR(date, 5, 2) || '-' ||
SUBSTR(date, 7, 2);
" | sqlite3 $DB_PATH
## Insert the violations data
## (this will take a while)
csvsql $ZIP_DIR/violations_plus.csv \
--db sqlite:///$DB_PATH --insert --no-create --tables violations
## The dates come as `YYYYMMDD`; this update statement
## changes it to `YYYY-MM-DD`
echo "UPDATE violations
SET date = SUBSTR(date, 1, 4) || '-' ||
SUBSTR(date, 5, 2) || '-' ||
SUBSTR(date, 7, 2);
" | sqlite3 $DB_PATH
This is just more piping of raw SQL.
#############
# Add indexes to the table
echo "
CREATE INDEX business_id_o_businesses_idx ON businesses(business_id);
CREATE INDEX name_o_businesses_idx ON businesses(name);
CREATE INDEX phone_o_businesses_idx ON businesses(phone_number);
CREATE INDEX business_id_date_o_inspections_idx ON inspections(business_id, date);
CREATE INDEX date_o_inspections_idx ON inspections(date);
CREATE INDEX business_id_date_o_violations_idx ON violations(business_id, date);
CREATE INDEX date_o_violations_idx ON violations(date);
CREATE INDEX violation_type_id_o_violations_idx ON violations(ViolationTypeID);
" | sqlite3 $DB_PATH
When everything is done, we can write a query to test out the data.
The following query asks:
- Find every business with "Starbucks" in its name
- that has had an inspection with a date since 2014
- and that inspection resulted in a score below 90 points
- and that inspection found at least one "High Risk" violation
Then, for every violation that is found above, display:
- The name of the business
- The address of the business
- The score of the inspection in which the violation was found
- The date of that inspection
- The description of the violation
And list the violations in ascending order of the score of its respective inspection.
Passing that query via shell:
echo "
.headers on
SELECT businesses.name AS biz_name,
businesses.address AS biz_address,
inspections.Score AS score,
CAST(inspections.date AS CHAR) AS inspection_date,
violations.description AS violation_description
FROM businesses
INNER JOIN inspections
ON businesses.business_id = inspections.business_id
INNER JOIN violations
ON inspections.business_id = violations.business_id
AND inspections.date = violations.date
WHERE inspection_date > '2014'
AND biz_name LIKE '%STARBUCKS%'
AND score < 90
AND violations.risk_category = 'High Risk'
ORDER BY score;
" | sqlite3 $DB_PATH
The result:
| biz_name | biz_address | score | inspection_date | violation_description |
|---|---|---|---|---|
| Starbucks Coffee Co | 0264 Kearny St | 85 | 2014-01-03 | Improper cooling methods |
| Starbucks Coffee | 201 Powell St | 87 | 2015-03-17 | Unclean hands or improper use of gloves |
| STARBUCKS | 350 RHODE ISLAND | 89 | 2014-03-19 | No hot water or running water |
| STARBUCKS COFFEE #9219 | 333 MARKET St | 89 | 2014-10-03 | High risk food holding temperature |
| STARBUCKS COFFEE CO | 1231 MARKET St | 89 | 2014-12-30 | Unclean hands or improper use of gloves |
Note: I do not guarantee the accuracy of these results; for example, I didn't check whether the unique IDs were really kept unique, and I didn't manually double-check every row against the online database. But it seems to check out for a couple of the places I eyeballed. It's also worth noting that the vast majority of Starbucks locations did very well...a handful of mishaps in a year and a half isn't too bad at all.
For example, here's the entry for Starbucks on 333 Market St, with the October 3, 2014 violation of "High risk food holding temperature":