How to download, import, and analyze San Francisco restaurant inspection data using SQLite3 and csvkit from the command-line.
A quick example of doing data wrangling from the command-line, as well as getting to know one of San Francisco's data sets: the San Francisco restaurant inspections, courtesy of the SF Department of Public Health. I don't normally do database work from the command-line, but importing bulk data into SQLite is pretty frustrating using the available GUIs or just the shell.
So thank goodness for Christopher Groskopf's csvkit, a suite of Unix-like tools that use Python to robustly handle CSV files. There are a lot of great tools in csvkit, but for this gist, I just use csvsql, which can parse a CSV and turn it into properly-flavored SQL to pass directly into your database app of choice.
This gist is basically one giant Bash script (OS X 10.10). csvsql has a convenient mode that will read in a CSV file and generate a CREATE TABLE statement:
$ csvsql some.csv --dialect sqlite --tables some_table_name
I ran that to generate the initial CREATE TABLE commands, which I then adjusted to my own liking. Since the restaurant inspection data is not that big, we don't have to worry much about optimization; we could probably get by without even indexing it.
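For illustration, the header-driven CREATE TABLE generation that csvsql performs can be sketched with Python's standard library. Note this is my own simplification: the sample CSV is made up, and every column is declared TEXT, whereas csvsql actually sniffs per-column types.

```python
import csv
import io
import sqlite3

# A tiny stand-in for one of the inspection CSVs (hypothetical sample data)
sample = """business_id,name,Score
10,"Tiramisu Kitchen",92
"""

# Read just the header row to get the column names
reader = csv.reader(io.StringIO(sample))
columns = next(reader)

# Naive schema generation: every column becomes TEXT
# (csvsql infers real types; this is just the shape of the idea)
create_stmt = "CREATE TABLE inspections ({});".format(
    ", ".join('"{}" TEXT'.format(c) for c in columns)
)
print(create_stmt)

# Verify the generated SQL is valid by running it in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute(create_stmt)
```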
The SF restaurant inspection data has a landing page here and a fairly user-friendly search page.
This is what you see if you do a search for the inspection records for the In-N-Out Burger:
(Thankfully, it's been doing fine...whew!)
The actual raw data is in an external directory:
That zip file holds 3 files:
- businesses_plus.csv - unique identifiers, names, addresses, geospatial coordinates, and other data related to each individual business
- inspections_plus.csv - contains a row for each inspection, including the business_id of the inspected business, the score it received (if applicable), the date of the inspection, and the reason for the inspection
- violations_plus.csv - contains a row for every violation, including the severity and a short description of the violation. It contains business_id and an inspection date which can presumably be used to join against the inspections data.
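Since I'm only presuming that business_id plus the inspection date is the join key between inspections and violations, here's a quick sanity check of that relationship using Python's stdlib sqlite3 and a couple of made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inspections (business_id INTEGER, Score INTEGER, date DATE, type VARCHAR);
CREATE TABLE violations (business_id VARCHAR, date DATE, risk_category VARCHAR, description VARCHAR);

-- Hypothetical rows: one inspection, with one violation recorded the same day
INSERT INTO inspections VALUES (10, 88, '2014-01-03', 'routine');
INSERT INTO violations VALUES (10, '2014-01-03', 'High Risk', 'Improper cooling methods');
""")

# Join violations to inspections on the presumed composite key
rows = conn.execute("""
    SELECT i.Score, v.description
    FROM inspections i
    JOIN violations v
      ON i.business_id = v.business_id AND i.date = v.date
""").fetchall()
print(rows)
```

(SQLite's type affinity rules quietly handle the INTEGER-vs-VARCHAR mismatch on business_id here, which mirrors the mismatch in the real schemas below.)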
##############
# Setup your workspace
BASE_DIR=/tmp/whatever
DB_PATH=$BASE_DIR/sf_food_program_db.sqlite
ZIP_DIR=/tmp/sffoodzips
mkdir -p $BASE_DIR
mkdir -p $ZIP_DIR
################
# Download the files into $ZIP_DIR
curl https://extxfer.sfdph.org/food/SFFoodProgram_Complete_Data.zip \
-o $ZIP_DIR/SFFoodProgram_Complete_Data.zip
unzip $ZIP_DIR/SFFoodProgram_Complete_Data.zip -d $ZIP_DIR
At this point, we just pipe raw SQL into the sqlite3 program. I previously used csvsql with the --dialect sqlite option to generate the CREATE statements.
#################
# Build the database
# First remove the existing database file, if any
rm -f $DB_PATH
## Create the businesses table
echo "CREATE TABLE businesses (
business_id INTEGER NOT NULL,
name VARCHAR,
address VARCHAR,
city VARCHAR,
postal_code INTEGER,
latitude FLOAT,
longitude FLOAT,
phone_number BIGINT,
TaxCode VARCHAR,
business_certificate INTEGER,
application_date DATE,
owner_name VARCHAR,
owner_address VARCHAR,
owner_city VARCHAR,
owner_state VARCHAR,
owner_zip VARCHAR
);" | sqlite3 $DB_PATH
## Create the inspections table
echo "CREATE TABLE inspections (
business_id INTEGER NOT NULL,
Score INTEGER,
date DATE,
type VARCHAR
);" | sqlite3 $DB_PATH
## Create the violations table
echo "CREATE TABLE violations (
business_id VARCHAR NOT NULL,
date DATE NOT NULL,
ViolationTypeID INTEGER NOT NULL,
risk_category VARCHAR(20),
description VARCHAR(100)
);" | sqlite3 $DB_PATH
This is where csvsql really helps. SQLite3 (as far as I know) does not have a robust CSV parser; it can't handle quoted fields and embedded line breaks. The csvsql command reads a CSV and converts it into SQL INSERT statements.
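To see why a robust parser matters: a quoted CSV field can legally contain commas and even line breaks, which a naive split would mangle. Python's csv module (which csvkit builds on) handles this correctly; the sample record below is made up:

```python
import csv
import io

# One record whose name field contains a comma AND an embedded newline
raw = 'business_id,name,address\n10,"Soup, Salad\n& More",123 Main St\n'

# Naive approach: splitting on newlines breaks the record in two
naive_rows = raw.strip().split("\n")
print(len(naive_rows))  # 3 lines: header plus two fragments of one record

# A real CSV parser sees the header plus exactly one intact record
rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])
```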
#####################
# Insert the data
## The businesses data
## Note that it's encoded as `windows-1252`, hence the `iconv` call to
## convert it to UTF-8
iconv -f windows-1252 -t UTF-8 \
< $ZIP_DIR/businesses_plus.csv |
csvsql --db sqlite:///$DB_PATH --insert --no-create \
--tables businesses
## Insert the inspections data
csvsql $ZIP_DIR/inspections_plus.csv \
--db sqlite:///$DB_PATH --insert --no-create \
--tables inspections
## The dates come as `YYYYMMDD`; this update statement
## changes it to `YYYY-MM-DD`
echo "UPDATE inspections
SET date = SUBSTR(date, 1, 4) || '-' ||
SUBSTR(date, 5, 2) || '-' ||
SUBSTR(date, 7, 2);
" | sqlite3 $DB_PATH
## Insert the violations data
## (this will take a while)
csvsql $ZIP_DIR/violations_plus.csv \
--db sqlite:///$DB_PATH --insert --no-create --tables violations
## The dates come as `YYYYMMDD`; this update statement
## changes it to `YYYY-MM-DD`
echo "UPDATE violations
SET date = SUBSTR(date, 1, 4) || '-' ||
SUBSTR(date, 5, 2) || '-' ||
SUBSTR(date, 7, 2);
" | sqlite3 $DB_PATH
This is just more piping of raw SQL.
#############
# Add indexes to the table
echo "
CREATE INDEX business_id_o_businesses_idx ON businesses(business_id);
CREATE INDEX name_o_businesses_idx ON businesses(name);
CREATE INDEX phone_o_businesses_idx ON businesses(phone_number);
CREATE INDEX business_id_date_o_inspections_idx ON inspections(business_id, date);
CREATE INDEX date_o_inspections_idx ON inspections(date);
CREATE INDEX business_id_date_o_violations_idx ON violations(business_id, date);
CREATE INDEX date_o_violations_idx ON violations(date);
CREATE INDEX violation_type_id_o_violations_idx ON violations(ViolationTypeID);
" | sqlite3 $DB_PATH
When everything is done, we can write a query to test out the data.
The following query asks:
- Find every business with "Starbucks" in its name
- that has had an inspection with a date since 2014
- and that inspection resulted in a score below 90 points
- and that inspection found at least one "High Risk" violation
Then, for every violation that is found above, display:
- The name of the business
- The address of the business
- The score of the inspection in which the violation was found
- The date of that inspection
- The description of the violation
And list the violations in ascending order of the score of its respective inspection.
Passing that query via shell:
echo "
.headers on
SELECT businesses.name AS biz_name,
businesses.address AS biz_address,
inspections.Score AS score,
CAST(inspections.date AS CHAR) AS inspection_date,
violations.description AS violation_description
FROM businesses
INNER JOIN inspections
ON businesses.business_id = inspections.business_id
INNER JOIN violations
ON inspections.business_id = violations.business_id
AND inspections.date = violations.date
WHERE inspection_date > '2014'
AND biz_name LIKE '%STARBUCKS%'
AND score < 90
AND violations.risk_category = 'High Risk'
ORDER BY score;
" | sqlite3 $DB_PATH
The result:
| biz_name | biz_address | score | inspection_date | violation_description |
|---|---|---|---|---|
| Starbucks Coffee Co | 0264 Kearny St | 85 | 2014-01-03 | Improper cooling methods |
| Starbucks Coffee | 201 Powell St | 87 | 2015-03-17 | Unclean hands or improper use of gloves |
| STARBUCKS | 350 RHODE ISLAND | 89 | 2014-03-19 | No hot water or running water |
| STARBUCKS COFFEE #9219 | 333 MARKET St | 89 | 2014-10-03 | High risk food holding temperature |
| STARBUCKS COFFEE CO | 1231 MARKET St | 89 | 2014-12-30 | Unclean hands or improper use of gloves |
Note: I do not guarantee the accuracy of these results; for example, I didn't check whether the unique IDs were really kept unique, and I didn't manually double-check every row against the online database. But it seems to check out for a couple of the places I eyeballed. It's also worth noting that the vast majority of Starbucks locations did very well...a handful of mishaps in a year and a half isn't too bad at all.
For example, here's the entry for Starbucks on 333 Market St, with the October 3, 2014 violation of "High risk food holding temperature":