We want to determine the county (name and FIPS code) that corresponds to each row of this dataset: https://data.medicare.gov/Nursing-Home-Compare/Quality-Measures-Long-Stay/iqd3-nsf3
We will use CartoDB to add new columns that will contain the latitude, longitude, county name, and county FIPS code.
Since the original data includes multiple rows for each facility, and we are only interested in measure codes 411 and 415, we can delete all the other rows. This will result in a much smaller file that will upload to CartoDB much more quickly.
- Open the Quality_Measures_-_Long_Stay.csv file in Excel
- Click the "Data" tab, then "Filter"
- Click the menu at the top of the "Measure Code" column, select just the "411" and "415" values, and click "OK"
- Copy the filtered data by pressing CTRL-A (select all), CTRL-C (copy)
- Create a new sheet and paste the copied data
- Save the new sheet to a new file, longstay_filtered.csv
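If you would rather skip the Excel filtering, an alternative is to upload the full CSV and delete the unwanted rows in CartoDB. The following is only a sketch: the table name quality_measures_long_stay is whatever name CartoDB assigns on import, and if the measure code column comes in as text rather than a number, the values would need quotes.
-- Hypothetical alternative to the Excel filter: delete everything except codes 411 and 415
-- (table name assumed; adjust to whatever CartoDB called the imported table)
DELETE FROM quality_measures_long_stay
WHERE measure_code NOT IN (411, 415)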
The longstay dataset has various issues that we need to clean up. The location values have newline characters, which may cause problems for some software. Here's an example (there are invisible linebreak characters after each line):
701 MONROE STREET NW
RUSSELLVILLE, AL 35653
(34.514971, -87.736372)
We'll also want to convert the quarterly measure scores to numbers. (Currently, they are text values that include percent signs.)
The dataset already includes latitude/longitude coordinates, but these are jammed into the location column, appearing in parentheses after the full street address. In order to work with these coordinates as data, we will first need to extract them into separate columns. Finally, we will also extract the zipcodes to a new column, so that we can analyze the data by zipcode later on.
- Load the longstay_filtered.csv file into CartoDB
- Clean up those newlines in the location column by changing them all to spaces. Go to the SQL tab (on the right) and run this query:
UPDATE longstay_filtered SET location = regexp_replace(location, '\n', ' ', 'g')
- Remove the percent signs from the scores:
UPDATE longstay_filtered
SET
q1_measure_score = replace(q1_measure_score, '%', ''),
q2_measure_score = replace(q2_measure_score, '%', ''),
q3_measure_score = replace(q3_measure_score, '%', ''),
three_quarter_average = replace(three_quarter_average, '%', '')
- Convert each of those four columns to numeric values by clicking the column header > Change data type > number (a SQL alternative is sketched after this list)
- Run this SQL to add a new column _zip and extract the zipcode (if any) from the location column:
ALTER TABLE longstay_filtered ADD COLUMN _zip text;
UPDATE longstay_filtered SET _zip = (regexp_matches(location, ' \w{2} (\d{5}) ($|\()'))[1]
- Run this SQL to add new columns _lat and _lon and set them to the values (if any) found in the location column:
ALTER TABLE longstay_filtered ADD COLUMN _lat double precision;
ALTER TABLE longstay_filtered ADD COLUMN _lon double precision;
UPDATE longstay_filtered
SET
_lat = regexp_replace(location, '.*\((-?\d+\.\d+), ?(-?\d+\.\d+)\).*', '\1')::numeric,
_lon = regexp_replace(location, '.*\((-?\d+\.\d+), ?(-?\d+\.\d+)\).*', '\2')::numeric
WHERE position('(' in location) > 0
- Click the _lat column, select "Georeference..." and specify the longitude and latitude columns. This turns each row into a point on the map. You can switch to the "Map View" (button at top center of the page) to view these points on a map.
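If you prefer to stay in SQL, a sketch like the following should be equivalent to the Georeference wizard; it assumes the extracted _lat/_lon values are plain WGS84 coordinates, which is what CartoDB's the_geom column expects (CartoDB's CDB_LatLng(lat, lon) helper builds the same kind of point):
-- Sketch of a SQL equivalent to the Georeference wizard (assumes WGS84 coordinates)
UPDATE longstay_filtered
SET the_geom = ST_SetSRID(ST_MakePoint(_lon, _lat), 4326)
WHERE _lat IS NOT NULL AND _lon IS NOT NULL
Similarly, the "Change data type" step above can be done in SQL instead of through the column menu. This is only a sketch: the NULLIF guards are there because blank values would otherwise make the cast fail, and any non-numeric placeholder text left in the score columns would need to be cleared first.
-- Sketch of a SQL alternative to "Change data type > number" for the four score columns
ALTER TABLE longstay_filtered
ALTER COLUMN q1_measure_score TYPE numeric USING NULLIF(q1_measure_score, '')::numeric,
ALTER COLUMN q2_measure_score TYPE numeric USING NULLIF(q2_measure_score, '')::numeric,
ALTER COLUMN q3_measure_score TYPE numeric USING NULLIF(q3_measure_score, '')::numeric,
ALTER COLUMN three_quarter_average TYPE numeric USING NULLIF(three_quarter_average, '')::numeric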
Back in the "Data View", you may have noticed that there are some location values in the data table that lack point coordinates. It appears that at least some of these are associated with faulty addresses (like "2018 SALISBURY MD 21802"). Depending on the goals of the project, these rows might need to be corrected, or they could just be discarded from the analysis. To view a list of the rows lacking point locations, run this SQL:
SELECT DISTINCT federal_provider_number, provider_name, location
FROM longstay_filtered
WHERE the_geom IS NULL
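If you decide to simply discard these problem rows rather than fix their addresses, a one-line delete like this would do it (it is destructive, so consider exporting a copy of the table first):
-- Optional: drop the rows that could not be georeferenced
DELETE FROM longstay_filtered
WHERE the_geom IS NULL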
Load the 2015 Census county boundaries into CartoDB: http://www2.census.gov/geo/tiger/TIGER2015/COUNTY/tl_2015_us_county.zip
(NOTE: The following method was also tested using the 500k generalized boundaries, which is a much smaller dataset, but the boundaries are not quite as precise. Comparing the results between both sets of boundaries, 8 of the locations that were very close to the county boundary were identified as different counties, so it is probably best to use the more precise boundaries.)
- Run this SQL to add new columns _countyname and _countyfips, and calculate the values for each point:
ALTER TABLE longstay_filtered ADD COLUMN _countyname text;
ALTER TABLE longstay_filtered ADD COLUMN _countyfips text;
UPDATE longstay_filtered AS l
SET
_countyname = c.namelsad,
_countyfips = c.geoid
FROM tl_2015_us_county AS c
WHERE ST_Within(l.the_geom, c.the_geom)
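After running the update, it is worth checking for points that did not fall inside any county polygon (for example, coordinates that land just offshore); those rows will still have NULL county values:
-- Points that have a geometry but no matching county
SELECT federal_provider_number, provider_name, location
FROM longstay_filtered
WHERE the_geom IS NOT NULL AND _countyfips IS NULL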
To save a CSV file of the enhanced dataset with the new columns, click the "Edit" link at the top right of the CartoDB window and select "Export layer". Save it in CSV format as longstay_county.csv. (It may take a few moments before the download starts.)
When you open the .csv file, notice that the new columns (_lat, _lon, _countyname, _countyfips) are on the right side of the table. There are also two new columns (cartodb_id, the_geom) on the left side of the table -- these are specific to CartoDB and can be safely discarded.
We have a table (call it county_pac_snf.csv) of demographic and PAC/SNF data by county, and would like to join it to the longstay data using the county FIPS code. Unfortunately, the original Excel file containing the PAC/SNF data did not define the FIPS codes as text, so they get imported as numbers (losing their leading zeros in many cases). But we can fix this in CartoDB.
- First, import the county_pac_snf.csv table into CartoDB.
- In the data view, click the fips column > Change data type... > string
- Run this SQL query to restore the leading zeros:
UPDATE county_pac_snf
SET fips = lpad(fips, 5, '0')
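A quick way to confirm the padding worked is to look for codes that are still not exactly five characters long; this query should come back empty:
-- Sanity check: every FIPS code should now be five characters
SELECT fips, count(*)
FROM county_pac_snf
WHERE length(fips) <> 5
GROUP BY fips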
Another issue is that some rows (mostly state totals) have an invalid FIPS code. Delete them with this SQL query:
DELETE FROM county_pac_snf
WHERE fips='00000'
Now we can join the two tables based upon the FIPS code.
- Go to the longstay_filtered table and look at the data view
- Click the "Merge Datasets" button (bottom-right)
- Select "Column Join"
- On the left (longstay_filtered), select _countyfips
- On the right, select the county_pac_snf table and the fips column
- Click "Next Step" at the bottom
- Choose which columns you want to include from each table (or just accept the defaults)
- Click "Merge datasets"
The merged data will be saved as a new table called longstay_filtered_merge.
You can export this table as a CSV file using the same "Export layer" procedure described above.
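If you would rather not use the merge wizard, the equivalent join can be written directly in SQL. This is only a sketch: the PAC/SNF column name below is a placeholder, since the actual columns depend on what is in county_pac_snf, and depending on the CartoDB version there is typically an option to create a new dataset from a query like this.
-- Sketch of the same join in SQL (some_pac_snf_column is a placeholder)
SELECT l.*, c.fips AS county_fips, c.some_pac_snf_column
FROM longstay_filtered AS l
LEFT JOIN county_pac_snf AS c
ON l._countyfips = c.fips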
The following SQL will calculate the average 411 scores by zipcode:
SELECT
_zip,
count(*),
avg(q1_measure_score) AS q1avg,
avg(q2_measure_score) AS q2avg,
avg(q3_measure_score) AS q3avg,
avg(three_quarter_average) AS q123avg
FROM longstay_filtered
WHERE measure_code = 411
GROUP BY _zip
To save a CSV file of the query results, click the "Edit" link at the top right of the CartoDB window and select "Export layer". Save as longstay_411_score_by_zipcode.csv.
Re-run the query for measure code 415, and export the results as longstay_415_score_by_zipcode.csv.