We want to determine the county (name and FIPS code) that corresponds to each row of this dataset: https://data.medicare.gov/Nursing-Home-Compare/Quality-Measures-Long-Stay/iqd3-nsf3
We will use CartoDB to add new columns that will contain the latitude, longitude, county name, and county FIPS code.
Since the original data includes multiple rows for each facility, and we are only interested in measure codes 411 and 415, we can delete all the other rows. This will result in a much smaller file that will upload to CartoDB much more quickly.
- Open the Quality_Measures_-_Long_Stay.csv file in Excel
- Click the "Data" tab, then "Filter"
- Click the menu at the top of the "Measure Code" column, select just the "411" and "415" values, and click "OK"
- Copy the filtered data by pressing CTRL-A (select all), CTRL-C (copy)
- Create a new sheet and paste the copied data
- Save the new sheet to a new file, longstay_filtered.csv
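If you would rather skip the Excel filtering, an alternative is to upload the full CSV and delete the unwanted rows in CartoDB. The following is only a sketch: the table name quality_measures_long_stay is whatever name CartoDB assigns on import, and if the measure code column comes in as text rather than a number, the values would need quotes.
-- Hypothetical alternative to the Excel filter: delete everything except codes 411 and 415
-- (table name assumed; adjust to whatever CartoDB called the imported table)
DELETE FROM quality_measures_long_stay
WHERE measure_code NOT IN (411, 415)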
The longstay dataset has various issues that we need to clean up. The location values have newline characters, which may cause problems for some software. Here's an example (there are invisible linebreak characters after each line):
701 MONROE STREET NW
RUSSELLVILLE, AL 35653
(34.514971, -87.736372)
We'll also want to convert the quarterly measure scores to numbers. (Currently, they are text values that include percent signs.)
The dataset already includes latitude/longitude coordinates, but these are jammed into the location column, appearing in parentheses after the full street address. In order to work with these coordinates as data, we will first need to extract them into separate columns. Finally, we will also extract the zipcodes to a new column, so that we can analyze the data by zipcode later on.
- Load the longstay_filtered.csv file into CartoDB
- Clean up those newlines in the location column by changing them all to spaces. Go to the SQL tab (on the right) and run this query:
UPDATE longstay_filtered SET location = regexp_replace(location, '\n', ' ', 'g')
- Remove the percent signs from the scores:
UPDATE longstay_filtered
SET
q1_measure_score = replace(q1_measure_score, '%', ''),
q2_measure_score = replace(q2_measure_score, '%', ''),
q3_measure_score = replace(q3_measure_score, '%', ''),
three_quarter_average = replace(three_quarter_average, '%', '')
- Convert each of those four columns to numeric values by clicking the column header > Change data type > number (a SQL alternative is sketched after this list)
- Run this SQL to add a new column _zip and extract the zipcode (if any) from the location column:
ALTER TABLE longstay_filtered ADD COLUMN _zip text;
UPDATE longstay_filtered SET _zip = (regexp_matches(location, ' \w{2} (\d{5}) ($|\()'))[1]
- Run this SQL to add new columns _lat and _lon and set them to the values (if any) found in the location column:
ALTER TABLE longstay_filtered ADD COLUMN _lat double precision;
ALTER TABLE longstay_filtered ADD COLUMN _lon double precision;
UPDATE longstay_filtered
SET
_lat = regexp_replace(location, '.*\((-?\d+\.\d+), ?(-?\d+\.\d+)\).*', '\1')::numeric,
_lon = regexp_replace(location, '.*\((-?\d+\.\d+), ?(-?\d+\.\d+)\).*', '\2')::numeric
WHERE position('(' in location) > 0
- Click the _lat column, select "Georeference..." and specify the longitude and latitude columns. This turns each row into a point on the map. You can switch to the "Map View" (button at top center of the page) to view these points on a map.
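If you prefer to stay in SQL, a sketch like the following should be equivalent to the Georeference wizard; it assumes the extracted _lat/_lon values are plain WGS84 coordinates, which is what CartoDB's the_geom column expects (CartoDB's CDB_LatLng(lat, lon) helper builds the same kind of point):
-- Sketch of a SQL equivalent to the Georeference wizard (assumes WGS84 coordinates)
UPDATE longstay_filtered
SET the_geom = ST_SetSRID(ST_MakePoint(_lon, _lat), 4326)
WHERE _lat IS NOT NULL AND _lon IS NOT NULL
Similarly, the "Change data type" step above can be done in SQL instead of through the column menu. This is only a sketch: the NULLIF guards are there because blank values would otherwise make the cast fail, and any non-numeric placeholder text left in the score columns would need to be cleared first.
-- Sketch of a SQL alternative to "Change data type > number" for the four score columns
ALTER TABLE longstay_filtered
ALTER COLUMN q1_measure_score TYPE numeric USING NULLIF(q1_measure_score, '')::numeric,
ALTER COLUMN q2_measure_score TYPE numeric USING NULLIF(q2_measure_score, '')::numeric,
ALTER COLUMN q3_measure_score TYPE numeric USING NULLIF(q3_measure_score, '')::numeric,
ALTER COLUMN three_quarter_average TYPE numeric USING NULLIF(three_quarter_average, '')::numeric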
Back in the "Data View", you may have noticed that there are some location values in the data table that lack point coordinates. It appears that at least some of these are associated with faulty addresses (like "2018 SALISBURY MD 21802"). Depending on the goals of the project, these rows might need to be corrected, or they could just be discarded from the analysis. To view a list of the rows lacking point locations, run this SQL:
SELECT DISTINCT federal_provider_number, provider_name, location
FROM longstay_filtered
WHERE the_geom IS NULL
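If you decide to simply discard these problem rows rather than fix their addresses, a one-line delete like this would do it (it is destructive, so consider exporting a copy of the table first):
-- Optional: drop the rows that could not be georeferenced
DELETE FROM longstay_filtered
WHERE the_geom IS NULL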
Load the 2015 Census county boundaries into CartoDB: http://www2.census.gov/geo/tiger/TIGER2015/COUNTY/tl_2015_us_county.zip
(NOTE: The following method was also tested using the 500k generalized boundaries, which is a much smaller dataset, but the boundaries are not quite as precise. Comparing the results between both sets of boundaries, 8 of the locations that were very close to the county boundary were identified as different counties, so it is probably best to use the more precise boundaries.)
- Run this SQL to add new columns _countyname and _countyfips, and calculate the values for each point:
ALTER TABLE longstay_filtered ADD COLUMN _countyname text;
ALTER TABLE longstay_filtered ADD COLUMN _countyfips text;
UPDATE longstay_filtered AS l
SET
_countyname = c.namelsad,
_countyfips = c.geoid
FROM tl_2015_us_county AS c
WHERE ST_Within(l.the_geom, c.the_geom)
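After running the update, it is worth checking for points that did not fall inside any county polygon (for example, coordinates that land just offshore); those rows will still have NULL county values:
-- Points that have a geometry but no matching county
SELECT federal_provider_number, provider_name, location
FROM longstay_filtered
WHERE the_geom IS NOT NULL AND _countyfips IS NULL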
To save a CSV file of the enhanced dataset with the new columns, click the "Edit" link at the top right of the CartoDB window and select "Export layer". Save it in CSV format as longstay_county.csv. (It may take a few moments before the download starts.)
When you open the .csv file, notice that the new columns (_lat, _lon, _countyname, _countyfips) are on the right side of the table. There are also two new columns (cartodb_id, the_geom) on the left side of the table -- these are specific to CartoDB and can be safely discarded.
We have a table (call it county_pac_snf.csv) of demographic and PAC/SNF data by county, and would like to join it to the longstay data using the county FIPS code. Unfortunately, the original Excel file containing the PAC/SNF data did not define the FIPS codes as text, so they get imported as numbers (losing their leading zeros in many cases). But we can fix this in CartoDB.
- First, import the county_pac_snf.csv table into CartoDB.
- In the data view, click the fips column > Change data type... > string
- Run this SQL query to restore the leading zeros:
UPDATE county_pac_snf
SET fips = lpad(fips, 5, '0')
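A quick way to confirm the padding worked is to look for codes that are still not exactly five characters long; this query should come back empty:
-- Sanity check: every FIPS code should now be five characters
SELECT fips, count(*)
FROM county_pac_snf
WHERE length(fips) <> 5
GROUP BY fips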
Another issue is that some rows (mostly state totals) have an invalid FIPS code. Delete them with this SQL query:
DELETE FROM county_pac_snf
WHERE fips='00000'
Now we can join the two tables based upon the FIPS code.
- Go to the longstay_filtered table and look at the data view
- Click the "Merge Datasets" button (bottom-right)
- Select "Column Join"
- On the left (longstay_filtered), select _countyfips
- On the right, select the county_pac_snf table and the fips column
- Click "Next Step" at the bottom
- Choose which columns you want to include from each table (or just accept the defaults)
- Click "Merge datasets"
The merged data will be saved as a new table called longstay_filtered_merge.
You can export this table as a CSV file using the same "Export layer" procedure described above.
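If you would rather not use the merge wizard, the equivalent join can be written directly in SQL. This is only a sketch: the PAC/SNF column name below is a placeholder, since the actual columns depend on what is in county_pac_snf, and depending on the CartoDB version there is typically an option to create a new dataset from a query like this.
-- Sketch of the same join in SQL (some_pac_snf_column is a placeholder)
SELECT l.*, c.fips AS county_fips, c.some_pac_snf_column
FROM longstay_filtered AS l
LEFT JOIN county_pac_snf AS c
ON l._countyfips = c.fips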
The following SQL will calculate the average 411 scores by zipcode:
SELECT
_zip,
count(*),
avg(q1_measure_score) AS q1avg,
avg(q2_measure_score) AS q2avg,
avg(q3_measure_score) AS q3avg,
avg(three_quarter_average) AS q123avg
FROM longstay_filtered
WHERE measure_code = 411
GROUP BY _zip
To save a CSV file of the query results, click the "Edit" link at the top right of the CartoDB window and select "Export layer". Save as longstay_411_score_by_zipcode.csv.
Re-run the query for measure code 415, and export the results as longstay_415_score_by_zipcode.csv.