Last active
October 29, 2019 19:04
-
-
Save webbedfeet/e362312cc6c5bf39b05c026bdc0176e6 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#' --- | |
#' title: An annotated R script for Homework 6 using lists and maps | |
#' output: html_document | |
#' --- | |
#' | |
#+, include=FALSE | |
knitr::opts_chunk$set(error=F, warning=F, message=F) | |
#' First, read the data in. We've stored just the data files in the folder `data/HW6`. | |
library(tidyverse) | |
library(rio) | |
fnames <- dir('data/HW6', pattern='csv', full.names=T) | |
# full.names=T ensures that the full relative path to the files is returned | |
# Now pull the data from multiple files into R in one go | |
dats <- import_list(fnames, skip=4) # from the `rio` package | |
#' This command creates a **list** of data frames. If you look at `dats` in the | |
#' Environment tab, you'll see that is is a list of 5 objects. If you click on the | |
#' magnifying glass to the right of that listing, you'll see | |
#' | |
#' ![](img/dats_list.png) | |
#' | |
#' i.e., it is a list of 5 data frames of the same size. | |
#' | |
#' ### Why lists? | |
#' | |
#' Well it has to do with two issues: | |
#' | |
#' 1. If we want to do the same operations over and over again, we use functions | |
#' 2. Applying the same functions to many datasets is most efficient using lists and | |
#' `purrr::map`. `map` inputs a list, applies a function to each member of the list, | |
#' and then outputs the results in a list. | |
#' | |
#' > Just like the `dplyr` pipelines start with a data frame and ends with a data frame, | |
#' you can create a list-based pipeline using `map`, since it starts with a list and | |
#' outputs a list. | |
#' | |
#' To demonstrate: | |
dats <- map(dats, janitor::clean_names) # clean the names of the columns into snake_case | |
#' Here we use the function `clean_names` from the `janitor` package to clean the names | |
#' of the columns to snake_case, without spaces and punctuation. This is an example of | |
#' using a function with default options; the first argument of the function is fed each | |
#' element of the list in turn, and the results are output into corresponding elements of a | |
#' new list. | |
#' | |
#' Next, we want to do the same munging operation on all 5 data sets. This operation | |
#' will remove the first row of a dataset, and then ensure that all the columns are of | |
#' type `numeric`. This can be implemented on the fly using an anonymous function (i.e. | |
#' a function with no name) within the `map` function. | |
#+ , warning = T | |
dats <- map(dats, | |
function(d){ | |
d %>% | |
slice(-1) %>% # Take away first column | |
mutate_all(as.numeric) # Transform all columns to numeric | |
}) | |
#' The above warning happens since for black females, one of the data points is missing | |
#' and is coded as `-` in the data; coercing this value to numeric results in a `NA`. | |
#' | |
#' Next, for each site, we want to create separate data sets for men, women and both sexes. | |
#' Once again, this is an instance of the same operation being performed on | |
#' each component of the list `dats`, so `map` would be appropriate. | |
dats_both <- map(dats, select, year_of_diagnosis, | |
ends_with('sexes')) # Create both sexes | |
dats_male <- map(dats, select, year_of_diagnosis, | |
ends_with('_males')) # Create male datasets | |
dats_female <- map(dats, select, year_of_diagnosis, | |
ends_with('females')) # Create female datasets | |
#' | |
#' ## Join the datasets into 1, by gender | |
#' | |
#' Let's focus on the "both sexes" dataset first; we'll do the same operation on the male | |
#' and female datasets (using a `for`-loop). | |
#' | |
#' There are two ways to put these datasets together. One is to keep the data | |
#' as is, and stack them, with an additional column denoting the site of the cancer | |
#' associated with each row. This can be easily done from a list of data frames, | |
#' provided that you have a **named list**. | |
names(dats_both) | |
#' | |
#' Yes, we have names. | |
#' | |
#' ### Stacking data sets, with a new column denoting the site | |
#' | |
#' If we now want to stack all the data sets together, with a | |
#' column denoting site of each record, we can do | |
#' | |
dats_both_joined <- bind_rows(dats_both, .id = 'Site') | |
#' Here, the `.id` option creates "a new column of identifiers is created to link | |
#' each row to its original data frame. The labels are taken from the named | |
#' arguments to bind_rows(). When a list of data frames is supplied, the | |
#' labels are taken from the names of the list. If no names are found a | |
#' numeric sequence is used instead." | |
str(dats_both_joined) | |
#' | |
#' ### Joining data sets into wide dataset, then use `gather` | |
#' | |
#' An alternative to get to the same point is to use `left_join` successively to create | |
#' a wide dataset. For this we would first need to distinguish the column names for | |
#' each site. This is achieved by the following for loop: | |
for (nm in names(dats_both)){ | |
names(dats_both[[nm]]) <- str_replace(names(dats_both[[nm]]), | |
'both_sexes', nm) | |
names(dats_male[[nm]]) <- str_replace(names(dats_both[[nm]]), | |
'males', nm) | |
names(dats_female[[nm]]) <- str_replace(names(dats_both[[nm]]), | |
'females', nm) | |
} | |
#' In this for loop, first `nm` takes the value `"Brain"`, and so for the | |
#' Brain data, it replaces, for example, the string `"both_sexes"` in the names of | |
#' the data with the string `"Brain"`. So that data set would have column names | |
#' `year_of_diagnosis`, `all_races_Brain`, `whites_Brain` and `black_Brain`. The for loop | |
#' then moves on to the next element, which is `"Colon"`, and replaces, in the recipe, | |
#' the string `"Colon"` everywhere it seens `nm`. and so on. | |
#' | |
#' Conceptually, for the homework, you could now have done the following pipe | |
#' to join all the datasets together: | |
#' ``` | |
#' brain %>% left_join(colon) %>% left_join(esophagus) %>% | |
#' left_join(lung) %>% left_join(oral) | |
#' ``` | |
#' | |
#' This is great, if you can type out all the data sets. For more generality, | |
#' lists and the function `Reduce` come in handy. | |
#' `Reduce` uses a binary function (i.e., a function with two inputs) to | |
#' successively combine the elements of a given vector or list. So it does the | |
#' operation described above for arbitrary sizes of the list. Here our binary function | |
#' will be `left_join` which takes two inputs, namely a left dataset and a right dataset. | |
dats2_both <- Reduce(left_join, dats_both) | |
dats2_male <- Reduce(left_join, dats_male) | |
dats2_female <- Reduce(left_join, dats_female) | |
#' Now we have a bit of data procesisng to do. We need to gather the data into long form, | |
#' and split the gathered key column into site and race. These needs three steps: | |
#' | |
#' 1. gather the dataset into 3 columns (year and the other two) | |
#' 1. Fix `all_races` to `allraces` so that `separate` can separate based on `_`. | |
#' 1. Separate the variable into race and site variables | |
#' | |
#' Since we'll be doing the same operation on all three datasets, we can write a function | |
#' instead of copying-and-pasting. | |
f1 <- function(d){ | |
d %>% | |
tidyr::gather(variable, rate, -year_of_diagnosis) %>% | |
mutate(variable = str_replace(variable, | |
'all_races', | |
'allraces')) %>% | |
separate(variable, c('race','site'), sep='_') | |
} | |
#' Now we can do one of two things: either just do this separately for the three datasets: | |
dats3_both <- f1(dats2_both) | |
dats3_male <- f1(dats2_male) | |
dats3_female <- f1(dats2_female) | |
#' or use lists and maps | |
dats2 <- list('both' = dats2_both, | |
'male' = dats2_male, | |
'female' = dats2_female) | |
dats3 <- map(dats2, f1) | |
#' or, create a pipeline of lists using `map`: | |
dats3 <- dats2 %>% | |
map(tidyr::gather, variable, rate, -year_of_diagnosis) %>% | |
map(mutate, variable = str_replace(variable, 'all_races','allraces')) %>% | |
map(separate, variable, c('race','site'), sep='_') | |
#' Note that the map pipeline is very similar to the pipeline in `f1`. Whereever we | |
#' had `function(...)` in `f1`, we now have `map(function, ...)` where `function` is | |
#' the function name and `...` are the function arguments. | |
#' | |
#' | |
#' The last step is to create the graphs. I'm doing a simpler version where I | |
#' just use a for-loop to create all the plots | |
for(nm in names(dats3)){ | |
plt1 <- dats3[[nm]] %>% | |
filter(race=='allraces') %>% | |
ggplot(aes(x = year_of_diagnosis, | |
y = rate, | |
color = site))+ | |
geom_point() + | |
labs(title = nm, | |
subtitle='allraces') | |
plt2 <- dats3[[nm]] %>% | |
filter(race=='whites') %>% | |
ggplot(aes(x = year_of_diagnosis, | |
y = rate, | |
color = site))+ | |
geom_point()+ | |
labs(title=nm, subtitle='whites') | |
plt3 <- dats3[[nm]] %>% | |
filter(race=='blacks') %>% | |
ggplot(aes(x = year_of_diagnosis, | |
y = rate, | |
color = site))+ | |
geom_point()+ | |
labs(title=nm, subtitle='blacks') | |
print(cowplot::plot_grid(plt1,plt2, plt3, ncol=1)) | |
} | |
#' Using lists and plots, we might do this perhaps a bit more succintly. However | |
#' it is up to you to decide which one is easier for you to understand (code should be human-readable) | |
for(nm in names(dats3)){ | |
plts <- dats3[[nm]] %>% | |
group_split(race) %>% # Splits data into list of datasets based on values of race | |
map(~ggplot(., aes(x = year_of_diagnosis, | |
y = rate, | |
color=site))+ | |
geom_point() + | |
labs(title=nm, subtitle=unique(.$race))) | |
print(cowplot::plot_grid(plotlist=plts, ncol=1)) | |
} | |
#' ## Summary | |
#' | |
#' The coding strategy we've described here works well when you have a bunch of | |
#' standardized datasets (formatted similarly) and you want to do the same operations | |
#' to all of them to achieve some high-throughput computing. It requires both | |
#' | |
#' 1. Data sets in identical standard formats | |
#' 1. The same common operations to be done to all of them | |
#' | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment