Skip to content

Instantly share code, notes, and snippets.

@MonkmanMH
Created November 4, 2019 03:34
Show Gist options
  • Save MonkmanMH/7b1134a8454cee3194ea0621a44f4d42 to your computer and use it in GitHub Desktop.
Save MonkmanMH/7b1134a8454cee3194ea0621a44f4d42 to your computer and use it in GitHub Desktop.
Answer key for tidyverse transform data exercises
---
title: "Transform Data"
subtitle: "hands-on examples, with answers"
output: html_notebook
---
<!-- This file by Charlotte Wickham (with some modifications by Martin Monkman) is licensed under a Creative Commons Attribution 4.0 International License, adapted from the orignal work at https://github.com/rstudio/master-the-tidyverse by RStudio and https://github.com/cwickham/data-science-in-tidyverse-solutions. -->
```{r setup}
library(tidyverse)
library(gapminder)
```
# Transform Data 1: manipulate cases
## gapminder
Take a quick look at the data in the {gapminder} package.
```{r}
gapminder
```
## Your Turn 1.1: filter() rows
See if you can use the logical operators to manipulate our code below to show:
The data for Canada
```{r}
# example
filter(gapminder, country == "New Zealand")
# answer
filter(gapminder, country == "Canada")
```
All data for countries in Oceania
```{r}
# example
filter(gapminder, country == "New Zealand")
# answer
filter(gapminder, continent == "Oceania")
```
Rows where the life expectancy is greater than 82
```{r}
# example
filter(gapminder, country == "New Zealand")
# answer
filter(gapminder, lifeExp > 82)
```
## Your Turn 1.2: multiple criteria
Use Boolean operators to alter the code below to return only the rows that contain:
* United States before 1970
```{r}
# example
gapminder %>% # Note use of `%>%` pipe symbol
filter(country == "New Zealand", year > 2000)
# answer
gapminder %>%
filter(country == "United States", year < 1970)
# alternative answer
gapminder %>%
filter(country == "United States" & year < 1970) # Note use of `&` rather than comma
```
* Countries where life expectancy in 2007 is below 50 or over 75
```{r}
# answer
gapminder %>%
filter(year == 2007, lifeExp < 50 | lifeExp > 75) # Note that you have to state `lifeExp` twice, as
# there are two comparisons.
# What happens if you replace the comma after 2007 with `&`?
# - Why do you think this might be happening?
```
* Create a list object and return records for any of "New Zealand", "Canada" or "United States"
```{r}
# answer
gapminder %>%
filter(country %in% c("New Zealand", "Canada", "United States"))
# alternate answer
country_list <- c("New Zealand", "Canada", "United States") # create object that is list of countries
gapminder %>%
filter(country %in% country_list) # filter uses that object
```
* Return records for all continents except Europe
```{r}
# answer
gapminder %>%
filter(continent != "Europe")
# answer with a double-check
gapminder %>%
filter(continent != "Europe") %>%
distinct(continent) # the `distinct()` function returns the list of all the
# continent names after the filter is applied
```
## Your Turn 1.3: sorting with `arrange()`
Find the records with the smallest population.
```{r}
#answer
gapminder %>%
arrange(pop)
```
Find the records with the largest GDP per capita.
```{r}
#answer
gapminder %>% arrange(desc(gdpPercap))
#answer
arrange(gapminder, desc(gdpPercap))
#answer
gapminder %>% arrange(-gdpPercap)
```
## Your Turn 1.4: creating new variables
Calculate a new variable with total GDP
```{r, error = TRUE}
# answer
gapminder %>%
mutate(totalGDP = pop * gdpPercap)
```
Create a summary table with the population and GDP by continent for the year 1952
```{r}
# answer
gapminder %>%
filter(year == 1952) %>% # filter to get the records we need
mutate(totalGDP = pop * gdpPercap) %>% # note that we need to add the new variable calculated above!
group_by(continent) %>% # this defines the grouping category
summarize(pop = sum(pop), #
totalGDP = sum(totalGDP)) #
```
## Your Turn 1.5
Alter the code to add a `prev_lifeExp` column that contains the life expectancy from the previous record.
(Hint: use "Transformation" cheatsheet, you want to offset elements by one)
Extra challenge: Why isn't this quite the 'life expectency five years ago'?
```{r}
# answer
gapminder %>%
mutate(prev_lifeExp = lag(lifeExp))
# challenge hint:
# - note that there is an `NA` in the first row, since there's no prior row to pull from
# - look at the `prev_lifeExp` when the country changes from Afghanistan to Albania
```
## Take aways
* Extract cases with `filter()`
* Filter using a list with `%in%`
* Make new variables, with `mutate()`
* Connect operations with `%>%`
***
# Transform Data 2: summarize
<!-- This file by Charlotte Wickham (with some modifications by Martin Monkman) is licensed under a Creative Commons Attribution 4.0 International License, adapted from the orignal work at https://github.com/rstudio/master-the-tidyverse by RStudio. -->
```{r}
library(tidyverse)
library(gapminder)
```
## gapminder
Take a quick look at the data in the {gapminder} package.
```{r}
gapminder
```
## Your Turn 2.1
Use summarise() to compute three statistics about the data:
* The first (minimum) year in the dataset
* The last (maximum) year in the dataset
* The number of unique countries
```{r}
# answer
gapminder %>%
summarize(year_min = min(year),
year_max = max(year),
n_countries = n_distinct(country))
```
## Your Turn 2.2
Extract the rows for African countries in 2007.
Then find:
1. The number of unique countries
2. The median life expectancy of African countries as a group
```{r}
# answer
gapminder %>%
filter(continent == "Africa", year == 2007) %>%
summarise(n_countries = n_distinct(country),
med_life_exp = median(lifeExp))
```
## Your Turn 2.3
Find the median life expectancy by continent in 2007.
```{r}
gapminder %>%
filter(year == 2007) %>%
group_by(continent) %>%
summarise(med_life_exp = median(lifeExp))
```
## Your Turn 2.4
Find the median life expectancy by continent for each year in the dataframe.
```{r}
gapminder %>%
group_by(continent, year) %>%
summarize(median_lifeExp = median(lifeExp))
```
## Your Turn 2.5
Brainstorm with your neighbor the sequence of operations to find: the country with biggest jump in life expectancy (between any two consecutive records) for each continent.
## Your Turn 2.6
Find the country with biggest jump in life expectancy (between any two consecutive records) for each continent.
```{r}
# One of many solutions
gapminder %>%
group_by(country) %>%
mutate(prev_lifeExp = lag(lifeExp),
jump = lifeExp - prev_lifeExp) %>%
arrange(desc(jump))
```
## Take aways
* Make tables of summaries with `summarise()`
* Do groupwise operations with `group_by()`
***
# Transform Data 3: pivot
<!-- This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License -->
```{r}
library(tidyverse)
# data packages
library(gapminder)
```
## Your Turn 3.1
A pivot table crosstab using the {mpg} data package of automobile fuel economy.
First, a look at the source table.
```{r}
mpg
```
Using `group_by` and `summarise` to create a summary table of the average engine displacement by vehicle class:
```{r}
mpg %>%
group_by(class) %>%
summarise(displ_mean = mean(displ))
```
Now do the same with vehicle class and number of cylinders:
```{r}
# answer
class_by_cyl <- mpg %>% # note that an ojbect `class_by_cyl` is created
group_by(class, cyl) %>%
summarise(displ_mean = mean(displ))
class_by_cyl # in order to print the object, just name it
```
What happens if you reverse the order of the grouping variables?
Create a pivot (crosstab) table with class by number of cylinders:
```{r}
# answer
# pivot table (wide)
class_by_cyl_pivot <- class_by_cyl %>%
pivot_wider(names_from = cyl, values_from = displ_mean)
class_by_cyl_pivot
```
Now, unpivot it back to the original structure:
```{r}
# and back to longer...
displ_class_by_cyl <- class_by_cyl_pivot %>%
pivot_longer(-class, names_to = "cyl", values_to = "displ_mean")
displ_class_by_cyl
```
What do you notice about the structure of the unpivoted table?
## Your Turn 3.2
Making some summary tables with the {gapminder} data.
Create a table where:
* object name `lifeExp_by_continent` that has
* the median life expectency for each continent for every year in the dataframe,
* where the continent is the column and the year is the row.
(This is a restructuring of the table from 2.4)
```{r}
# answer
lifeExp_by_continent <- gapminder %>%
group_by(continent, year) %>%
summarize(median_lifeExp = median(lifeExp)) %>%
pivot_wider(names_from = continent, values_from = median_lifeExp)
lifeExp_by_continent
```
## Your Turn 3.3
Create a table that shows:
* GDP per capita for
* the countries Canada, United States, and Mexico are the columns, and
* the years after 1980 are the rows
```{r}
gapminder %>%
filter(year > 1980,
country %in% c("Canada", "United States", "Mexico")) %>%
select(country, year, gdpPercap) %>% # select() is covered in section 4
pivot_wider(names_from = country, values_from = gdpPercap)
```
## Take aways
* Restructure data tabulations with `pivot_wider()` and `pivot_longer`
***
# Transform data 4: select()
<!-- This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License -->
```{r}
library(tidyverse)
# data packages
library(gapminder)
```
## Your Turn 4.1
{gapminder} again
```{r}
gapminder
```
Using `select()` to reduce the number of variables:
```{r}
# answer
gapminder %>%
select(country, year, lifeExp)
```
You can use the minus sign to drop variable--that is, to define ones to _not_ select.
In this example, note we can use `c()` to create a list:
```{r}
# answer
gapminder %>%
select(-(c(continent, pop, gdpPercap)))
# alternate answer
gapminder %>%
select(-continent, -pop, -gdpPercap)
```
You can define a range of columns to select by using the `:`
```{r}
# answer
gapminder %>%
select(country, year:pop)
```
## Take aways
* Restructure data frames with `select()` and `pull()`
* Reinforce other data transformation functions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment