Created
November 4, 2019 03:34
-
-
Save MonkmanMH/7b1134a8454cee3194ea0621a44f4d42 to your computer and use it in GitHub Desktop.
Answer key for tidyverse transform data exercises
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Transform Data" | |
subtitle: "hands-on examples, with answers" | |
output: html_notebook | |
--- | |
<!-- This file by Charlotte Wickham (with some modifications by Martin Monkman) is licensed under a Creative Commons Attribution 4.0 International License, adapted from the orignal work at https://github.com/rstudio/master-the-tidyverse by RStudio and https://github.com/cwickham/data-science-in-tidyverse-solutions. --> | |
```{r setup} | |
library(tidyverse) | |
library(gapminder) | |
``` | |
# Transform Data 1: manipulate cases | |
## gapminder | |
Take a quick look at the data in the {gapminder} package. | |
```{r} | |
gapminder | |
``` | |
## Your Turn 1.1: filter() rows | |
See if you can use the logical operators to manipulate our code below to show: | |
The data for Canada | |
```{r} | |
# example | |
filter(gapminder, country == "New Zealand") | |
# answer | |
filter(gapminder, country == "Canada") | |
``` | |
All data for countries in Oceania | |
```{r} | |
# example | |
filter(gapminder, country == "New Zealand") | |
# answer | |
filter(gapminder, continent == "Oceania") | |
``` | |
Rows where the life expectancy is greater than 82 | |
```{r} | |
# example | |
filter(gapminder, country == "New Zealand") | |
# answer | |
filter(gapminder, lifeExp > 82) | |
``` | |
## Your Turn 1.2: multiple criteria | |
Use Boolean operators to alter the code below to return only the rows that contain: | |
* United States before 1970 | |
```{r} | |
# example | |
gapminder %>% # Note use of `%>%` pipe symbol | |
filter(country == "New Zealand", year > 2000) | |
# answer | |
gapminder %>% | |
filter(country == "United States", year < 1970) | |
# alternative answer | |
gapminder %>% | |
filter(country == "United States" & year < 1970) # Note use of `&` rather than comma | |
``` | |
* Countries where life expectancy in 2007 is below 50 or over 75 | |
```{r} | |
# answer | |
gapminder %>% | |
filter(year == 2007, lifeExp < 50 | lifeExp > 75) # Note that you have to state `lifeExp` twice, as | |
# there are two comparisons. | |
# What happens if you replace the comma after 2007 with `&`? | |
# - Why do you think this might be happening? | |
``` | |
* Create a list object and return records for any of "New Zealand", "Canada" or "United States" | |
```{r} | |
# answer | |
gapminder %>% | |
filter(country %in% c("New Zealand", "Canada", "United States")) | |
# alternate answer | |
country_list <- c("New Zealand", "Canada", "United States") # create object that is list of countries | |
gapminder %>% | |
filter(country %in% country_list) # filter uses that object | |
``` | |
* Return records for all continents except Europe | |
```{r} | |
# answer | |
gapminder %>% | |
filter(continent != "Europe") | |
# answer with a double-check | |
gapminder %>% | |
filter(continent != "Europe") %>% | |
distinct(continent) # the `distinct()` function returns the list of all the | |
# continent names after the filter is applied | |
``` | |
## Your Turn 1.3: sorting with `arrange()` | |
Find the records with the smallest population. | |
```{r} | |
#answer | |
gapminder %>% | |
arrange(pop) | |
``` | |
Find the records with the largest GDP per capita. | |
```{r} | |
#answer | |
gapminder %>% arrange(desc(gdpPercap)) | |
#answer | |
arrange(gapminder, desc(gdpPercap)) | |
#answer | |
gapminder %>% arrange(-gdpPercap) | |
``` | |
## Your Turn 1.4: creating new variables | |
Calculate a new variable with total GDP | |
```{r, error = TRUE} | |
# answer | |
gapminder %>% | |
mutate(totalGDP = pop * gdpPercap) | |
``` | |
Create a summary table with the population and GDP by continent for the year 1952 | |
```{r} | |
# answer | |
gapminder %>% | |
filter(year == 1952) %>% # filter to get the records we need | |
mutate(totalGDP = pop * gdpPercap) %>% # note that we need to add the new variable calculated above! | |
group_by(continent) %>% # this defines the grouping category | |
summarize(pop = sum(pop), # | |
totalGDP = sum(totalGDP)) # | |
``` | |
## Your Turn 1.5 | |
Alter the code to add a `prev_lifeExp` column that contains the life expectancy from the previous record. | |
(Hint: use "Transformation" cheatsheet, you want to offset elements by one) | |
Extra challenge: Why isn't this quite the 'life expectency five years ago'? | |
```{r} | |
# answer | |
gapminder %>% | |
mutate(prev_lifeExp = lag(lifeExp)) | |
# challenge hint: | |
# - note that there is an `NA` in the first row, since there's no prior row to pull from | |
# - look at the `prev_lifeExp` when the country changes from Afghanistan to Albania | |
``` | |
## Take aways | |
* Extract cases with `filter()` | |
* Filter using a list with `%in%` | |
* Make new variables, with `mutate()` | |
* Connect operations with `%>%` | |
*** | |
# Transform Data 2: summarize | |
<!-- This file by Charlotte Wickham (with some modifications by Martin Monkman) is licensed under a Creative Commons Attribution 4.0 International License, adapted from the orignal work at https://github.com/rstudio/master-the-tidyverse by RStudio. --> | |
```{r} | |
library(tidyverse) | |
library(gapminder) | |
``` | |
## gapminder | |
Take a quick look at the data in the {gapminder} package. | |
```{r} | |
gapminder | |
``` | |
## Your Turn 2.1 | |
Use summarise() to compute three statistics about the data: | |
* The first (minimum) year in the dataset | |
* The last (maximum) year in the dataset | |
* The number of unique countries | |
```{r} | |
# answer | |
gapminder %>% | |
summarize(year_min = min(year), | |
year_max = max(year), | |
n_countries = n_distinct(country)) | |
``` | |
## Your Turn 2.2 | |
Extract the rows for African countries in 2007. | |
Then find: | |
1. The number of unique countries | |
2. The median life expectancy of African countries as a group | |
```{r} | |
# answer | |
gapminder %>% | |
filter(continent == "Africa", year == 2007) %>% | |
summarise(n_countries = n_distinct(country), | |
med_life_exp = median(lifeExp)) | |
``` | |
## Your Turn 2.3 | |
Find the median life expectancy by continent in 2007. | |
```{r} | |
gapminder %>% | |
filter(year == 2007) %>% | |
group_by(continent) %>% | |
summarise(med_life_exp = median(lifeExp)) | |
``` | |
## Your Turn 2.4 | |
Find the median life expectancy by continent for each year in the dataframe. | |
```{r} | |
gapminder %>% | |
group_by(continent, year) %>% | |
summarize(median_lifeExp = median(lifeExp)) | |
``` | |
## Your Turn 2.5 | |
Brainstorm with your neighbor the sequence of operations to find: the country with biggest jump in life expectancy (between any two consecutive records) for each continent. | |
## Your Turn 2.6 | |
Find the country with biggest jump in life expectancy (between any two consecutive records) for each continent. | |
```{r} | |
# One of many solutions | |
gapminder %>% | |
group_by(country) %>% | |
mutate(prev_lifeExp = lag(lifeExp), | |
jump = lifeExp - prev_lifeExp) %>% | |
arrange(desc(jump)) | |
``` | |
## Take aways | |
* Make tables of summaries with `summarise()` | |
* Do groupwise operations with `group_by()` | |
*** | |
# Transform Data 3: pivot | |
<!-- This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License --> | |
```{r} | |
library(tidyverse) | |
# data packages | |
library(gapminder) | |
``` | |
## Your Turn 3.1 | |
A pivot table crosstab using the {mpg} data package of automobile fuel economy. | |
First, a look at the source table. | |
```{r} | |
mpg | |
``` | |
Using `group_by` and `summarise` to create a summary table of the average engine displacement by vehicle class: | |
```{r} | |
mpg %>% | |
group_by(class) %>% | |
summarise(displ_mean = mean(displ)) | |
``` | |
Now do the same with vehicle class and number of cylinders: | |
```{r} | |
# answer | |
class_by_cyl <- mpg %>% # note that an ojbect `class_by_cyl` is created | |
group_by(class, cyl) %>% | |
summarise(displ_mean = mean(displ)) | |
class_by_cyl # in order to print the object, just name it | |
``` | |
What happens if you reverse the order of the grouping variables? | |
Create a pivot (crosstab) table with class by number of cylinders: | |
```{r} | |
# answer | |
# pivot table (wide) | |
class_by_cyl_pivot <- class_by_cyl %>% | |
pivot_wider(names_from = cyl, values_from = displ_mean) | |
class_by_cyl_pivot | |
``` | |
Now, unpivot it back to the original structure: | |
```{r} | |
# and back to longer... | |
displ_class_by_cyl <- class_by_cyl_pivot %>% | |
pivot_longer(-class, names_to = "cyl", values_to = "displ_mean") | |
displ_class_by_cyl | |
``` | |
What do you notice about the structure of the unpivoted table? | |
## Your Turn 3.2 | |
Making some summary tables with the {gapminder} data. | |
Create a table where: | |
* object name `lifeExp_by_continent` that has | |
* the median life expectency for each continent for every year in the dataframe, | |
* where the continent is the column and the year is the row. | |
(This is a restructuring of the table from 2.4) | |
```{r} | |
# answer | |
lifeExp_by_continent <- gapminder %>% | |
group_by(continent, year) %>% | |
summarize(median_lifeExp = median(lifeExp)) %>% | |
pivot_wider(names_from = continent, values_from = median_lifeExp) | |
lifeExp_by_continent | |
``` | |
## Your Turn 3.3 | |
Create a table that shows: | |
* GDP per capita for | |
* the countries Canada, United States, and Mexico are the columns, and | |
* the years after 1980 are the rows | |
```{r} | |
gapminder %>% | |
filter(year > 1980, | |
country %in% c("Canada", "United States", "Mexico")) %>% | |
select(country, year, gdpPercap) %>% # select() is covered in section 4 | |
pivot_wider(names_from = country, values_from = gdpPercap) | |
``` | |
## Take aways | |
* Restructure data tabulations with `pivot_wider()` and `pivot_longer` | |
*** | |
# Transform data 4: select() | |
<!-- This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License --> | |
```{r} | |
library(tidyverse) | |
# data packages | |
library(gapminder) | |
``` | |
## Your Turn 4.1 | |
{gapminder} again | |
```{r} | |
gapminder | |
``` | |
Using `select()` to reduce the number of variables: | |
```{r} | |
# answer | |
gapminder %>% | |
select(country, year, lifeExp) | |
``` | |
You can use the minus sign to drop variable--that is, to define ones to _not_ select. | |
In this example, note we can use `c()` to create a list: | |
```{r} | |
# answer | |
gapminder %>% | |
select(-(c(continent, pop, gdpPercap))) | |
# alternate answer | |
gapminder %>% | |
select(-continent, -pop, -gdpPercap) | |
``` | |
You can define a range of columns to select by using the `:` | |
```{r} | |
# answer | |
gapminder %>% | |
select(country, year:pop) | |
``` | |
## Take aways | |
* Restructure data frames with `select()` and `pull()` | |
* Reinforce other data transformation functions | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment