MonkmanMH · November 4, 2019 03:34
diff --git a/transform_data_solutions.Rmd b/transform_data_solutions.Rmd
 ---
 title: "Transform Data"
 subtitle: "hands-on examples, with answers"
 output: html_notebook
 ---

 <!-- This file by Charlotte Wickham (with some modifications by Martin Monkman) is licensed under a Creative Commons Attribution 4.0 International License, adapted from the orignal work at https://github.com/rstudio/master-the-tidyverse by RStudio and https://github.com/cwickham/data-science-in-tidyverse-solutions. -->

 ```{r setup}
 library(tidyverse)
 library(gapminder)

 ```

 # Transform Data 1: manipulate cases


 ## gapminder

 Take a quick look at the data in the {gapminder} package.

 ```{r}
 gapminder
 ```

 ## Your Turn 1.1: filter() rows

 See if you can use the logical operators to manipulate our code below to show:

 The data for Canada

 ```{r}
 # example
 filter(gapminder, country == "New Zealand")

 # answer
 filter(gapminder, country == "Canada")

 ```

 All data for countries in Oceania

 ```{r}
 # example
 filter(gapminder, country == "New Zealand")

 # answer
 filter(gapminder, continent == "Oceania")
 ```

 Rows where the life expectancy is greater than 82
 ```{r}
 # example
 filter(gapminder, country == "New Zealand")

 # answer
 filter(gapminder, lifeExp > 82)
 ```


 ## Your Turn 1.2: multiple criteria

 Use Boolean operators to alter the code below to return only the rows that contain:

 * United States before 1970

 ```{r}
 # example 
 gapminder %>%                                       # Note use of `%>%` pipe symbol
  filter(country == "New Zealand", year > 2000)

 # answer
 gapminder %>%
  filter(country == "United States", year < 1970)

 # alternative answer
 gapminder %>%
  filter(country == "United States" & year < 1970)  # Note use of `&` rather than comma

 ```

 *  Countries where life expectancy in 2007 is below 50 or over 75

 ```{r}
 # answer
 gapminder %>%
  filter(year == 2007, lifeExp < 50 | lifeExp > 75) # Note that you have to state `lifeExp` twice, as
                                                    # there are two comparisons.
                                                    # What happens if you replace the comma after 2007 with `&`?
                                                    # - Why do you think this might be happening?



 ```



 * Create a list object and return records for any of "New Zealand", "Canada" or "United States"


 ```{r}
 # answer
 gapminder %>%
  filter(country %in% c("New Zealand", "Canada", "United States"))
  
 # alternate answer
 country_list <- c("New Zealand", "Canada", "United States")   # create object that is list of countries
 gapminder %>%
  filter(country %in% country_list)                           # filter uses that object 
  

 ```

 * Return records for all continents except Europe


 ```{r}
 # answer
 gapminder %>%
  filter(continent != "Europe")
  
 # answer with a double-check
 gapminder %>%
  filter(continent != "Europe") %>%
  distinct(continent)                 # the `distinct()` function returns the list of all the 
                                      # continent names after the filter is applied

 ```

 ## Your Turn 1.3: sorting with `arrange()`

 Find the records with the smallest population.
 ```{r}
 #answer
 gapminder %>%
  arrange(pop)

 ```

 Find the records with the largest GDP per capita.
 ```{r}
 #answer
 gapminder %>% arrange(desc(gdpPercap))

 #answer
 arrange(gapminder, desc(gdpPercap))

 #answer
 gapminder %>% arrange(-gdpPercap)


 ```


 ## Your Turn 1.4: creating new variables

 Calculate a new variable with total GDP

 ```{r, error = TRUE}
 # answer
 gapminder %>%
  mutate(totalGDP = pop * gdpPercap)


 ```

 Create a summary table with the population and GDP by continent for the year 1952  

 ```{r}
 # answer
 gapminder %>%
  filter(year == 1952) %>%               # filter to get the records we need
  mutate(totalGDP = pop * gdpPercap) %>% # note that we need to add the new variable calculated above!
  group_by(continent) %>%                # this defines the grouping category
  summarize(pop = sum(pop),              # 
            totalGDP = sum(totalGDP))    #
 ```




 ## Your Turn 1.5

 Alter the code to add a `prev_lifeExp` column that contains the life expectancy from the previous record.

 (Hint: use "Transformation" cheatsheet, you want to offset elements by one)

 Extra challenge: Why isn't this quite the 'life expectency five years ago'?

 ```{r}
 # answer
 gapminder %>%
  mutate(prev_lifeExp = lag(lifeExp))

 # challenge hint: 
 # - note that there is an `NA` in the first row, since there's no prior row to pull from
 # - look at the `prev_lifeExp` when the country changes from Afghanistan to Albania 
 ```


 ## Take aways

 * Extract cases with `filter()`  
 * Filter using a list with `%in%`
 * Make new variables, with `mutate()`  
 * Connect operations with `%>%`  

 ***

 # Transform Data 2: summarize

 <!-- This file by Charlotte Wickham (with some modifications by Martin Monkman) is licensed under a Creative Commons Attribution 4.0 International License, adapted from the orignal work at https://github.com/rstudio/master-the-tidyverse by RStudio. -->


 ```{r}
 library(tidyverse)
 library(gapminder)

 ```

 ## gapminder

 Take a quick look at the data in the {gapminder} package.

 ```{r}
 gapminder
 ```



 ## Your Turn 2.1

 Use summarise() to compute three statistics about the data:

 * The first (minimum) year in the dataset
 * The last (maximum) year in the dataset
 * The number of unique countries

 ```{r}
 # answer
 gapminder %>%
  summarize(year_min = min(year),
            year_max = max(year),
            n_countries = n_distinct(country))

 ```

 ## Your Turn 2.2

 Extract the rows for African countries in 2007.

 Then find:

 1. The number of unique countries
 2. The median life expectancy of African countries as a group

 ```{r}
 # answer
 gapminder %>% 
  filter(continent == "Africa", year == 2007) %>% 
  summarise(n_countries = n_distinct(country),
    med_life_exp = median(lifeExp))


 ```


 ## Your Turn 2.3

 Find the median life expectancy by continent in 2007.

 ```{r}
 gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>% 
  summarise(med_life_exp = median(lifeExp)) 

 ```


 ## Your Turn 2.4

 Find the median life expectancy by continent for each year in the dataframe.

 ```{r}

 gapminder %>%
  group_by(continent, year) %>%
  summarize(median_lifeExp = median(lifeExp))

 ```
 ## Your Turn 2.5

 Brainstorm with your neighbor the sequence of operations to find:  the country with biggest jump in life expectancy  (between any two consecutive records) for each continent.

 ## Your Turn 2.6

 Find the country with biggest jump in life expectancy (between any two consecutive records) for each continent.

 ```{r}
 # One of many solutions
 gapminder %>%
  group_by(country) %>%
  mutate(prev_lifeExp = lag(lifeExp),
         jump = lifeExp - prev_lifeExp) %>%
  arrange(desc(jump))


 ```


 ## Take aways

 * Make tables of summaries with `summarise()`  
 * Do groupwise operations with `group_by()`

 ***



 # Transform Data 3: pivot

 <!-- This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License -->


 ```{r}
 library(tidyverse)
 # data packages
 library(gapminder)

 ```


 ## Your Turn 3.1

 A pivot table crosstab using the {mpg} data package of automobile fuel economy.

 First, a look at the source table.

 ```{r}
 mpg
 ```

 Using `group_by` and `summarise` to create a summary table of the average engine displacement by vehicle class:

 ```{r}
 mpg %>%                                
  group_by(class) %>%                   
  summarise(displ_mean = mean(displ)) 

 ```

 Now do the same with vehicle class and number of cylinders:

 ```{r}
 # answer
 class_by_cyl <- mpg %>%                 # note that an ojbect `class_by_cyl` is created
  group_by(class, cyl) %>%
  summarise(displ_mean = mean(displ)) 

 class_by_cyl                            # in order to print the object, just name it

 ```


 What happens if you reverse the order of the grouping variables?



 Create a pivot (crosstab) table with class by number of cylinders:


 ```{r}
 # answer
 # pivot table (wide)

 class_by_cyl_pivot <- class_by_cyl %>%
  pivot_wider(names_from = cyl, values_from = displ_mean)

 class_by_cyl_pivot


 ```

 Now, unpivot it back to the original structure:

 ```{r}
 # and back to longer...

 displ_class_by_cyl <- class_by_cyl_pivot %>%
  pivot_longer(-class, names_to = "cyl", values_to = "displ_mean")

 displ_class_by_cyl

 ```

 What do you notice about the structure of the unpivoted table?



 ## Your Turn 3.2

 Making some summary tables with the {gapminder} data.

 Create a table where:

 * object name `lifeExp_by_continent` that has 
 * the median life expectency for each continent for every year in the dataframe, 
 * where the continent is the column and the year is the row.

 (This is a restructuring of the table from 2.4)

 ```{r}
 # answer

 lifeExp_by_continent <- gapminder %>%
  group_by(continent, year) %>%
  summarize(median_lifeExp = median(lifeExp)) %>%
  pivot_wider(names_from = continent, values_from = median_lifeExp)

 lifeExp_by_continent

 ```

 ## Your Turn 3.3

 Create a table that shows:

 * GDP per capita for
 * the countries Canada, United States, and Mexico are the columns, and 
 * the years after 1980 are the rows

 ```{r}

 gapminder %>%
  filter(year > 1980,
         country %in% c("Canada", "United States", "Mexico")) %>%
  select(country, year, gdpPercap) %>%        # select() is covered in section 4
  pivot_wider(names_from = country, values_from = gdpPercap)

 ```



 ## Take aways

 * Restructure data tabulations with `pivot_wider()` and `pivot_longer`  


 ***

 # Transform data 4: select()

 <!-- This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License -->


 ```{r}
 library(tidyverse)
 # data packages
 library(gapminder)

 ```


 ## Your Turn 4.1

 {gapminder} again

 ```{r}
 gapminder
 ```

 Using `select()` to reduce the number of variables:

 ```{r}
 # answer
 gapminder %>%                                
  select(country, year, lifeExp) 

 ```



 You can use the minus sign to drop variable--that is, to define ones to _not_ select. 

 In this example, note we can use `c()` to create a list:


 ```{r}
 # answer
 gapminder %>%                                
  select(-(c(continent, pop, gdpPercap))) 

 # alternate answer
 gapminder %>%                                
  select(-continent, -pop, -gdpPercap) 

 ```



 You can define a range of columns to select by using the `:`

 ```{r}
 # answer
 gapminder %>%
  select(country, year:pop)
 ```


 ## Take aways

 * Restructure data frames with `select()` and `pull()`  
 * Reinforce other data transformation functions
	---
	title: "Transform Data"
	subtitle: "hands-on examples, with answers"
	output: html_notebook
	---

	<!-- This file by Charlotte Wickham (with some modifications by Martin Monkman) is licensed under a Creative Commons Attribution 4.0 International License, adapted from the orignal work at https://github.com/rstudio/master-the-tidyverse by RStudio and https://github.com/cwickham/data-science-in-tidyverse-solutions. -->

	```{r setup}
	library(tidyverse)
	library(gapminder)

	```

	# Transform Data 1: manipulate cases


	## gapminder

	Take a quick look at the data in the {gapminder} package.

	```{r}
	gapminder
	```

	## Your Turn 1.1: filter() rows

	See if you can use the logical operators to manipulate our code below to show:

	The data for Canada

	```{r}
	# example
	filter(gapminder, country == "New Zealand")

	# answer
	filter(gapminder, country == "Canada")

	```

	All data for countries in Oceania

	```{r}
	# example
	filter(gapminder, country == "New Zealand")

	# answer
	filter(gapminder, continent == "Oceania")
	```

	Rows where the life expectancy is greater than 82
	```{r}
	# example
	filter(gapminder, country == "New Zealand")

	# answer
	filter(gapminder, lifeExp > 82)
	```


	## Your Turn 1.2: multiple criteria

	Use Boolean operators to alter the code below to return only the rows that contain:

	* United States before 1970

	```{r}
	# example
	gapminder %>% # Note use of `%>%` pipe symbol
	filter(country == "New Zealand", year > 2000)

	# answer
	gapminder %>%
	filter(country == "United States", year < 1970)

	# alternative answer
	gapminder %>%
	filter(country == "United States" & year < 1970) # Note use of `&` rather than comma

	```

	* Countries where life expectancy in 2007 is below 50 or over 75

	```{r}
	# answer
	gapminder %>%
	filter(year == 2007, lifeExp < 50 \| lifeExp > 75) # Note that you have to state `lifeExp` twice, as
	# there are two comparisons.
	# What happens if you replace the comma after 2007 with `&`?
	# - Why do you think this might be happening?



	```



	* Create a list object and return records for any of "New Zealand", "Canada" or "United States"


	```{r}
	# answer
	gapminder %>%
	filter(country %in% c("New Zealand", "Canada", "United States"))

	# alternate answer
	country_list <- c("New Zealand", "Canada", "United States") # create object that is list of countries
	gapminder %>%
	filter(country %in% country_list) # filter uses that object


	```

	* Return records for all continents except Europe


	```{r}
	# answer
	gapminder %>%
	filter(continent != "Europe")

	# answer with a double-check
	gapminder %>%
	filter(continent != "Europe") %>%
	distinct(continent) # the `distinct()` function returns the list of all the
	# continent names after the filter is applied

	```

	## Your Turn 1.3: sorting with `arrange()`

	Find the records with the smallest population.
	```{r}
	#answer
	gapminder %>%
	arrange(pop)

	```

	Find the records with the largest GDP per capita.
	```{r}
	#answer
	gapminder %>% arrange(desc(gdpPercap))

	#answer
	arrange(gapminder, desc(gdpPercap))

	#answer
	gapminder %>% arrange(-gdpPercap)


	```


	## Your Turn 1.4: creating new variables

	Calculate a new variable with total GDP

	```{r, error = TRUE}
	# answer
	gapminder %>%
	mutate(totalGDP = pop * gdpPercap)


	```

	Create a summary table with the population and GDP by continent for the year 1952

	```{r}
	# answer
	gapminder %>%
	filter(year == 1952) %>% # filter to get the records we need
	mutate(totalGDP = pop * gdpPercap) %>% # note that we need to add the new variable calculated above!
	group_by(continent) %>% # this defines the grouping category
	summarize(pop = sum(pop), #
	totalGDP = sum(totalGDP)) #
	```




	## Your Turn 1.5

	Alter the code to add a `prev_lifeExp` column that contains the life expectancy from the previous record.

	(Hint: use "Transformation" cheatsheet, you want to offset elements by one)

	Extra challenge: Why isn't this quite the 'life expectency five years ago'?

	```{r}
	# answer
	gapminder %>%
	mutate(prev_lifeExp = lag(lifeExp))

	# challenge hint:
	# - note that there is an `NA` in the first row, since there's no prior row to pull from
	# - look at the `prev_lifeExp` when the country changes from Afghanistan to Albania
	```


	## Take aways

	* Extract cases with `filter()`
	* Filter using a list with `%in%`
	* Make new variables, with `mutate()`
	* Connect operations with `%>%`

	***

	# Transform Data 2: summarize

	<!-- This file by Charlotte Wickham (with some modifications by Martin Monkman) is licensed under a Creative Commons Attribution 4.0 International License, adapted from the orignal work at https://github.com/rstudio/master-the-tidyverse by RStudio. -->


	```{r}
	library(tidyverse)
	library(gapminder)

	```

	## gapminder

	Take a quick look at the data in the {gapminder} package.

	```{r}
	gapminder
	```



	## Your Turn 2.1

	Use summarise() to compute three statistics about the data:

	* The first (minimum) year in the dataset
	* The last (maximum) year in the dataset
	* The number of unique countries

	```{r}
	# answer
	gapminder %>%
	summarize(year_min = min(year),
	year_max = max(year),
	n_countries = n_distinct(country))

	```

	## Your Turn 2.2

	Extract the rows for African countries in 2007.

	Then find:

	1. The number of unique countries
	2. The median life expectancy of African countries as a group

	```{r}
	# answer
	gapminder %>%
	filter(continent == "Africa", year == 2007) %>%
	summarise(n_countries = n_distinct(country),
	med_life_exp = median(lifeExp))


	```


	## Your Turn 2.3

	Find the median life expectancy by continent in 2007.

	```{r}
	gapminder %>%
	filter(year == 2007) %>%
	group_by(continent) %>%
	summarise(med_life_exp = median(lifeExp))

	```


	## Your Turn 2.4

	Find the median life expectancy by continent for each year in the dataframe.

	```{r}

	gapminder %>%
	group_by(continent, year) %>%
	summarize(median_lifeExp = median(lifeExp))

	```
	## Your Turn 2.5

	Brainstorm with your neighbor the sequence of operations to find: the country with biggest jump in life expectancy (between any two consecutive records) for each continent.

	## Your Turn 2.6

	Find the country with biggest jump in life expectancy (between any two consecutive records) for each continent.

	```{r}
	# One of many solutions
	gapminder %>%
	group_by(country) %>%
	mutate(prev_lifeExp = lag(lifeExp),
	jump = lifeExp - prev_lifeExp) %>%
	arrange(desc(jump))


	```


	## Take aways

	* Make tables of summaries with `summarise()`
	* Do groupwise operations with `group_by()`

	***



	# Transform Data 3: pivot

	<!-- This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License -->


	```{r}
	library(tidyverse)
	# data packages
	library(gapminder)

	```


	## Your Turn 3.1

	A pivot table crosstab using the {mpg} data package of automobile fuel economy.

	First, a look at the source table.

	```{r}
	mpg
	```

	Using `group_by` and `summarise` to create a summary table of the average engine displacement by vehicle class:

	```{r}
	mpg %>%
	group_by(class) %>%
	summarise(displ_mean = mean(displ))

	```

	Now do the same with vehicle class and number of cylinders:

	```{r}
	# answer
	class_by_cyl <- mpg %>% # note that an ojbect `class_by_cyl` is created
	group_by(class, cyl) %>%
	summarise(displ_mean = mean(displ))

	class_by_cyl # in order to print the object, just name it

	```


	What happens if you reverse the order of the grouping variables?



	Create a pivot (crosstab) table with class by number of cylinders:


	```{r}
	# answer
	# pivot table (wide)

	class_by_cyl_pivot <- class_by_cyl %>%
	pivot_wider(names_from = cyl, values_from = displ_mean)

	class_by_cyl_pivot


	```

	Now, unpivot it back to the original structure:

	```{r}
	# and back to longer...

	displ_class_by_cyl <- class_by_cyl_pivot %>%
	pivot_longer(-class, names_to = "cyl", values_to = "displ_mean")

	displ_class_by_cyl

	```

	What do you notice about the structure of the unpivoted table?



	## Your Turn 3.2

	Making some summary tables with the {gapminder} data.

	Create a table where:

	* object name `lifeExp_by_continent` that has
	* the median life expectency for each continent for every year in the dataframe,
	* where the continent is the column and the year is the row.

	(This is a restructuring of the table from 2.4)

	```{r}
	# answer

	lifeExp_by_continent <- gapminder %>%
	group_by(continent, year) %>%
	summarize(median_lifeExp = median(lifeExp)) %>%
	pivot_wider(names_from = continent, values_from = median_lifeExp)

	lifeExp_by_continent

	```

	## Your Turn 3.3

	Create a table that shows:

	* GDP per capita for
	* the countries Canada, United States, and Mexico are the columns, and
	* the years after 1980 are the rows

	```{r}

	gapminder %>%
	filter(year > 1980,
	country %in% c("Canada", "United States", "Mexico")) %>%
	select(country, year, gdpPercap) %>% # select() is covered in section 4
	pivot_wider(names_from = country, values_from = gdpPercap)

	```



	## Take aways

	* Restructure data tabulations with `pivot_wider()` and `pivot_longer`


	***

	# Transform data 4: select()

	<!-- This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License -->


	```{r}
	library(tidyverse)
	# data packages
	library(gapminder)

	```


	## Your Turn 4.1

	{gapminder} again

	```{r}
	gapminder
	```

	Using `select()` to reduce the number of variables:

	```{r}
	# answer
	gapminder %>%
	select(country, year, lifeExp)

	```



	You can use the minus sign to drop variable--that is, to define ones to _not_ select.

	In this example, note we can use `c()` to create a list:


	```{r}
	# answer
	gapminder %>%
	select(-(c(continent, pop, gdpPercap)))

	# alternate answer
	gapminder %>%
	select(-continent, -pop, -gdpPercap)

	```



	You can define a range of columns to select by using the `:`

	```{r}
	# answer
	gapminder %>%
	select(country, year:pop)
	```


	## Take aways

	* Restructure data frames with `select()` and `pull()`
	* Reinforce other data transformation functions