swayson · March 15, 2015 12:43
diff --git a/dplyr-tut-02.Rmd b/dplyr-tut-02.Rmd
 ---
 title: 'Going deeper with dplyr: New features in 0.3 and 0.4'
 output: html_document
 ---

 ## Introduction

 In August 2014, I created a [40-minute video tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc) introducing the key functionality of the dplyr package in R, using dplyr version 0.2. Since then, there have been two significant updates to dplyr (0.3 and 0.4), introducing a ton of new features.

 This document (created in March 2015) covers the most useful new features in 0.3 and 0.4, as well as other functionality that I didn't cover last time (though it is not necessarily new). My [new video tutorial](https://www.youtube.com/watch?v=2mh1PqfsXVI) walks through the code below in detail.

 **If you have not watched the [previous tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc)**, I recommend you do so first since it covers some dplyr basics that will not be covered in this tutorial.


 ## Loading dplyr and the nycflights13 dataset

 Although my last tutorial used data from the hflights package, Hadley Wickham has rewritten the [dplyr vignettes](http://cran.r-project.org/web/packages/dplyr/index.html) to use the nycflights13 package instead, and so I'm also using nycflights13 for the sake of consistency.

 ```{r eval=FALSE}
 # remove flights data if you just finished my previous tutorial
 rm(flights)
 ```

 ```{r}
 # load packages
 suppressMessages(library(dplyr))
 library(nycflights13)

 # print the flights dataset from nycflights13
 flights
 ```


 ## Choosing columns: select, rename

 ```{r}
 # besides just using select() to pick columns...
 flights %>% select(carrier, flight)

 # ...you can use the minus sign to hide columns
 flights %>% select(-month, -day)
 ```

 ```{r results='hide'}
 # hide a range of columns
 flights %>% select(-(dep_time:arr_delay))

 # hide any column with a matching name
 flights %>% select(-contains("time"))
 ```

 ```{r}
 # pick columns using a character vector of column names
 cols <- c("carrier", "flight", "tailnum")
 flights %>% select(one_of(cols))
 ```

 ```{r}
 # select() can be used to rename columns, though all columns not mentioned are dropped
 flights %>% select(tail = tailnum)

 # rename() does the same thing, except all columns not mentioned are kept
 flights %>% rename(tail = tailnum)
 ```


 ## Choosing rows: filter, between, slice, sample_n, top_n, distinct

 ```{r}
 # filter() supports the use of multiple conditions
 flights %>% filter(dep_time >= 600, dep_time <= 605)
 ```

 ```{r results='hide'}
 # between() is a concise alternative for determing if numeric values fall in a range
 flights %>% filter(between(dep_time, 600, 605))

 # side note: is.na() can also be useful when filtering
 flights %>% filter(!is.na(dep_time))
 ```


 ```{r}
 # slice() filters rows by position
 flights %>% slice(1000:1005)

 # keep the first three rows within each group
 flights %>% group_by(month, day) %>% slice(1:3)

 # sample three rows from each group
 flights %>% group_by(month, day) %>% sample_n(3)

 # keep three rows from each group with the top dep_delay
 flights %>% group_by(month, day) %>% top_n(3, dep_delay)

 # also sort by dep_delay within each group
 flights %>% group_by(month, day) %>% top_n(3, dep_delay) %>% arrange(desc(dep_delay))
 ```


 ```{r}
 # unique rows can be identified using unique() from base R
 flights %>% select(origin, dest) %>% unique()
 ```

 ```{r results='hide'}
 # dplyr provides an alternative that is more "efficient"
 flights %>% select(origin, dest) %>% distinct()

 # side note: when chaining, you don't have to include the parentheses if there are no arguments
 flights %>% select(origin, dest) %>% distinct
 ```


 ## Adding new variables: mutate, transmute, add_rownames

 ```{r}
 # mutate() creates a new variable (and keeps all existing variables)
 flights %>% mutate(speed = distance/air_time*60)

 # transmute() only keeps the new variables
 flights %>% transmute(speed = distance/air_time*60)
 ```


 ```{r}
 # example data frame with row names
 mtcars %>% head()

 # add_rownames() turns row names into an explicit variable
 mtcars %>% add_rownames("model") %>% head()

 # side note: dplyr no longer prints row names (ever) for local data frames
 mtcars %>% tbl_df()
 ```


 ## Grouping and counting: summarise, tally, count, group_size, n_groups, ungroup

 ```{r}
 # summarise() can be used to count the number of rows in each group
 flights %>% group_by(month) %>% summarise(cnt = n())
 ```

 ```{r results='hide'}
 # tally() and count() can do this more concisely
 flights %>% group_by(month) %>% tally()
 flights %>% count(month)
 ```

 ```{r}
 # you can sort by the count
 flights %>% group_by(month) %>% summarise(cnt = n()) %>% arrange(desc(cnt))
 ```

 ```{r results='hide'}
 # tally() and count() have a sort parameter for this purpose
 flights %>% group_by(month) %>% tally(sort=TRUE)
 flights %>% count(month, sort=TRUE)
 ```

 ```{r}
 # you can sum over a specific variable instead of simply counting rows
 flights %>% group_by(month) %>% summarise(dist = sum(distance))
 ```

 ```{r results='hide'}
 # tally() and count() have a wt parameter for this purpose
 flights %>% group_by(month) %>% tally(wt = distance)
 flights %>% count(month, wt = distance)
 ```

 ```{r}
 # group_size() returns the counts as a vector
 flights %>% group_by(month) %>% group_size()

 # n_groups() simply reports the number of groups
 flights %>% group_by(month) %>% n_groups()
 ```

 ```{r}
 # group by two variables, summarise, arrange (output is possibly confusing)
 flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% arrange(desc(cnt)) %>% print(n = 40)

 # ungroup() before arranging to arrange across all groups
 flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% ungroup() %>% arrange(desc(cnt))
 ```


 ## Creating data frames: data_frame

 `data_frame()` is a better way than `data.frame()` for creating data frames. Benefits of `data_frame()`:

 * You can use previously defined columns to compute new columns.
 * It never coerces column types.
 * It never munges column names.
 * It never adds row names. 
 * It only recycles length 1 input.
 * It returns a local data frame (a tbl_df).

 ```{r}
 # data_frame() example
 data_frame(a = 1:6, b = a*2, c = 'string', 'd+e' = 1) %>% glimpse()

 # data.frame() example
 data.frame(a = 1:6, c = 'string', 'd+e' = 1) %>% glimpse()
 ```


 ## Joining (merging) tables: left_join, right_join, inner_join, full_join, semi_join, anti_join

 ```{r}
 # create two simple data frames
 (a <- data_frame(color = c("green","yellow","red"), num = 1:3))
 (b <- data_frame(color = c("green","yellow","pink"), size = c("S","M","L")))

 # only include observations found in both "a" and "b" (automatically joins on variables that appear in both tables)
 inner_join(a, b)

 # include observations found in either "a" or "b"
 full_join(a, b)

 # include all observations found in "a"
 left_join(a, b)

 # include all observations found in "b"
 right_join(a, b)

 # right_join(a, b) is identical to left_join(b, a) except for column ordering
 left_join(b, a)

 # filter "a" to only show observations that match "b"
 semi_join(a, b)

 # filter "a" to only show observations that don't match "b"
 anti_join(a, b)
 ```


 ```{r}
 # sometimes matching variables don't have identical names
 b <- b %>% rename(col = color)

 # specify that the join should occur by matching "color" in "a" with "col" in "b"
 inner_join(a, b, by=c("color" = "col"))
 ```


 ## Viewing more output: print, View

 ```{r}
 # specify that you want to see more rows
 flights %>% print(n = 15)
 ```

 ```{r eval=FALSE}
 # specify that you want to see ALL rows (don't run this!)
 flights %>% print(n = Inf)
 ```

 ```{r}
 # specify that you want to see all columns
 flights %>% print(width = Inf)
 ```

 ```{r eval=FALSE}
 # show up to 1000 rows and all columns
 flights %>% View()

 # set option to see all columns and fewer rows
 options(dplyr.width = Inf, dplyr.print_min = 6)

 # reset options (or just close R)
 options(dplyr.width = NULL, dplyr.print_min = 10)
 ```


 ## Resources

 * Release announcements for [version 0.3](http://blog.rstudio.org/2014/10/13/dplyr-0-3-2/) and [version 0.4](http://blog.rstudio.org/2015/01/09/dplyr-0-4-0/)
 * [dplyr reference manual and vignettes](http://cran.r-project.org/web/packages/dplyr/)
 * [Two-table vignette](http://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html) covering joins and set operations
 * [RStudio's Data Wrangling Cheat Sheet](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) for dplyr and tidyr
 * [dplyr GitHub repo](https://github.com/hadley/dplyr) and [list of releases](https://github.com/hadley/dplyr/releases)


 ## Data School

 * [Blog](http://www.dataschool.io/)
 * [Email newsletter](http://www.dataschool.io/subscribe/)
 * [YouTube channel](http://youtube.com/dataschool)


 < END OF DOCUMENT >
	---
	title: 'Going deeper with dplyr: New features in 0.3 and 0.4'
	output: html_document
	---

	## Introduction

	In August 2014, I created a [40-minute video tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc) introducing the key functionality of the dplyr package in R, using dplyr version 0.2. Since then, there have been two significant updates to dplyr (0.3 and 0.4), introducing a ton of new features.

	This document (created in March 2015) covers the most useful new features in 0.3 and 0.4, as well as other functionality that I didn't cover last time (though it is not necessarily new). My [new video tutorial](https://www.youtube.com/watch?v=2mh1PqfsXVI) walks through the code below in detail.

	If you have not watched the [previous tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc), I recommend you do so first since it covers some dplyr basics that will not be covered in this tutorial.


	## Loading dplyr and the nycflights13 dataset

	Although my last tutorial used data from the hflights package, Hadley Wickham has rewritten the [dplyr vignettes](http://cran.r-project.org/web/packages/dplyr/index.html) to use the nycflights13 package instead, and so I'm also using nycflights13 for the sake of consistency.

	```{r eval=FALSE}
	# remove flights data if you just finished my previous tutorial
	rm(flights)
	```

	```{r}
	# load packages
	suppressMessages(library(dplyr))
	library(nycflights13)

	# print the flights dataset from nycflights13
	flights
	```


	## Choosing columns: select, rename

	```{r}
	# besides just using select() to pick columns...
	flights %>% select(carrier, flight)

	# ...you can use the minus sign to hide columns
	flights %>% select(-month, -day)
	```

	```{r results='hide'}
	# hide a range of columns
	flights %>% select(-(dep_time:arr_delay))

	# hide any column with a matching name
	flights %>% select(-contains("time"))
	```

	```{r}
	# pick columns using a character vector of column names
	cols <- c("carrier", "flight", "tailnum")
	flights %>% select(one_of(cols))
	```

	```{r}
	# select() can be used to rename columns, though all columns not mentioned are dropped
	flights %>% select(tail = tailnum)

	# rename() does the same thing, except all columns not mentioned are kept
	flights %>% rename(tail = tailnum)
	```


	## Choosing rows: filter, between, slice, sample_n, top_n, distinct

	```{r}
	# filter() supports the use of multiple conditions
	flights %>% filter(dep_time >= 600, dep_time <= 605)
	```

	```{r results='hide'}
	# between() is a concise alternative for determing if numeric values fall in a range
	flights %>% filter(between(dep_time, 600, 605))

	# side note: is.na() can also be useful when filtering
	flights %>% filter(!is.na(dep_time))
	```


	```{r}
	# slice() filters rows by position
	flights %>% slice(1000:1005)

	# keep the first three rows within each group
	flights %>% group_by(month, day) %>% slice(1:3)

	# sample three rows from each group
	flights %>% group_by(month, day) %>% sample_n(3)

	# keep three rows from each group with the top dep_delay
	flights %>% group_by(month, day) %>% top_n(3, dep_delay)

	# also sort by dep_delay within each group
	flights %>% group_by(month, day) %>% top_n(3, dep_delay) %>% arrange(desc(dep_delay))
	```


	```{r}
	# unique rows can be identified using unique() from base R
	flights %>% select(origin, dest) %>% unique()
	```

	```{r results='hide'}
	# dplyr provides an alternative that is more "efficient"
	flights %>% select(origin, dest) %>% distinct()

	# side note: when chaining, you don't have to include the parentheses if there are no arguments
	flights %>% select(origin, dest) %>% distinct
	```


	## Adding new variables: mutate, transmute, add_rownames

	```{r}
	# mutate() creates a new variable (and keeps all existing variables)
	flights %>% mutate(speed = distance/air_time*60)

	# transmute() only keeps the new variables
	flights %>% transmute(speed = distance/air_time*60)
	```


	```{r}
	# example data frame with row names
	mtcars %>% head()

	# add_rownames() turns row names into an explicit variable
	mtcars %>% add_rownames("model") %>% head()

	# side note: dplyr no longer prints row names (ever) for local data frames
	mtcars %>% tbl_df()
	```


	## Grouping and counting: summarise, tally, count, group_size, n_groups, ungroup

	```{r}
	# summarise() can be used to count the number of rows in each group
	flights %>% group_by(month) %>% summarise(cnt = n())
	```

	```{r results='hide'}
	# tally() and count() can do this more concisely
	flights %>% group_by(month) %>% tally()
	flights %>% count(month)
	```

	```{r}
	# you can sort by the count
	flights %>% group_by(month) %>% summarise(cnt = n()) %>% arrange(desc(cnt))
	```

	```{r results='hide'}
	# tally() and count() have a sort parameter for this purpose
	flights %>% group_by(month) %>% tally(sort=TRUE)
	flights %>% count(month, sort=TRUE)
	```

	```{r}
	# you can sum over a specific variable instead of simply counting rows
	flights %>% group_by(month) %>% summarise(dist = sum(distance))
	```

	```{r results='hide'}
	# tally() and count() have a wt parameter for this purpose
	flights %>% group_by(month) %>% tally(wt = distance)
	flights %>% count(month, wt = distance)
	```

	```{r}
	# group_size() returns the counts as a vector
	flights %>% group_by(month) %>% group_size()

	# n_groups() simply reports the number of groups
	flights %>% group_by(month) %>% n_groups()
	```

	```{r}
	# group by two variables, summarise, arrange (output is possibly confusing)
	flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% arrange(desc(cnt)) %>% print(n = 40)

	# ungroup() before arranging to arrange across all groups
	flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% ungroup() %>% arrange(desc(cnt))
	```


	## Creating data frames: data_frame

	`data_frame()` is a better way than `data.frame()` for creating data frames. Benefits of `data_frame()`:

	* You can use previously defined columns to compute new columns.
	* It never coerces column types.
	* It never munges column names.
	* It never adds row names.
	* It only recycles length 1 input.
	* It returns a local data frame (a tbl_df).

	```{r}
	# data_frame() example
	data_frame(a = 1:6, b = a*2, c = 'string', 'd+e' = 1) %>% glimpse()

	# data.frame() example
	data.frame(a = 1:6, c = 'string', 'd+e' = 1) %>% glimpse()
	```


	## Joining (merging) tables: left_join, right_join, inner_join, full_join, semi_join, anti_join

	```{r}
	# create two simple data frames
	(a <- data_frame(color = c("green","yellow","red"), num = 1:3))
	(b <- data_frame(color = c("green","yellow","pink"), size = c("S","M","L")))

	# only include observations found in both "a" and "b" (automatically joins on variables that appear in both tables)
	inner_join(a, b)

	# include observations found in either "a" or "b"
	full_join(a, b)

	# include all observations found in "a"
	left_join(a, b)

	# include all observations found in "b"
	right_join(a, b)

	# right_join(a, b) is identical to left_join(b, a) except for column ordering
	left_join(b, a)

	# filter "a" to only show observations that match "b"
	semi_join(a, b)

	# filter "a" to only show observations that don't match "b"
	anti_join(a, b)
	```


	```{r}
	# sometimes matching variables don't have identical names
	b <- b %>% rename(col = color)

	# specify that the join should occur by matching "color" in "a" with "col" in "b"
	inner_join(a, b, by=c("color" = "col"))
	```


	## Viewing more output: print, View

	```{r}
	# specify that you want to see more rows
	flights %>% print(n = 15)
	```

	```{r eval=FALSE}
	# specify that you want to see ALL rows (don't run this!)
	flights %>% print(n = Inf)
	```

	```{r}
	# specify that you want to see all columns
	flights %>% print(width = Inf)
	```

	```{r eval=FALSE}
	# show up to 1000 rows and all columns
	flights %>% View()

	# set option to see all columns and fewer rows
	options(dplyr.width = Inf, dplyr.print_min = 6)

	# reset options (or just close R)
	options(dplyr.width = NULL, dplyr.print_min = 10)
	```


	## Resources

	* Release announcements for [version 0.3](http://blog.rstudio.org/2014/10/13/dplyr-0-3-2/) and [version 0.4](http://blog.rstudio.org/2015/01/09/dplyr-0-4-0/)
	* [dplyr reference manual and vignettes](http://cran.r-project.org/web/packages/dplyr/)
	* [Two-table vignette](http://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html) covering joins and set operations
	* [RStudio's Data Wrangling Cheat Sheet](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) for dplyr and tidyr
	* [dplyr GitHub repo](https://github.com/hadley/dplyr) and [list of releases](https://github.com/hadley/dplyr/releases)


	## Data School

	* [Blog](http://www.dataschool.io/)
	* [Email newsletter](http://www.dataschool.io/subscribe/)
	* [YouTube channel](http://youtube.com/dataschool)


	< END OF DOCUMENT >