swayson · March 15, 2015 12:42
diff --git a/dplyr-tut-02.rmd b/dplyr-tut-02.rmd
 ---
 title: "Introduction to dplyr for Faster Data Manipulation in R"
 output: html_document
 ---

 Note: There is a 40-minute [video tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc) on YouTube that walks through this document in detail.

 ## Why do I use dplyr?

 * Great for data exploration and transformation
 * Intuitive to write and easy to read, especially when using the "chaining" syntax (covered below)
 * Fast on data frames


 ## dplyr functionality

 * Five basic verbs: `filter`, `select`, `arrange`, `mutate`, `summarise` (plus `group_by`)
 * Can work with data stored in databases and [data tables](http://datatable.r-forge.r-project.org/)
 * Joins: inner join, left join, semi-join, anti-join (not covered below)
 * Window functions for calculating ranking, offsets, and more
 * [Better than plyr](http://blog.rstudio.org/2014/01/17/introducing-dplyr/) if you're only working with data frames (though it doesn't yet duplicate all of the plyr functionality)
 * Examples below are based upon the [latest release](https://github.com/hadley/dplyr/releases), version 0.2 (released May 2014)


 ## Loading dplyr and an example dataset

 * dplyr will mask a few base functions
 * If you also use plyr, load plyr first
 * hflights is flights departing from two Houston airports in 2011

 ```{r}
 # load packages
 suppressMessages(library(dplyr))
 library(hflights)

 # explore data
 data(hflights)
 head(hflights)
 ```

 * `tbl_df` creates a "local data frame"
 * Local data frame is simply a wrapper for a data frame that prints nicely

 ```{r}
 # convert to local data frame
 flights <- tbl_df(hflights)

 # printing only shows 10 rows and as many columns as can fit on your screen
 flights
 ```

 ```{r results='hide'}
 # you can specify that you want to see more rows
 print(flights, n=20)

 # convert to a normal data frame to see all of the columns
 data.frame(head(flights))
 ```


 ## filter: Keep rows matching criteria

 * Base R approach to filtering forces you to repeat the data frame's name
 * dplyr approach is simpler to write and read
 * Command structure (for all dplyr verbs):
    * first argument is a data frame
    * return value is a data frame
    * nothing is modified in place
 * Note: dplyr generally does not preserve row names

 ```{r results='hide'}
 # base R approach to view all flights on January 1
 flights[flights$Month==1 & flights$DayofMonth==1, ]
 ```

 ```{r}
 # dplyr approach
 # note: you can use comma or ampersand to represent AND condition
 filter(flights, Month==1, DayofMonth==1)

 # use pipe for OR condition
 filter(flights, UniqueCarrier=="AA" | UniqueCarrier=="UA")
 ```

 ```{r results='hide'}
 # you can also use %in% operator
 filter(flights, UniqueCarrier %in% c("AA", "UA"))
 ```


 ## select: Pick columns by name

 * Base R approach is awkward to type and to read
 * dplyr approach uses similar syntax to `filter`
 * Like a SELECT in SQL

 ```{r results='hide'}
 # base R approach to select DepTime, ArrTime, and FlightNum columns
 flights[, c("DepTime", "ArrTime", "FlightNum")]
 ```

 ```{r}
 # dplyr approach
 select(flights, DepTime, ArrTime, FlightNum)

 # use colon to select multiple contiguous columns, and use `contains` to match columns by name
 # note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name
 select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
 ```


 ## "Chaining" or "Pipelining"

 * Usual way to perform multiple operations in one line is by nesting
 * Can write commands in a natural order by using the `%>%` infix operator (which can be pronounced as "then")

 ```{r results='hide'}
 # nesting method to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes
 filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60)
 ```

 ```{r}
 # chaining method
 flights %>%
    select(UniqueCarrier, DepDelay) %>%
    filter(DepDelay > 60)
 ```

 * Chaining increases readability significantly when there are many commands
 * Operator is automatically imported from the [magrittr](https://github.com/smbache/magrittr) package
 * Can be used to replace nesting in R commands outside of dplyr

 ```{r results='hide'}
 # create two vectors and calculate Euclidian distance between them
 x1 <- 1:5; x2 <- 2:6
 sqrt(sum((x1-x2)^2))
 ```

 ```{r}
 # chaining method
 (x1-x2)^2 %>% sum() %>% sqrt()
 ```


 ## arrange: Reorder rows

 ```{r results='hide'}
 # base R approach to select UniqueCarrier and DepDelay columns and sort by DepDelay
 flights[order(flights$DepDelay), c("UniqueCarrier", "DepDelay")]
 ```

 ```{r}
 # dplyr approach
 flights %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(DepDelay)
 ```

 ```{r results='hide'}
 # use `desc` for descending
 flights %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(desc(DepDelay))
 ```


 ## mutate: Add new variables

 * Create new variables that are functions of existing variables

 ```{r results='hide'}
 # base R approach to create a new variable Speed (in mph)
 flights$Speed <- flights$Distance / flights$AirTime*60
 flights[, c("Distance", "AirTime", "Speed")]
 ```

 ```{r}
 # dplyr approach (prints the new variable but does not store it)
 flights %>%
    select(Distance, AirTime) %>%
    mutate(Speed = Distance/AirTime*60)

 # store the new variable
 flights <- flights %>% mutate(Speed = Distance/AirTime*60)
 ```


 ## summarise: Reduce variables to values

 * Primarily useful with data that has been grouped by one or more variables
 * `group_by` creates the groups that will be operated on
 * `summarise` uses the provided aggregation function to summarise each group

 ```{r results='hide'}
 # base R approaches to calculate the average arrival delay to each destination
 head(with(flights, tapply(ArrDelay, Dest, mean, na.rm=TRUE)))
 head(aggregate(ArrDelay ~ Dest, flights, mean))
 ```

 ```{r}
 # dplyr approach: create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay
 flights %>%
    group_by(Dest) %>%
    summarise(avg_delay = mean(ArrDelay, na.rm=TRUE))
 ```

 * `summarise_each` allows you to apply the same summary function to multiple columns at once
 * Note: `mutate_each` is also available

 ```{r}
 # for each carrier, calculate the percentage of flights cancelled or diverted
 flights %>%
    group_by(UniqueCarrier) %>%
    summarise_each(funs(mean), Cancelled, Diverted)

 # for each carrier, calculate the minimum and maximum arrival and departure delays
 flights %>%
    group_by(UniqueCarrier) %>%
    summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)), matches("Delay"))
 ```

 * Helper function `n()` counts the number of rows in a group
 * Helper function `n_distinct(vector)` counts the number of unique items in that vector

 ```{r}
 # for each day of the year, count the total number of flights and sort in descending order
 flights %>%
    group_by(Month, DayofMonth) %>%
    summarise(flight_count = n()) %>%
    arrange(desc(flight_count))

 # rewrite more simply with the `tally` function
 flights %>%
    group_by(Month, DayofMonth) %>%
    tally(sort = TRUE)

 # for each destination, count the total number of flights and the number of distinct planes that flew there
 flights %>%
    group_by(Dest) %>%
    summarise(flight_count = n(), plane_count = n_distinct(TailNum))
 ```

 * Grouping can sometimes be useful without summarising

 ```{r}
 # for each destination, show the number of cancelled and not cancelled flights
 flights %>%
    group_by(Dest) %>%
    select(Cancelled) %>%
    table() %>%
    head()
 ```


 ## Window Functions

 * Aggregation function (like `mean`) takes n inputs and returns 1 value
 * [Window function](http://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html) takes n inputs and returns n values
    * Includes ranking and ordering functions (like `min_rank`), offset functions (`lead` and `lag`), and cumulative aggregates (like `cummean`).

 ```{r results='hide'}
 # for each carrier, calculate which two days of the year they had their longest departure delays
 # note: smallest (not largest) value is ranked as 1, so you have to use `desc` to rank by largest value
 flights %>%
    group_by(UniqueCarrier) %>%
    select(Month, DayofMonth, DepDelay) %>%
    filter(min_rank(desc(DepDelay)) <= 2) %>%
    arrange(UniqueCarrier, desc(DepDelay))
 ```

 ```{r}
 # rewrite more simply with the `top_n` function
 flights %>%
    group_by(UniqueCarrier) %>%
    select(Month, DayofMonth, DepDelay) %>%
    top_n(2) %>%
    arrange(UniqueCarrier, desc(DepDelay))

 # for each month, calculate the number of flights and the change from the previous month
 flights %>%
    group_by(Month) %>%
    summarise(flight_count = n()) %>%
    mutate(change = flight_count - lag(flight_count))

 # rewrite more simply with the `tally` function
 flights %>%
    group_by(Month) %>%
    tally() %>%
    mutate(change = n - lag(n))
 ```


 ## Other Useful Convenience Functions

 ```{r}
 # randomly sample a fixed number of rows, without replacement
 flights %>% sample_n(5)

 # randomly sample a fraction of rows, with replacement
 flights %>% sample_frac(0.25, replace=TRUE)

 # base R approach to view the structure of an object
 str(flights)

 # dplyr approach: better formatting, and adapts to your screen width
 glimpse(flights)
 ```


 ## Connecting to Databases

 * dplyr can connect to a database as if the data was loaded into a data frame
 * Use the same syntax for local data frames and databases
 * Only generates SELECT statements
 * Currently supports SQLite, PostgreSQL/Redshift, MySQL/MariaDB, BigQuery, MonetDB
 * Example below is based upon an SQLite database containing the hflights data
    * Instructions for creating this database are in the [databases vignette](http://cran.r-project.org/web/packages/dplyr/vignettes/databases.html)

 ```{r}
 # connect to an SQLite database containing the hflights data
 my_db <- src_sqlite("my_db.sqlite3")

 # connect to the "hflights" table in that database
 flights_tbl <- tbl(my_db, "hflights")

 # example query with our data frame
 flights %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(desc(DepDelay))

 # identical query using the database
 flights_tbl %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(desc(DepDelay))
 ```

 * You can write the SQL commands yourself
 * dplyr can tell you the SQL it plans to run and the query execution plan

 ```{r}
 # send SQL commands to the database
 tbl(my_db, sql("SELECT * FROM hflights LIMIT 100"))

 # ask dplyr for the SQL commands
 flights_tbl %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(desc(DepDelay)) %>%
    explain()
 ```


 ## Resources

 * [Official dplyr reference manual and vignettes on CRAN](http://cran.r-project.org/web/packages/dplyr/index.html): vignettes are well-written and cover many aspects of dplyr
 * [July 2014 webinar about dplyr (and ggvis) by Hadley Wickham](http://pages.rstudio.net/Webinar-Series-Recording-Essential-Tools-for-R.html) and related [slides/code](https://github.com/rstudio/webinars/tree/master/2014-01): mostly conceptual, with a bit of code
 * [dplyr tutorial by Hadley Wickham](https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a) at the [useR! 2014 conference](http://user2014.stat.ucla.edu/): excellent, in-depth tutorial with lots of example code (Dropbox link includes slides, code files, and data files)
 * [dplyr GitHub repo](https://github.com/hadley/dplyr) and [list of releases](https://github.com/hadley/dplyr/releases)


 < END OF DOCUMENT >
	---
	title: "Introduction to dplyr for Faster Data Manipulation in R"
	output: html_document
	---

	Note: There is a 40-minute [video tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc) on YouTube that walks through this document in detail.

	## Why do I use dplyr?

	* Great for data exploration and transformation
	* Intuitive to write and easy to read, especially when using the "chaining" syntax (covered below)
	* Fast on data frames


	## dplyr functionality

	* Five basic verbs: `filter`, `select`, `arrange`, `mutate`, `summarise` (plus `group_by`)
	* Can work with data stored in databases and [data tables](http://datatable.r-forge.r-project.org/)
	* Joins: inner join, left join, semi-join, anti-join (not covered below)
	* Window functions for calculating ranking, offsets, and more
	* [Better than plyr](http://blog.rstudio.org/2014/01/17/introducing-dplyr/) if you're only working with data frames (though it doesn't yet duplicate all of the plyr functionality)
	* Examples below are based upon the [latest release](https://github.com/hadley/dplyr/releases), version 0.2 (released May 2014)


	## Loading dplyr and an example dataset

	* dplyr will mask a few base functions
	* If you also use plyr, load plyr first
	* hflights is flights departing from two Houston airports in 2011

	```{r}
	# load packages
	suppressMessages(library(dplyr))
	library(hflights)

	# explore data
	data(hflights)
	head(hflights)
	```

	* `tbl_df` creates a "local data frame"
	* Local data frame is simply a wrapper for a data frame that prints nicely

	```{r}
	# convert to local data frame
	flights <- tbl_df(hflights)

	# printing only shows 10 rows and as many columns as can fit on your screen
	flights
	```

	```{r results='hide'}
	# you can specify that you want to see more rows
	print(flights, n=20)

	# convert to a normal data frame to see all of the columns
	data.frame(head(flights))
	```


	## filter: Keep rows matching criteria

	* Base R approach to filtering forces you to repeat the data frame's name
	* dplyr approach is simpler to write and read
	* Command structure (for all dplyr verbs):
	* first argument is a data frame
	* return value is a data frame
	* nothing is modified in place
	* Note: dplyr generally does not preserve row names

	```{r results='hide'}
	# base R approach to view all flights on January 1
	flights[flights$Month==1 & flights$DayofMonth==1, ]
	```

	```{r}
	# dplyr approach
	# note: you can use comma or ampersand to represent AND condition
	filter(flights, Month==1, DayofMonth==1)

	# use pipe for OR condition
	filter(flights, UniqueCarrier=="AA" \| UniqueCarrier=="UA")
	```

	```{r results='hide'}
	# you can also use %in% operator
	filter(flights, UniqueCarrier %in% c("AA", "UA"))
	```


	## select: Pick columns by name

	* Base R approach is awkward to type and to read
	* dplyr approach uses similar syntax to `filter`
	* Like a SELECT in SQL

	```{r results='hide'}
	# base R approach to select DepTime, ArrTime, and FlightNum columns
	flights[, c("DepTime", "ArrTime", "FlightNum")]
	```

	```{r}
	# dplyr approach
	select(flights, DepTime, ArrTime, FlightNum)

	# use colon to select multiple contiguous columns, and use `contains` to match columns by name
	# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name
	select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
	```


	## "Chaining" or "Pipelining"

	* Usual way to perform multiple operations in one line is by nesting
	* Can write commands in a natural order by using the `%>%` infix operator (which can be pronounced as "then")

	```{r results='hide'}
	# nesting method to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes
	filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60)
	```

	```{r}
	# chaining method
	flights %>%
	select(UniqueCarrier, DepDelay) %>%
	filter(DepDelay > 60)
	```

	* Chaining increases readability significantly when there are many commands
	* Operator is automatically imported from the [magrittr](https://github.com/smbache/magrittr) package
	* Can be used to replace nesting in R commands outside of dplyr

	```{r results='hide'}
	# create two vectors and calculate Euclidian distance between them
	x1 <- 1:5; x2 <- 2:6
	sqrt(sum((x1-x2)^2))
	```

	```{r}
	# chaining method
	(x1-x2)^2 %>% sum() %>% sqrt()
	```


	## arrange: Reorder rows

	```{r results='hide'}
	# base R approach to select UniqueCarrier and DepDelay columns and sort by DepDelay
	flights[order(flights$DepDelay), c("UniqueCarrier", "DepDelay")]
	```

	```{r}
	# dplyr approach
	flights %>%
	select(UniqueCarrier, DepDelay) %>%
	arrange(DepDelay)
	```

	```{r results='hide'}
	# use `desc` for descending
	flights %>%
	select(UniqueCarrier, DepDelay) %>%
	arrange(desc(DepDelay))
	```


	## mutate: Add new variables

	* Create new variables that are functions of existing variables

	```{r results='hide'}
	# base R approach to create a new variable Speed (in mph)
	flights$Speed <- flights$Distance / flights$AirTime*60
	flights[, c("Distance", "AirTime", "Speed")]
	```

	```{r}
	# dplyr approach (prints the new variable but does not store it)
	flights %>%
	select(Distance, AirTime) %>%
	mutate(Speed = Distance/AirTime*60)

	# store the new variable
	flights <- flights %>% mutate(Speed = Distance/AirTime*60)
	```


	## summarise: Reduce variables to values

	* Primarily useful with data that has been grouped by one or more variables
	* `group_by` creates the groups that will be operated on
	* `summarise` uses the provided aggregation function to summarise each group

	```{r results='hide'}
	# base R approaches to calculate the average arrival delay to each destination
	head(with(flights, tapply(ArrDelay, Dest, mean, na.rm=TRUE)))
	head(aggregate(ArrDelay ~ Dest, flights, mean))
	```

	```{r}
	# dplyr approach: create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay
	flights %>%
	group_by(Dest) %>%
	summarise(avg_delay = mean(ArrDelay, na.rm=TRUE))
	```

	* `summarise_each` allows you to apply the same summary function to multiple columns at once
	* Note: `mutate_each` is also available

	```{r}
	# for each carrier, calculate the percentage of flights cancelled or diverted
	flights %>%
	group_by(UniqueCarrier) %>%
	summarise_each(funs(mean), Cancelled, Diverted)

	# for each carrier, calculate the minimum and maximum arrival and departure delays
	flights %>%
	group_by(UniqueCarrier) %>%
	summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)), matches("Delay"))
	```

	* Helper function `n()` counts the number of rows in a group
	* Helper function `n_distinct(vector)` counts the number of unique items in that vector

	```{r}
	# for each day of the year, count the total number of flights and sort in descending order
	flights %>%
	group_by(Month, DayofMonth) %>%
	summarise(flight_count = n()) %>%
	arrange(desc(flight_count))

	# rewrite more simply with the `tally` function
	flights %>%
	group_by(Month, DayofMonth) %>%
	tally(sort = TRUE)

	# for each destination, count the total number of flights and the number of distinct planes that flew there
	flights %>%
	group_by(Dest) %>%
	summarise(flight_count = n(), plane_count = n_distinct(TailNum))
	```

	* Grouping can sometimes be useful without summarising

	```{r}
	# for each destination, show the number of cancelled and not cancelled flights
	flights %>%
	group_by(Dest) %>%
	select(Cancelled) %>%
	table() %>%
	head()
	```


	## Window Functions

	* Aggregation function (like `mean`) takes n inputs and returns 1 value
	* [Window function](http://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html) takes n inputs and returns n values
	* Includes ranking and ordering functions (like `min_rank`), offset functions (`lead` and `lag`), and cumulative aggregates (like `cummean`).

	```{r results='hide'}
	# for each carrier, calculate which two days of the year they had their longest departure delays
	# note: smallest (not largest) value is ranked as 1, so you have to use `desc` to rank by largest value
	flights %>%
	group_by(UniqueCarrier) %>%
	select(Month, DayofMonth, DepDelay) %>%
	filter(min_rank(desc(DepDelay)) <= 2) %>%
	arrange(UniqueCarrier, desc(DepDelay))
	```

	```{r}
	# rewrite more simply with the `top_n` function
	flights %>%
	group_by(UniqueCarrier) %>%
	select(Month, DayofMonth, DepDelay) %>%
	top_n(2) %>%
	arrange(UniqueCarrier, desc(DepDelay))

	# for each month, calculate the number of flights and the change from the previous month
	flights %>%
	group_by(Month) %>%
	summarise(flight_count = n()) %>%
	mutate(change = flight_count - lag(flight_count))

	# rewrite more simply with the `tally` function
	flights %>%
	group_by(Month) %>%
	tally() %>%
	mutate(change = n - lag(n))
	```


	## Other Useful Convenience Functions

	```{r}
	# randomly sample a fixed number of rows, without replacement
	flights %>% sample_n(5)

	# randomly sample a fraction of rows, with replacement
	flights %>% sample_frac(0.25, replace=TRUE)

	# base R approach to view the structure of an object
	str(flights)

	# dplyr approach: better formatting, and adapts to your screen width
	glimpse(flights)
	```


	## Connecting to Databases

	* dplyr can connect to a database as if the data was loaded into a data frame
	* Use the same syntax for local data frames and databases
	* Only generates SELECT statements
	* Currently supports SQLite, PostgreSQL/Redshift, MySQL/MariaDB, BigQuery, MonetDB
	* Example below is based upon an SQLite database containing the hflights data
	* Instructions for creating this database are in the [databases vignette](http://cran.r-project.org/web/packages/dplyr/vignettes/databases.html)

	```{r}
	# connect to an SQLite database containing the hflights data
	my_db <- src_sqlite("my_db.sqlite3")

	# connect to the "hflights" table in that database
	flights_tbl <- tbl(my_db, "hflights")

	# example query with our data frame
	flights %>%
	select(UniqueCarrier, DepDelay) %>%
	arrange(desc(DepDelay))

	# identical query using the database
	flights_tbl %>%
	select(UniqueCarrier, DepDelay) %>%
	arrange(desc(DepDelay))
	```

	* You can write the SQL commands yourself
	* dplyr can tell you the SQL it plans to run and the query execution plan

	```{r}
	# send SQL commands to the database
	tbl(my_db, sql("SELECT * FROM hflights LIMIT 100"))

	# ask dplyr for the SQL commands
	flights_tbl %>%
	select(UniqueCarrier, DepDelay) %>%
	arrange(desc(DepDelay)) %>%
	explain()
	```


	## Resources

	* [Official dplyr reference manual and vignettes on CRAN](http://cran.r-project.org/web/packages/dplyr/index.html): vignettes are well-written and cover many aspects of dplyr
	* [July 2014 webinar about dplyr (and ggvis) by Hadley Wickham](http://pages.rstudio.net/Webinar-Series-Recording-Essential-Tools-for-R.html) and related [slides/code](https://github.com/rstudio/webinars/tree/master/2014-01): mostly conceptual, with a bit of code
	* [dplyr tutorial by Hadley Wickham](https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a) at the [useR! 2014 conference](http://user2014.stat.ucla.edu/): excellent, in-depth tutorial with lots of example code (Dropbox link includes slides, code files, and data files)
	* [dplyr GitHub repo](https://github.com/hadley/dplyr) and [list of releases](https://github.com/hadley/dplyr/releases)


	< END OF DOCUMENT >