Skip to content

Instantly share code, notes, and snippets.

@swayson
Created March 15, 2015 12:42
Show Gist options
  • Save swayson/7ec1b23fca948afd71e0 to your computer and use it in GitHub Desktop.
Save swayson/7ec1b23fca948afd71e0 to your computer and use it in GitHub Desktop.
---
title: "Introduction to dplyr for Faster Data Manipulation in R"
output: html_document
---
Note: There is a 40-minute [video tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc) on YouTube that walks through this document in detail.
## Why do I use dplyr?
* Great for data exploration and transformation
* Intuitive to write and easy to read, especially when using the "chaining" syntax (covered below)
* Fast on data frames
## dplyr functionality
* Five basic verbs: `filter`, `select`, `arrange`, `mutate`, `summarise` (plus `group_by`)
* Can work with data stored in databases and [data tables](http://datatable.r-forge.r-project.org/)
* Joins: inner join, left join, semi-join, anti-join (not covered below)
* Window functions for calculating ranking, offsets, and more
* [Better than plyr](http://blog.rstudio.org/2014/01/17/introducing-dplyr/) if you're only working with data frames (though it doesn't yet duplicate all of the plyr functionality)
* Examples below are based upon the [latest release](https://github.com/hadley/dplyr/releases), version 0.2 (released May 2014)
## Loading dplyr and an example dataset
* dplyr will mask a few base functions
* If you also use plyr, load plyr first
* hflights is flights departing from two Houston airports in 2011
```{r}
# load packages
suppressMessages(library(dplyr))
library(hflights)
# explore data
data(hflights)
head(hflights)
```
* `tbl_df` creates a "local data frame"
* Local data frame is simply a wrapper for a data frame that prints nicely
```{r}
# convert to local data frame
flights <- tbl_df(hflights)
# printing only shows 10 rows and as many columns as can fit on your screen
flights
```
```{r results='hide'}
# you can specify that you want to see more rows
print(flights, n=20)
# convert to a normal data frame to see all of the columns
data.frame(head(flights))
```
## filter: Keep rows matching criteria
* Base R approach to filtering forces you to repeat the data frame's name
* dplyr approach is simpler to write and read
* Command structure (for all dplyr verbs):
* first argument is a data frame
* return value is a data frame
* nothing is modified in place
* Note: dplyr generally does not preserve row names
```{r results='hide'}
# base R approach to view all flights on January 1
flights[flights$Month==1 & flights$DayofMonth==1, ]
```
```{r}
# dplyr approach
# note: you can use comma or ampersand to represent AND condition
filter(flights, Month==1, DayofMonth==1)
# use pipe for OR condition
filter(flights, UniqueCarrier=="AA" | UniqueCarrier=="UA")
```
```{r results='hide'}
# you can also use %in% operator
filter(flights, UniqueCarrier %in% c("AA", "UA"))
```
## select: Pick columns by name
* Base R approach is awkward to type and to read
* dplyr approach uses similar syntax to `filter`
* Like a SELECT in SQL
```{r results='hide'}
# base R approach to select DepTime, ArrTime, and FlightNum columns
flights[, c("DepTime", "ArrTime", "FlightNum")]
```
```{r}
# dplyr approach
select(flights, DepTime, ArrTime, FlightNum)
# use colon to select multiple contiguous columns, and use `contains` to match columns by name
# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name
select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
```
## "Chaining" or "Pipelining"
* Usual way to perform multiple operations in one line is by nesting
* Can write commands in a natural order by using the `%>%` infix operator (which can be pronounced as "then")
```{r results='hide'}
# nesting method to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes
filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60)
```
```{r}
# chaining method
flights %>%
select(UniqueCarrier, DepDelay) %>%
filter(DepDelay > 60)
```
* Chaining increases readability significantly when there are many commands
* Operator is automatically imported from the [magrittr](https://github.com/smbache/magrittr) package
* Can be used to replace nesting in R commands outside of dplyr
```{r results='hide'}
# create two vectors and calculate Euclidian distance between them
x1 <- 1:5; x2 <- 2:6
sqrt(sum((x1-x2)^2))
```
```{r}
# chaining method
(x1-x2)^2 %>% sum() %>% sqrt()
```
## arrange: Reorder rows
```{r results='hide'}
# base R approach to select UniqueCarrier and DepDelay columns and sort by DepDelay
flights[order(flights$DepDelay), c("UniqueCarrier", "DepDelay")]
```
```{r}
# dplyr approach
flights %>%
select(UniqueCarrier, DepDelay) %>%
arrange(DepDelay)
```
```{r results='hide'}
# use `desc` for descending
flights %>%
select(UniqueCarrier, DepDelay) %>%
arrange(desc(DepDelay))
```
## mutate: Add new variables
* Create new variables that are functions of existing variables
```{r results='hide'}
# base R approach to create a new variable Speed (in mph)
flights$Speed <- flights$Distance / flights$AirTime*60
flights[, c("Distance", "AirTime", "Speed")]
```
```{r}
# dplyr approach (prints the new variable but does not store it)
flights %>%
select(Distance, AirTime) %>%
mutate(Speed = Distance/AirTime*60)
# store the new variable
flights <- flights %>% mutate(Speed = Distance/AirTime*60)
```
## summarise: Reduce variables to values
* Primarily useful with data that has been grouped by one or more variables
* `group_by` creates the groups that will be operated on
* `summarise` uses the provided aggregation function to summarise each group
```{r results='hide'}
# base R approaches to calculate the average arrival delay to each destination
head(with(flights, tapply(ArrDelay, Dest, mean, na.rm=TRUE)))
head(aggregate(ArrDelay ~ Dest, flights, mean))
```
```{r}
# dplyr approach: create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay
flights %>%
group_by(Dest) %>%
summarise(avg_delay = mean(ArrDelay, na.rm=TRUE))
```
* `summarise_each` allows you to apply the same summary function to multiple columns at once
* Note: `mutate_each` is also available
```{r}
# for each carrier, calculate the percentage of flights cancelled or diverted
flights %>%
group_by(UniqueCarrier) %>%
summarise_each(funs(mean), Cancelled, Diverted)
# for each carrier, calculate the minimum and maximum arrival and departure delays
flights %>%
group_by(UniqueCarrier) %>%
summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)), matches("Delay"))
```
* Helper function `n()` counts the number of rows in a group
* Helper function `n_distinct(vector)` counts the number of unique items in that vector
```{r}
# for each day of the year, count the total number of flights and sort in descending order
flights %>%
group_by(Month, DayofMonth) %>%
summarise(flight_count = n()) %>%
arrange(desc(flight_count))
# rewrite more simply with the `tally` function
flights %>%
group_by(Month, DayofMonth) %>%
tally(sort = TRUE)
# for each destination, count the total number of flights and the number of distinct planes that flew there
flights %>%
group_by(Dest) %>%
summarise(flight_count = n(), plane_count = n_distinct(TailNum))
```
* Grouping can sometimes be useful without summarising
```{r}
# for each destination, show the number of cancelled and not cancelled flights
flights %>%
group_by(Dest) %>%
select(Cancelled) %>%
table() %>%
head()
```
## Window Functions
* Aggregation function (like `mean`) takes n inputs and returns 1 value
* [Window function](http://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html) takes n inputs and returns n values
* Includes ranking and ordering functions (like `min_rank`), offset functions (`lead` and `lag`), and cumulative aggregates (like `cummean`).
```{r results='hide'}
# for each carrier, calculate which two days of the year they had their longest departure delays
# note: smallest (not largest) value is ranked as 1, so you have to use `desc` to rank by largest value
flights %>%
group_by(UniqueCarrier) %>%
select(Month, DayofMonth, DepDelay) %>%
filter(min_rank(desc(DepDelay)) <= 2) %>%
arrange(UniqueCarrier, desc(DepDelay))
```
```{r}
# rewrite more simply with the `top_n` function
flights %>%
group_by(UniqueCarrier) %>%
select(Month, DayofMonth, DepDelay) %>%
top_n(2) %>%
arrange(UniqueCarrier, desc(DepDelay))
# for each month, calculate the number of flights and the change from the previous month
flights %>%
group_by(Month) %>%
summarise(flight_count = n()) %>%
mutate(change = flight_count - lag(flight_count))
# rewrite more simply with the `tally` function
flights %>%
group_by(Month) %>%
tally() %>%
mutate(change = n - lag(n))
```
## Other Useful Convenience Functions
```{r}
# randomly sample a fixed number of rows, without replacement
flights %>% sample_n(5)
# randomly sample a fraction of rows, with replacement
flights %>% sample_frac(0.25, replace=TRUE)
# base R approach to view the structure of an object
str(flights)
# dplyr approach: better formatting, and adapts to your screen width
glimpse(flights)
```
## Connecting to Databases
* dplyr can connect to a database as if the data was loaded into a data frame
* Use the same syntax for local data frames and databases
* Only generates SELECT statements
* Currently supports SQLite, PostgreSQL/Redshift, MySQL/MariaDB, BigQuery, MonetDB
* Example below is based upon an SQLite database containing the hflights data
* Instructions for creating this database are in the [databases vignette](http://cran.r-project.org/web/packages/dplyr/vignettes/databases.html)
```{r}
# connect to an SQLite database containing the hflights data
my_db <- src_sqlite("my_db.sqlite3")
# connect to the "hflights" table in that database
flights_tbl <- tbl(my_db, "hflights")
# example query with our data frame
flights %>%
select(UniqueCarrier, DepDelay) %>%
arrange(desc(DepDelay))
# identical query using the database
flights_tbl %>%
select(UniqueCarrier, DepDelay) %>%
arrange(desc(DepDelay))
```
* You can write the SQL commands yourself
* dplyr can tell you the SQL it plans to run and the query execution plan
```{r}
# send SQL commands to the database
tbl(my_db, sql("SELECT * FROM hflights LIMIT 100"))
# ask dplyr for the SQL commands
flights_tbl %>%
select(UniqueCarrier, DepDelay) %>%
arrange(desc(DepDelay)) %>%
explain()
```
## Resources
* [Official dplyr reference manual and vignettes on CRAN](http://cran.r-project.org/web/packages/dplyr/index.html): vignettes are well-written and cover many aspects of dplyr
* [July 2014 webinar about dplyr (and ggvis) by Hadley Wickham](http://pages.rstudio.net/Webinar-Series-Recording-Essential-Tools-for-R.html) and related [slides/code](https://github.com/rstudio/webinars/tree/master/2014-01): mostly conceptual, with a bit of code
* [dplyr tutorial by Hadley Wickham](https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a) at the [useR! 2014 conference](http://user2014.stat.ucla.edu/): excellent, in-depth tutorial with lots of example code (Dropbox link includes slides, code files, and data files)
* [dplyr GitHub repo](https://github.com/hadley/dplyr) and [list of releases](https://github.com/hadley/dplyr/releases)
< END OF DOCUMENT >
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment