Created
March 15, 2015 12:42
-
-
Save swayson/7ec1b23fca948afd71e0 to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Introduction to dplyr for Faster Data Manipulation in R" | |
output: html_document | |
--- | |
Note: There is a 40-minute [video tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc) on YouTube that walks through this document in detail. | |
## Why do I use dplyr? | |
* Great for data exploration and transformation | |
* Intuitive to write and easy to read, especially when using the "chaining" syntax (covered below) | |
* Fast on data frames | |
## dplyr functionality | |
* Five basic verbs: `filter`, `select`, `arrange`, `mutate`, `summarise` (plus `group_by`) | |
* Can work with data stored in databases and [data tables](http://datatable.r-forge.r-project.org/) | |
* Joins: inner join, left join, semi-join, anti-join (not covered below) | |
* Window functions for calculating ranking, offsets, and more | |
* [Better than plyr](http://blog.rstudio.org/2014/01/17/introducing-dplyr/) if you're only working with data frames (though it doesn't yet duplicate all of the plyr functionality) | |
* Examples below are based upon the [latest release](https://github.com/hadley/dplyr/releases), version 0.2 (released May 2014) | |
## Loading dplyr and an example dataset | |
* dplyr will mask a few base functions | |
* If you also use plyr, load plyr first | |
* hflights is flights departing from two Houston airports in 2011 | |
```{r} | |
# load packages | |
suppressMessages(library(dplyr)) | |
library(hflights) | |
# explore data | |
data(hflights) | |
head(hflights) | |
``` | |
* `tbl_df` creates a "local data frame" | |
* Local data frame is simply a wrapper for a data frame that prints nicely | |
```{r} | |
# convert to local data frame | |
flights <- tbl_df(hflights) | |
# printing only shows 10 rows and as many columns as can fit on your screen | |
flights | |
``` | |
```{r results='hide'} | |
# you can specify that you want to see more rows | |
print(flights, n=20) | |
# convert to a normal data frame to see all of the columns | |
data.frame(head(flights)) | |
``` | |
## filter: Keep rows matching criteria | |
* Base R approach to filtering forces you to repeat the data frame's name | |
* dplyr approach is simpler to write and read | |
* Command structure (for all dplyr verbs): | |
* first argument is a data frame | |
* return value is a data frame | |
* nothing is modified in place | |
* Note: dplyr generally does not preserve row names | |
```{r results='hide'} | |
# base R approach to view all flights on January 1 | |
flights[flights$Month==1 & flights$DayofMonth==1, ] | |
``` | |
```{r} | |
# dplyr approach | |
# note: you can use comma or ampersand to represent AND condition | |
filter(flights, Month==1, DayofMonth==1) | |
# use pipe for OR condition | |
filter(flights, UniqueCarrier=="AA" | UniqueCarrier=="UA") | |
``` | |
```{r results='hide'} | |
# you can also use %in% operator | |
filter(flights, UniqueCarrier %in% c("AA", "UA")) | |
``` | |
## select: Pick columns by name | |
* Base R approach is awkward to type and to read | |
* dplyr approach uses similar syntax to `filter` | |
* Like a SELECT in SQL | |
```{r results='hide'} | |
# base R approach to select DepTime, ArrTime, and FlightNum columns | |
flights[, c("DepTime", "ArrTime", "FlightNum")] | |
``` | |
```{r} | |
# dplyr approach | |
select(flights, DepTime, ArrTime, FlightNum) | |
# use colon to select multiple contiguous columns, and use `contains` to match columns by name | |
# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name | |
select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay")) | |
``` | |
## "Chaining" or "Pipelining" | |
* Usual way to perform multiple operations in one line is by nesting | |
* Can write commands in a natural order by using the `%>%` infix operator (which can be pronounced as "then") | |
```{r results='hide'} | |
# nesting method to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes | |
filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60) | |
``` | |
```{r} | |
# chaining method | |
flights %>% | |
select(UniqueCarrier, DepDelay) %>% | |
filter(DepDelay > 60) | |
``` | |
* Chaining increases readability significantly when there are many commands | |
* Operator is automatically imported from the [magrittr](https://github.com/smbache/magrittr) package | |
* Can be used to replace nesting in R commands outside of dplyr | |
```{r results='hide'} | |
# create two vectors and calculate Euclidian distance between them | |
x1 <- 1:5; x2 <- 2:6 | |
sqrt(sum((x1-x2)^2)) | |
``` | |
```{r} | |
# chaining method | |
(x1-x2)^2 %>% sum() %>% sqrt() | |
``` | |
## arrange: Reorder rows | |
```{r results='hide'} | |
# base R approach to select UniqueCarrier and DepDelay columns and sort by DepDelay | |
flights[order(flights$DepDelay), c("UniqueCarrier", "DepDelay")] | |
``` | |
```{r} | |
# dplyr approach | |
flights %>% | |
select(UniqueCarrier, DepDelay) %>% | |
arrange(DepDelay) | |
``` | |
```{r results='hide'} | |
# use `desc` for descending | |
flights %>% | |
select(UniqueCarrier, DepDelay) %>% | |
arrange(desc(DepDelay)) | |
``` | |
## mutate: Add new variables | |
* Create new variables that are functions of existing variables | |
```{r results='hide'} | |
# base R approach to create a new variable Speed (in mph) | |
flights$Speed <- flights$Distance / flights$AirTime*60 | |
flights[, c("Distance", "AirTime", "Speed")] | |
``` | |
```{r} | |
# dplyr approach (prints the new variable but does not store it) | |
flights %>% | |
select(Distance, AirTime) %>% | |
mutate(Speed = Distance/AirTime*60) | |
# store the new variable | |
flights <- flights %>% mutate(Speed = Distance/AirTime*60) | |
``` | |
## summarise: Reduce variables to values | |
* Primarily useful with data that has been grouped by one or more variables | |
* `group_by` creates the groups that will be operated on | |
* `summarise` uses the provided aggregation function to summarise each group | |
```{r results='hide'} | |
# base R approaches to calculate the average arrival delay to each destination | |
head(with(flights, tapply(ArrDelay, Dest, mean, na.rm=TRUE))) | |
head(aggregate(ArrDelay ~ Dest, flights, mean)) | |
``` | |
```{r} | |
# dplyr approach: create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay | |
flights %>% | |
group_by(Dest) %>% | |
summarise(avg_delay = mean(ArrDelay, na.rm=TRUE)) | |
``` | |
* `summarise_each` allows you to apply the same summary function to multiple columns at once | |
* Note: `mutate_each` is also available | |
```{r} | |
# for each carrier, calculate the percentage of flights cancelled or diverted | |
flights %>% | |
group_by(UniqueCarrier) %>% | |
summarise_each(funs(mean), Cancelled, Diverted) | |
# for each carrier, calculate the minimum and maximum arrival and departure delays | |
flights %>% | |
group_by(UniqueCarrier) %>% | |
summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)), matches("Delay")) | |
``` | |
* Helper function `n()` counts the number of rows in a group | |
* Helper function `n_distinct(vector)` counts the number of unique items in that vector | |
```{r} | |
# for each day of the year, count the total number of flights and sort in descending order | |
flights %>% | |
group_by(Month, DayofMonth) %>% | |
summarise(flight_count = n()) %>% | |
arrange(desc(flight_count)) | |
# rewrite more simply with the `tally` function | |
flights %>% | |
group_by(Month, DayofMonth) %>% | |
tally(sort = TRUE) | |
# for each destination, count the total number of flights and the number of distinct planes that flew there | |
flights %>% | |
group_by(Dest) %>% | |
summarise(flight_count = n(), plane_count = n_distinct(TailNum)) | |
``` | |
* Grouping can sometimes be useful without summarising | |
```{r} | |
# for each destination, show the number of cancelled and not cancelled flights | |
flights %>% | |
group_by(Dest) %>% | |
select(Cancelled) %>% | |
table() %>% | |
head() | |
``` | |
## Window Functions | |
* Aggregation function (like `mean`) takes n inputs and returns 1 value | |
* [Window function](http://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html) takes n inputs and returns n values | |
* Includes ranking and ordering functions (like `min_rank`), offset functions (`lead` and `lag`), and cumulative aggregates (like `cummean`). | |
```{r results='hide'} | |
# for each carrier, calculate which two days of the year they had their longest departure delays | |
# note: smallest (not largest) value is ranked as 1, so you have to use `desc` to rank by largest value | |
flights %>% | |
group_by(UniqueCarrier) %>% | |
select(Month, DayofMonth, DepDelay) %>% | |
filter(min_rank(desc(DepDelay)) <= 2) %>% | |
arrange(UniqueCarrier, desc(DepDelay)) | |
``` | |
```{r} | |
# rewrite more simply with the `top_n` function | |
flights %>% | |
group_by(UniqueCarrier) %>% | |
select(Month, DayofMonth, DepDelay) %>% | |
top_n(2) %>% | |
arrange(UniqueCarrier, desc(DepDelay)) | |
# for each month, calculate the number of flights and the change from the previous month | |
flights %>% | |
group_by(Month) %>% | |
summarise(flight_count = n()) %>% | |
mutate(change = flight_count - lag(flight_count)) | |
# rewrite more simply with the `tally` function | |
flights %>% | |
group_by(Month) %>% | |
tally() %>% | |
mutate(change = n - lag(n)) | |
``` | |
## Other Useful Convenience Functions | |
```{r} | |
# randomly sample a fixed number of rows, without replacement | |
flights %>% sample_n(5) | |
# randomly sample a fraction of rows, with replacement | |
flights %>% sample_frac(0.25, replace=TRUE) | |
# base R approach to view the structure of an object | |
str(flights) | |
# dplyr approach: better formatting, and adapts to your screen width | |
glimpse(flights) | |
``` | |
## Connecting to Databases | |
* dplyr can connect to a database as if the data was loaded into a data frame | |
* Use the same syntax for local data frames and databases | |
* Only generates SELECT statements | |
* Currently supports SQLite, PostgreSQL/Redshift, MySQL/MariaDB, BigQuery, MonetDB | |
* Example below is based upon an SQLite database containing the hflights data | |
* Instructions for creating this database are in the [databases vignette](http://cran.r-project.org/web/packages/dplyr/vignettes/databases.html) | |
```{r} | |
# connect to an SQLite database containing the hflights data | |
my_db <- src_sqlite("my_db.sqlite3") | |
# connect to the "hflights" table in that database | |
flights_tbl <- tbl(my_db, "hflights") | |
# example query with our data frame | |
flights %>% | |
select(UniqueCarrier, DepDelay) %>% | |
arrange(desc(DepDelay)) | |
# identical query using the database | |
flights_tbl %>% | |
select(UniqueCarrier, DepDelay) %>% | |
arrange(desc(DepDelay)) | |
``` | |
* You can write the SQL commands yourself | |
* dplyr can tell you the SQL it plans to run and the query execution plan | |
```{r} | |
# send SQL commands to the database | |
tbl(my_db, sql("SELECT * FROM hflights LIMIT 100")) | |
# ask dplyr for the SQL commands | |
flights_tbl %>% | |
select(UniqueCarrier, DepDelay) %>% | |
arrange(desc(DepDelay)) %>% | |
explain() | |
``` | |
## Resources | |
* [Official dplyr reference manual and vignettes on CRAN](http://cran.r-project.org/web/packages/dplyr/index.html): vignettes are well-written and cover many aspects of dplyr | |
* [July 2014 webinar about dplyr (and ggvis) by Hadley Wickham](http://pages.rstudio.net/Webinar-Series-Recording-Essential-Tools-for-R.html) and related [slides/code](https://github.com/rstudio/webinars/tree/master/2014-01): mostly conceptual, with a bit of code | |
* [dplyr tutorial by Hadley Wickham](https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a) at the [useR! 2014 conference](http://user2014.stat.ucla.edu/): excellent, in-depth tutorial with lots of example code (Dropbox link includes slides, code files, and data files) | |
* [dplyr GitHub repo](https://github.com/hadley/dplyr) and [list of releases](https://github.com/hadley/dplyr/releases) | |
< END OF DOCUMENT > |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment