Created
March 15, 2015 12:43
-
-
Save swayson/362025719f1841512280 to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: 'Going deeper with dplyr: New features in 0.3 and 0.4' | |
output: html_document | |
--- | |
## Introduction | |
In August 2014, I created a [40-minute video tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc) introducing the key functionality of the dplyr package in R, using dplyr version 0.2. Since then, there have been two significant updates to dplyr (0.3 and 0.4), introducing a ton of new features. | |
This document (created in March 2015) covers the most useful new features in 0.3 and 0.4, as well as other functionality that I didn't cover last time (though it is not necessarily new). My [new video tutorial](https://www.youtube.com/watch?v=2mh1PqfsXVI) walks through the code below in detail. | |
**If you have not watched the [previous tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc)**, I recommend you do so first since it covers some dplyr basics that will not be covered in this tutorial. | |
## Loading dplyr and the nycflights13 dataset | |
Although my last tutorial used data from the hflights package, Hadley Wickham has rewritten the [dplyr vignettes](http://cran.r-project.org/web/packages/dplyr/index.html) to use the nycflights13 package instead, and so I'm also using nycflights13 for the sake of consistency. | |
```{r eval=FALSE} | |
# remove flights data if you just finished my previous tutorial | |
rm(flights) | |
``` | |
```{r} | |
# load packages | |
suppressMessages(library(dplyr)) | |
library(nycflights13) | |
# print the flights dataset from nycflights13 | |
flights | |
``` | |
## Choosing columns: select, rename | |
```{r} | |
# besides just using select() to pick columns... | |
flights %>% select(carrier, flight) | |
# ...you can use the minus sign to hide columns | |
flights %>% select(-month, -day) | |
``` | |
```{r results='hide'} | |
# hide a range of columns | |
flights %>% select(-(dep_time:arr_delay)) | |
# hide any column with a matching name | |
flights %>% select(-contains("time")) | |
``` | |
```{r} | |
# pick columns using a character vector of column names | |
cols <- c("carrier", "flight", "tailnum") | |
flights %>% select(one_of(cols)) | |
``` | |
```{r} | |
# select() can be used to rename columns, though all columns not mentioned are dropped | |
flights %>% select(tail = tailnum) | |
# rename() does the same thing, except all columns not mentioned are kept | |
flights %>% rename(tail = tailnum) | |
``` | |
## Choosing rows: filter, between, slice, sample_n, top_n, distinct | |
```{r} | |
# filter() supports the use of multiple conditions | |
flights %>% filter(dep_time >= 600, dep_time <= 605) | |
``` | |
```{r results='hide'} | |
# between() is a concise alternative for determing if numeric values fall in a range | |
flights %>% filter(between(dep_time, 600, 605)) | |
# side note: is.na() can also be useful when filtering | |
flights %>% filter(!is.na(dep_time)) | |
``` | |
```{r} | |
# slice() filters rows by position | |
flights %>% slice(1000:1005) | |
# keep the first three rows within each group | |
flights %>% group_by(month, day) %>% slice(1:3) | |
# sample three rows from each group | |
flights %>% group_by(month, day) %>% sample_n(3) | |
# keep three rows from each group with the top dep_delay | |
flights %>% group_by(month, day) %>% top_n(3, dep_delay) | |
# also sort by dep_delay within each group | |
flights %>% group_by(month, day) %>% top_n(3, dep_delay) %>% arrange(desc(dep_delay)) | |
``` | |
```{r} | |
# unique rows can be identified using unique() from base R | |
flights %>% select(origin, dest) %>% unique() | |
``` | |
```{r results='hide'} | |
# dplyr provides an alternative that is more "efficient" | |
flights %>% select(origin, dest) %>% distinct() | |
# side note: when chaining, you don't have to include the parentheses if there are no arguments | |
flights %>% select(origin, dest) %>% distinct | |
``` | |
## Adding new variables: mutate, transmute, add_rownames | |
```{r} | |
# mutate() creates a new variable (and keeps all existing variables) | |
flights %>% mutate(speed = distance/air_time*60) | |
# transmute() only keeps the new variables | |
flights %>% transmute(speed = distance/air_time*60) | |
``` | |
```{r} | |
# example data frame with row names | |
mtcars %>% head() | |
# add_rownames() turns row names into an explicit variable | |
mtcars %>% add_rownames("model") %>% head() | |
# side note: dplyr no longer prints row names (ever) for local data frames | |
mtcars %>% tbl_df() | |
``` | |
## Grouping and counting: summarise, tally, count, group_size, n_groups, ungroup | |
```{r} | |
# summarise() can be used to count the number of rows in each group | |
flights %>% group_by(month) %>% summarise(cnt = n()) | |
``` | |
```{r results='hide'} | |
# tally() and count() can do this more concisely | |
flights %>% group_by(month) %>% tally() | |
flights %>% count(month) | |
``` | |
```{r} | |
# you can sort by the count | |
flights %>% group_by(month) %>% summarise(cnt = n()) %>% arrange(desc(cnt)) | |
``` | |
```{r results='hide'} | |
# tally() and count() have a sort parameter for this purpose | |
flights %>% group_by(month) %>% tally(sort=TRUE) | |
flights %>% count(month, sort=TRUE) | |
``` | |
```{r} | |
# you can sum over a specific variable instead of simply counting rows | |
flights %>% group_by(month) %>% summarise(dist = sum(distance)) | |
``` | |
```{r results='hide'} | |
# tally() and count() have a wt parameter for this purpose | |
flights %>% group_by(month) %>% tally(wt = distance) | |
flights %>% count(month, wt = distance) | |
``` | |
```{r} | |
# group_size() returns the counts as a vector | |
flights %>% group_by(month) %>% group_size() | |
# n_groups() simply reports the number of groups | |
flights %>% group_by(month) %>% n_groups() | |
``` | |
```{r} | |
# group by two variables, summarise, arrange (output is possibly confusing) | |
flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% arrange(desc(cnt)) %>% print(n = 40) | |
# ungroup() before arranging to arrange across all groups | |
flights %>% group_by(month, day) %>% summarise(cnt = n()) %>% ungroup() %>% arrange(desc(cnt)) | |
``` | |
## Creating data frames: data_frame | |
`data_frame()` is a better way than `data.frame()` for creating data frames. Benefits of `data_frame()`: | |
* You can use previously defined columns to compute new columns. | |
* It never coerces column types. | |
* It never munges column names. | |
* It never adds row names. | |
* It only recycles length 1 input. | |
* It returns a local data frame (a tbl_df). | |
```{r} | |
# data_frame() example | |
data_frame(a = 1:6, b = a*2, c = 'string', 'd+e' = 1) %>% glimpse() | |
# data.frame() example | |
data.frame(a = 1:6, c = 'string', 'd+e' = 1) %>% glimpse() | |
``` | |
## Joining (merging) tables: left_join, right_join, inner_join, full_join, semi_join, anti_join | |
```{r} | |
# create two simple data frames | |
(a <- data_frame(color = c("green","yellow","red"), num = 1:3)) | |
(b <- data_frame(color = c("green","yellow","pink"), size = c("S","M","L"))) | |
# only include observations found in both "a" and "b" (automatically joins on variables that appear in both tables) | |
inner_join(a, b) | |
# include observations found in either "a" or "b" | |
full_join(a, b) | |
# include all observations found in "a" | |
left_join(a, b) | |
# include all observations found in "b" | |
right_join(a, b) | |
# right_join(a, b) is identical to left_join(b, a) except for column ordering | |
left_join(b, a) | |
# filter "a" to only show observations that match "b" | |
semi_join(a, b) | |
# filter "a" to only show observations that don't match "b" | |
anti_join(a, b) | |
``` | |
```{r} | |
# sometimes matching variables don't have identical names | |
b <- b %>% rename(col = color) | |
# specify that the join should occur by matching "color" in "a" with "col" in "b" | |
inner_join(a, b, by=c("color" = "col")) | |
``` | |
## Viewing more output: print, View | |
```{r} | |
# specify that you want to see more rows | |
flights %>% print(n = 15) | |
``` | |
```{r eval=FALSE} | |
# specify that you want to see ALL rows (don't run this!) | |
flights %>% print(n = Inf) | |
``` | |
```{r} | |
# specify that you want to see all columns | |
flights %>% print(width = Inf) | |
``` | |
```{r eval=FALSE} | |
# show up to 1000 rows and all columns | |
flights %>% View() | |
# set option to see all columns and fewer rows | |
options(dplyr.width = Inf, dplyr.print_min = 6) | |
# reset options (or just close R) | |
options(dplyr.width = NULL, dplyr.print_min = 10) | |
``` | |
## Resources | |
* Release announcements for [version 0.3](http://blog.rstudio.org/2014/10/13/dplyr-0-3-2/) and [version 0.4](http://blog.rstudio.org/2015/01/09/dplyr-0-4-0/) | |
* [dplyr reference manual and vignettes](http://cran.r-project.org/web/packages/dplyr/) | |
* [Two-table vignette](http://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html) covering joins and set operations | |
* [RStudio's Data Wrangling Cheat Sheet](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) for dplyr and tidyr | |
* [dplyr GitHub repo](https://github.com/hadley/dplyr) and [list of releases](https://github.com/hadley/dplyr/releases) | |
## Data School | |
* [Blog](http://www.dataschool.io/) | |
* [Email newsletter](http://www.dataschool.io/subscribe/) | |
* [YouTube channel](http://youtube.com/dataschool) | |
< END OF DOCUMENT > |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment