@ismayc
Last active April 21, 2016 17:56
Testing the new `caption.short` argument in `knitr::kable`
---
output: pdf_document
lot: true
---
# R Markdown Basics {#rmd-basics}
## Loading and exploring data
This template links to a file called `flights.csv`, a subset of a larger dataset of information about all flights that departed from Seattle and Portland in 2014. More information about this dataset and its **R** package is available at <http://github.com/ismayc/pnwflights14>. This subset includes only Portland flights and only complete rows with no missing values. The data were also merged with the `airports` and `airlines` datasets in the `pnwflights14` package to get more descriptive airport and airline names.
We can load in this data set using the following command:
```{r load_data}
flights <- read.csv("https://raw.githubusercontent.com/ismayc/reedtemplates/master/inst/rmarkdown/templates/reed_thesis/skeleton/data/flights.csv")
```
The data is now stored in the data frame called `flights` in **R**. To get a better feel for the variables included in this dataset we can use a variety of functions. Here we can see the dimensions (rows by columns) and also the names of the columns.
```{r str}
dim(flights)
names(flights)
```
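Beyond `dim()` and `names()`, base **R** functions such as `str()`, `head()`, and `summary()` are also handy for a first look. The chunk below is only a sketch: `toy_flights` is a hypothetical three-row stand-in (not part of the original data) so that the chunk runs without downloading anything. The same calls should work on `flights`.

```{r explore_sketch}
# Hypothetical stand-in for `flights`; the carrier names and delays are made up
toy_flights <- data.frame(
  carrier_name = c("Alpha Air", "Alpha Air", "Beta Jet"),
  arr_delay = c(12, -5, 40),
  stringsAsFactors = FALSE
)
# Column types plus a preview of the values in each column
str(toy_flights)
# First two rows in table form
head(toy_flights, 2)
# Minimum, quartiles, mean, and maximum of a numeric column
summary(toy_flights$arr_delay)
```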
Another good idea is to take a look at the dataset in table form. Because this dataset has more than 50,000 rows, we won't show the results of the command here. I recommend you enter the command into the Console **_after_** you have run the **R** chunks above to load the data into **R**.
```{r view_flights, eval = FALSE}
View(flights)
```
While not required, it is highly recommended that you use the `dplyr` package to manipulate and summarize your dataset as needed. Its syntax chains operations together with the `%>%` operator, which makes the code easy to read. Below I've created a few examples of using `dplyr` to get information about the Portland flights in 2014. You will also see the use of the `ggplot2` package, which produces beautiful, high-quality academic visuals.
We begin by checking to ensure that needed packages are installed and then we load them into our current working environment:
```{r load_pkgs, message = FALSE}
# List of packages required for this analysis
pkg <- c("dplyr", "ggplot2", "knitr", "devtools")
# Compare against the names of the installed packages and assign
# the names of the packages not installed to the variable new.pkg
new.pkg <- pkg[!(pkg %in% rownames(installed.packages()))]
# If there are any packages in the list that aren't installed,
# install them
if (length(new.pkg) > 0) {
  install.packages(new.pkg, repos = "http://cran.rstudio.com")
}
# Load packages
library(dplyr)
library(ggplot2)
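# Reinstall knitr from GitHub to test the new caption.short argument
devtools::install_github("yihui/knitr", force = TRUE)
library(knitr)
# Have kable() emit LaTeX table code
knitr::opts_knit$set(kable.force.latex = TRUE)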
```
The example we show here does the following:
- Selects only the `carrier_name` and `arr_delay` columns from the `flights` dataset and assigns this subset to a new variable called `flights2`.
- Uses `flights2` to determine the largest arrival delay for each of the carriers.
```{r max_delays}
flights2 <- flights %>% dplyr::select(carrier_name, arr_delay)
max_delays <- flights2 %>%
  group_by(carrier_name) %>%
  summarize(max_arr_delay = max(arr_delay, na.rm = TRUE))
```
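To see what the grouping step does on a small scale, here is a minimal sketch using a made-up data frame. The name `toy_delays` and its values are hypothetical and are not part of the flights data. It also shows why `na.rm = TRUE` matters: without it, a single missing delay would make that carrier's maximum `NA`.

```{r max_delays_sketch, message = FALSE}
library(dplyr)
# Hypothetical data: one carrier has a missing arrival delay
toy_delays <- data.frame(
  carrier_name = c("Alpha Air", "Alpha Air", "Beta Jet"),
  arr_delay = c(12, NA, 40),
  stringsAsFactors = FALSE
)
# One row per carrier, holding its largest non-missing arrival delay
toy_max <- toy_delays %>%
  group_by(carrier_name) %>%
  summarize(max_arr_delay = max(arr_delay, na.rm = TRUE))
toy_max
```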
We next introduce `kable`, a useful function in the `knitr` package for making nice tables in _R Markdown_. It produces the \LaTeX\ code required to make the table, which is much easier than entering values by hand or copying and pasting them into Excel or \LaTeX. This again goes to show how nice reproducible documents can be! (Note the use of `results = "asis"` here, which makes the chunk output the rendered table instead of the code to create the table. The `caption.short` argument supplies a shorter caption for the list of tables requested by `lot: true`. You'll learn more about the `\\label` later.)
```{r table_out, results = "asis"}
kable(max_delays, col.names = c("Airline", "Max Arrival Delay"),
      caption = "Maximum Delays by Airline \\label{tab:max_delay}",
      caption.short = "Max Delays by Airline")
```
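If you are curious what `kable` hands to \LaTeX, you can print the generated code rather than render it. The chunk below is a sketch under a few assumptions: `toy_table` and both caption strings are made up for illustration, `format = "latex"` is set explicitly so the result does not depend on the output format, and the installed version of `knitr` is assumed to be recent enough to accept `caption.short`.

```{r kable_source_sketch}
library(knitr)
# Hypothetical two-row table, unrelated to the flights data
toy_table <- data.frame(
  Airline = c("Alpha Air", "Beta Jet"),
  Delay = c(12, 40)
)
# Build the LaTeX code for the table, with a long and a short caption
toy_tex <- kable(toy_table, format = "latex",
                 caption = "A long toy caption for the table itself",
                 caption.short = "Short toy caption")
# Print the LaTeX source as plain text instead of rendering the table
cat(toy_tex, sep = "\n")
```

In the printed source, the short caption should appear as the optional argument of `\caption`, which is what \LaTeX\ uses for the list of tables.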
We can look further into the largest value here, the maximum arrival delay for American Airlines Inc. To do so, we can isolate the row corresponding to the arrival delay of 1539 minutes for American in our original `flights` dataset.
```{r max_props}
flights %>%
  dplyr::filter(arr_delay == 1539,
                carrier_name == "American Airlines Inc.") %>%
  dplyr::select(-c(month, day, carrier, dest_name, hour,
                   minute, carrier_name, arr_delay))
```
We see that the flight occurred on March 3rd and departed a little after 2 PM on its way to Dallas/Fort Worth. Lastly, we show how we can visualize the arrival delay of all departing flights from Portland on March 3rd against time of departure.
```{r march3plot, fig.height = 3, fig.width = 6}
flights %>%
  dplyr::filter(month == 3, day == 3) %>%
  ggplot(aes(x = dep_time, y = arr_delay)) +
  geom_point()
```
## Additional resources
- _Markdown_ Cheatsheet - <https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet>
- _R Markdown_ Reference Guide - <https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf>
- Introduction to `dplyr` - <https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html>
- `ggplot2` Documentation - <http://docs.ggplot2.org/current/>