# Data visualisation in R

```{r echo=FALSE, message=FALSE}
# Preamble.

# To set the options below.
library(knitr)
# For pretty-printed tables.
library(pander)

panderOptions('table.split.table', Inf)
panderOptions('table.alignment.default',
              function (df) ifelse(sapply(df, is.numeric), 'right', 'left'))
panderOptions('table.alignment.rownames', 'left')

# Enable automatic table reformatting.
opts_chunk$set(render = function (object, ...) {
    if (pander_supported(object))
        pander(object, style = 'rmarkdown')
    else if (isS4(object))
        show(object)
    else
        print(object)
})

pander_supported = function (object)
    UseMethod('pander_supported')

pander_supported.default = function (object)
    any(class(object) %in% sub('^pander\\.', '', methods('pander')))

pander.table = function (x, ...)
    pander(`rownames<-`(rbind(x), NULL), ...)

# Helpers for dplyr tables

is.tbl_df = function (x)
    inherits(x, 'tbl_df')

pander.tbl_df = function (x, ...)
    pander(trunc_mat(x), ...)

# Copied from dplyr:::print.trunc_mat
pander.trunc_mat = function (x, ...) {
    if (! is.null(x$table))
        pander(x$table, ...)

    if (length(x$extra) > 0) {
        var_types = paste0(names(x$extra), ' (', x$extra, ')', collapse = ', ')
        pander(dplyr:::wrap('Variables not shown: ', var_types))
    }
}

# Disable code re-formatting.
opts_chunk$set(tidy = FALSE)

ggplot2::theme_set(ggplot2::theme_bw())
```

## Introduction

In this practical we are going to explore a few simple techniques for data
visualisation. Tasks for you to complete during the practical are formatted

> like this.

This course uses a data set that ships with R, called `iris`. It contains
measurement data from different flowers. More information about the data set
can be found by typing `?iris` in R.

## Inspecting the data

Before we can visualise anything, we need to know the shape of the data.

> Inspect the data by outputting just the first few rows of it:

```{r echo=FALSE}
head(iris)
```
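
Beyond `head`, two other base R functions give a quick overview of a data
frame. The following is a minimal sketch and not part of the original tasks;
it only assumes the built-in `iris` data set:

```{r}
# Column types and the first few values of each column.
str(iris)

# Per-column summary statistics; factor columns are tabulated instead.
summary(iris)
```

Of the two, `str` is the quicker way to tell numeric columns from categorical
ones.
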
## Base R plotting

Base R provides a few graphics utilities, among them `plot`, which, at its
most basic, creates a scatter or line plot.

> To warm up, simply plot the `Petal.Width` against the `Petal.Length`:

```{r echo=FALSE}
plot(Petal.Width ~ Petal.Length, iris)
```

Clearly, the data contains some underlying structure. We now immediately turn
to more powerful visualisation tools rather than relying on base R. However,
bear in mind that base R is fully capable of doing this kind of analysis; the
other tools are simply more convenient to use.

## ggplot2

‹ggplot2› is an R package that implements a fundamentally different approach to
plotting data: plot creation is separated into several logically distinct steps
that can be combined at will.

1. Definition of the data to plot
2. Definition of the “aesthetics” (= mapping of variables to plot components)
3. Definition of the type(s) of visualisation layers
4. Fine-tuning, such as modifying axis titles etc.

> First, load the necessary library:
>
> ```{r message=FALSE}
> library(ggplot2)
> ```

Next, let’s reproduce the previous plot. As mentioned before, this requires
several steps:

1. **Definition of the plot data:**

    ```{r eval=FALSE}
    ggplot(iris)
    ```

2. **Definition of the aesthetics:**

    ```{r eval=FALSE}
    aes(x = Petal.Length, y = Petal.Width)
    ```

3. **Definition of the type of visualisation layer:**

    ```{r eval=FALSE}
    geom_point() # for a simple scatter plot
    ```

4. **Fine-tuning**, which we don’t need for now.

These components are combined by the use of the `+` operator.

> Try it:

```{r echo=FALSE}
ggplot(iris) +
    aes(x = Petal.Length, y = Petal.Width) +
    geom_point()
```
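
Because the components are ordinary R objects, a partial plot can be stored in
a variable and extended later. The sketch below also illustrates the
fine-tuning step with `labs`; the variable name `p` and the axis titles are
chosen here for illustration (`?iris` states that the measurements are in
centimetres):

```{r}
library(ggplot2)

# Store the data and the aesthetics once ...
p = ggplot(iris) +
    aes(x = Petal.Length, y = Petal.Width)

# ... then add a layer and, as fine-tuning, more readable axis titles.
p + geom_point() +
    labs(x = 'Petal length (cm)', y = 'Petal width (cm)')
```

Storing the partial plot is convenient when several variants of the same plot
are compared, since only the differing layers need to be written out.
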
Now, back to the analysis of the data. To investigate structure in the data,
it makes sense to augment the plot with other information we have about the
data. Once we’ve identified this information, we can add the corresponding
variables to the plot in a suitable fashion.

> Inspect the data layout again. Identify a categorical variable that may be
> suitable to explain the structure we’ve observed.

Once you’ve identified a potentially explanatory variable, you can add it to
the plot in a variety of ways, for instance as shapes or colours, by adding
`shape = ‹variable›` or `color = ‹variable›` to the `aes` declaration.

> Add colours based on the categorical variable you’ve identified.

```{r echo=FALSE}
ggplot(iris) +
    aes(x = Petal.Length, y = Petal.Width, color = Species) +
    geom_point(show.legend = FALSE)
```
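
For comparison, the same coloured scatter plot can be produced in base R, as
claimed earlier. This is a minimal sketch rather than part of the practical;
the legend position and point symbol are arbitrary choices:

```{r}
# `Species` is a factor; `col` uses its integer codes (1, 2, 3) as indices
# into the current colour palette.
plot(Petal.Width ~ Petal.Length, iris, col = iris$Species)

# Base R does not draw a legend automatically, so one is added by hand.
legend('topleft', legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 1)
```
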
## Inspecting data further

In the above case, it was possible to identify different clusters based on
species identity simply by plotting two variables. Often, it is necessary to
tease this information out first.

One of the most basic tools for this is principal component analysis (PCA).
PCA rotates our data in n-dimensional space such that as much of the variance
as possible lies along the first dimension, as much of the remainder as
possible along the second, and so on. Thus, plotting the first two dimensions
of the rotated data is expected to show most of the structural variability.

While this is especially useful for multidimensional data that we otherwise
cannot plot in 2D, it also helps with 2D data. Let’s apply a PCA to our toy
data:

```{r}
pca = prcomp(~ Petal.Length + Petal.Width, iris)
```

The rotated data is available in `pca$x`.

> Inspect the data layout of `pca$x` to find out which variables it contains.

> Plot the rotated data in a scatter plot, with PC1 on the x-axis and PC2 on
> the y-axis.

> You will receive an error. How can you fix this error? Hint: try coercing
> the data into the desired format, which is `data.frame`.

```{r echo=FALSE}
ggplot(as.data.frame(pca$x)) +
    aes(x = PC1, y = PC2) +
    geom_point()
```

Now we’ve lost the colour. To add it back in, we need to bind the required
information to the rotated coordinates:

> ```{r}
> pc_data = cbind(iris, pca$x)
> ```

```{r echo=FALSE}
head(pc_data)
```

> Plot the rotated data with colours for the categorical variable added again.

```{r echo=FALSE}
ggplot(pc_data) +
    aes(x = PC1, y = PC2, color = Species) +
    geom_point(show.legend = FALSE)
```
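
To make the notion of “rotation” concrete, the following sketch checks two
properties of the `prcomp` result: how much of the variance each component
carries, and that `pca$x` is the centred input multiplied by the rotation
matrix. The name `centred` is introduced here for illustration; with the
default arguments, `prcomp` centres the data but does not rescale it.

```{r}
pca = prcomp(~ Petal.Length + Petal.Width, iris)

# Proportion of the total variance captured by each principal component.
pca$sdev ^ 2 / sum(pca$sdev ^ 2)

# The rotated data is the centred input multiplied by the rotation matrix.
centred = scale(iris[, c('Petal.Length', 'Petal.Width')], scale = FALSE)
all.equal(unname(centred %*% pca$rotation), unname(pca$x))
```
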
## Separating data

It seems clear that petal width and length correlate quite highly. However,
this correlation may be mainly driven by species identity, rather than by a
correlation within each species. There are different ways of teasing the
information apart.

One is to plot the points separately for each species:

```{r}
ggplot(iris) +
    aes(x = Petal.Length, y = Petal.Width, color = Species) +
    geom_point(show.legend = FALSE) +
    facet_wrap(~ Species, scales = 'free')
```

Another is to fit a separate linear model for each species:

```{r}
ggplot(iris) +
    aes(x = Petal.Length, y = Petal.Width, color = Species) +
    geom_point(show.legend = FALSE) +
    geom_smooth(method = lm)
```

## Transforming the data

Some of the correlation structure is still present within each species. Let’s
inspect how much, for the case of “setosa”. This requires more tools. Let’s
load them.

> ```{r message=FALSE}
> library(dplyr)
> ```

> Use ‹dplyr› to extract the “setosa” data points from the data frame.

```{r echo=FALSE}
(setosa = iris %>% as_tibble() %>% filter(Species == 'setosa'))
```

```{r}
ggplot(setosa) +
    aes(x = Petal.Length, y = Petal.Width) +
    geom_point() +
    geom_smooth(method = lm)
```

To see how well the variables correlate, we can calculate a correlation
coefficient:

```{r}
cor(setosa$Petal.Length, setosa$Petal.Width)
```

Or a more comprehensive linear model:

```{r}
summary(lm(Petal.Width ~ Petal.Length, setosa))
```
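
The same correlation can be computed for all species at once instead of
filtering one species at a time. This is a minimal sketch using ‹dplyr›’s
`group_by` and `summarise`; the names `per_species` and `correlation` are
chosen here for illustration:

```{r}
library(dplyr)

# One correlation coefficient per species.
per_species = iris %>%
    group_by(Species) %>%
    summarise(correlation = cor(Petal.Length, Petal.Width))

per_species
```

Comparing these values with the correlation over the whole data set shows how
much of the overall correlation is driven by species identity.
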