klmr · November 26, 2015 11:22
diff --git a/practical.rmd b/practical.rmd
 # Data visualisation in R

 ```{r echo=FALSE, message=FALSE}
 # Preamble.

 # To set the options below.
 library(knitr)
 # For pretty-printed tables.
 library(pander)

 panderOptions('table.split.table', Inf)
 panderOptions('table.alignment.default',
              function (df) ifelse(sapply(df, is.numeric), 'right', 'left'))
 panderOptions('table.alignment.rownames', 'left')

 # Enable automatic table reformatting.
 opts_chunk$set(render = function (object, ...) {
    if (pander_supported(object))
        pander(object, style = 'rmarkdown')
    else if (isS4(object))
        show(object)
    else
        print(object)
 })

 pander_supported = function (object)
    UseMethod('pander_supported')

 pander_supported.default = function (object)
    any(class(object) %in% sub('^pander\\.', '', methods('pander')))

 pander.table = function (x, ...)
    pander(`rownames<-`(rbind(x), NULL), ...)

 # Helpers for dplyr tables

 is.tbl_df = function (x)
    inherits(x, 'tbl_df')

 pander.tbl_df = function (x, ...)
    pander(trunc_mat(x), ...)

 # Copied from dplyr:::print.trunc_mat
 pander.trunc_mat = function (x, ...) {
    if (! is.null(x$table))
        pander(x$table, ...)

    if (length(x$extra) > 0) {
        var_types = paste0(names(x$extra), ' (', x$extra, ')', collapse = ', ')
        pander(dplyr:::wrap('Variables not shown: ', var_types))
    }
 }

 # Disable code re-formatting.
 opts_chunk$set(tidy = FALSE)

 ggplot2::theme_set(ggplot2::theme_bw())
 ```

 ## Introduction

 In this practical we are going to explore a few simple techniques for data
 visualisation. Tasks to be completed by you during the course of the practical
 are going to be formatted

 > like this.

 This course will be using a data set that ships with R, called `iris`. It
 contains measurement data from different flowers. More information about the
 data set can be found by typing `?iris` in R.

 ## Inspecting the data

 Before we can visualise anything, we need to know the shape of the data.

 > Inspect the data by outputting just the first few rows of it:

 ```{r echo=FALSE}
 head(iris)
 ```
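
 Beyond the first few rows, the dimensions and the column types are also worth
 a look. A minimal sketch using base R only; the variable names `dimensions`
 and `column_types` are introduced here purely for illustration:

 ```r
 # Number of rows and columns.
 dimensions = dim(iris)
 # Class of each column; a 'factor' column holds categorical values.
 column_types = sapply(iris, class)

 dimensions
 column_types
 # Compact overview of both at once.
 str(iris)
 ```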

 ## Base R plotting

 Base R provides a few graphics utilities, among them `plot`, which, at its
 most basic, creates a scatter or line plot.

 > To warm up, simply plot the `Petal.Width` against the `Petal.Length`:

 ```{r echo=FALSE}
 plot(Petal.Width ~ Petal.Length, iris)
 ```

 Clearly, the data contains some underlying structure. We now immediately turn
 to more powerful visualisation tools rather than relying on base R. However,
 bear in mind that base R is fully capable of doing this kind of analysis; the
 other tools are simply more convenient to use.

 ## ggplot2

 ‹ggplot2› is an R package that implements a fundamentally different approach to
 plotting data: Plot creation is separated into several logically distinct steps
 that can be combined at will.

 1. Definition of the data to plot
 2. Definition of the “aesthetics” (= mapping of variables to plot components)
 3. Definition of the type(s) of visualisation layers
 4. Fine-tuning, such as modifying axis titles.

 > First, load the necessary library:
 >
 > ```{r message=FALSE}
 > library(ggplot2)
 > ```

 Next, let’s reproduce the previous plot. As mentioned before, this requires
 several steps:

 1. **Definition of the plot data\:**

    ```{r eval=FALSE}
    ggplot(iris)
    ```

 2. **Definition of the aesthetics:**

    ```{r eval=FALSE}
    aes(x = Petal.Length, y = Petal.Width)
    ```

 3. **Definition of the type of visualisation layer:**

    ```{r eval=FALSE}
    geom_point() # for a simple scatter plot
    ```

 4. **Fine-tuning**, which we don’t need for now.

 These components are combined by the use of the `+` operator.
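
 As a sketch of how the pieces fit together, the following uses `mtcars`,
 another data set that ships with R, rather than `iris`. The variable names are
 illustrative only, and `labs` is just one possible example of the fine-tuning
 step:

 ```r
 library(ggplot2)

 # Each component is an ordinary R object and can be stored in a variable.
 plot_data = ggplot(mtcars)
 aesthetics = aes(x = wt, y = mpg)
 layer = geom_point()
 fine_tuning = labs(x = 'Weight (1000 lbs)', y = 'Miles per gallon')

 # `+` combines the components into a single plot object ...
 p = plot_data + aesthetics + layer + fine_tuning
 # ... which is drawn when it is printed.
 print(p)
 ```

 Storing components in variables is optional; it merely shows that a plot is
 assembled from independent parts.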

 > Try it:

 ```{r echo=FALSE}
 ggplot(iris) +
    aes(x = Petal.Length, y = Petal.Width) +
    geom_point()
 ```

 Now, back to the analysis of the data. To investigate structure in the data,
 it makes sense to augment the plot with other information we have about the
 data. Once we’ve identified suitable variables, we can add them to the plot in
 a suitable fashion.

 > Inspect the data layout again. Identify a categorical variable that may be
 > suitable to explain the structure we’ve observed.

 Once you’ve identified a potentially explanatory variable, you can add it to
 the plot in a variety of ways, for instance as shapes or colours, by adding
 `shape = ‹variable›` or `color = ‹variable›` to the `aes` declaration.
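
 The pattern might look as follows on the `mtcars` data set, which ships with R.
 This is only a sketch; the choice of `cyl` as the grouping variable is an
 assumption made for illustration:

 ```r
 library(ggplot2)

 # `cyl` (number of cylinders) is stored as a number; `factor` makes ggplot2
 # treat it as a categorical variable with one colour per level.
 p = ggplot(mtcars) +
     aes(x = wt, y = mpg, color = factor(cyl)) +
     geom_point()
 print(p)
 ```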

 > Add colours based on the categorical variable you’ve identified.

 ```{r echo=FALSE}
 ggplot(iris) +
    aes(x = Petal.Length, y = Petal.Width, color = Species) +
    geom_point(show.legend = FALSE)
 ```

 ## Inspecting data further

 In the above case, it was possible to identify different clusters based on
 species identity simply by plotting two variables. Often, it is necessary to
 tease this information out first.

 One of the most basic tools for this is principal component analysis (PCA).
 PCA rotates our data in n-dimensional space so that as much of the variance as
 possible lies along the first few dimensions. Thus, plotting the first two
 dimensions of the rotated data is expected to show most of the structural
 variability.

 While this is especially useful for multidimensional data that we otherwise
 cannot plot in 2D, it also helps with 2D data. Let’s apply a PCA to our toy
 data:

 ```{r}
 pca = prcomp(~ Petal.Length + Petal.Width, iris)
 ```

 The rotated data is available in `pca$x`.
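
 To make the rotation concrete, the sketch below reconstructs `pca$x` by hand
 and computes the share of the variance that each component carries. It assumes
 the default `prcomp` settings (centring, no scaling); `centred`, `rotated` and
 `variance_explained` are names introduced here for illustration.

 ```r
 pca = prcomp(~ Petal.Length + Petal.Width, iris)

 # Centre each variable on its mean, as `prcomp` does by default.
 centred = scale(iris[, c('Petal.Length', 'Petal.Width')],
                 center = TRUE, scale = FALSE)
 # Rotating the centred data by the rotation matrix yields the rotated data.
 rotated = centred %*% pca$rotation
 all.equal(unname(rotated), unname(pca$x))

 # Share of the total variance carried by each principal component.
 variance_explained = pca$sdev ^ 2 / sum(pca$sdev ^ 2)
 variance_explained
 ```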

 > Inspect the data layout of `pca$x` to find out which variables it contains.
 > Plot the rotated data in a scatter plot, with PC1 on the x-axis and PC2 on
 > the y-axis.

 > You will receive an error. How can you fix this error? Hint: try coercing
 > the data into the desired format, which is `data.frame`.

 ```{r echo=FALSE}
 ggplot(as.data.frame(pca$x)) +
    aes(x = PC1, y = PC2) +
    geom_point()
 ```

 Now we’ve lost the colour. To add it back in, we need to bind the required
 information to the rotated coordinates:

 > ```{r}
 > pc_data = cbind(iris, pca$x)
 > ```

 ```{r echo=FALSE}
 head(pc_data)
 ```

 > Plot the rotated data with colours for the categorical variable added again.

 ```{r echo=FALSE}
 ggplot(pc_data) +
    aes(x = PC1, y = PC2, color = Species) +
    geom_point(show.legend = FALSE)
 ```

 ## Separating data

 It seems clear that petal width and length correlate quite highly. However,
 this correlation may be mainly driven by species identity, rather than by a
 correlation within each species. There are different ways of teasing the
 information apart.

 One is to plot the points separately for each species:

 ```{r}
 ggplot(iris) +
    aes(x = Petal.Length, y = Petal.Width, color = Species) +
    geom_point(show.legend = FALSE) +
    facet_wrap(~ Species, scales = 'free')
 ```

 And another is to do per-category analysis:

 ```{r}
 ggplot(iris) +
    aes(x = Petal.Length, y = Petal.Width, color = Species) +
    geom_point(show.legend = FALSE) +
    geom_smooth(method = lm)
 ```

 ## Transforming the data

 Some correlation remains within each species. Let’s quantify how much, taking
 “setosa” as an example. This requires more tools, so let’s load them.

 > ```{r message=FALSE}
 > library(dplyr)
 > ```

 > Use ‹dplyr› to extract the “setosa” data points from the data frame.

 ```{r echo=FALSE}
 (setosa = iris %>% as_tibble() %>% filter(Species == 'setosa'))
 ```

 ```{r}
 ggplot(setosa) +
    aes(x = Petal.Length, y = Petal.Width) +
    geom_point() +
    geom_smooth(method = lm)
 ```

 To see how well the variables correlate, we can calculate a correlation
 coefficient:

 ```{r}
 cor(setosa$Petal.Length, setosa$Petal.Width)
 ```
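
 The same coefficient can be computed for every species at once and compared
 with the coefficient for the pooled data, which hints at how much of the
 overall correlation is driven by species identity. The sketch below uses base R
 only; `by_species`, `per_species` and `pooled` are names chosen here for
 illustration.

 ```r
 # Split the data frame into one data frame per species ...
 by_species = split(iris, iris$Species)
 # ... and compute the correlation coefficient within each of them.
 per_species = sapply(by_species,
                      function (d) cor(d$Petal.Length, d$Petal.Width))
 # For comparison: the coefficient across all species pooled together.
 pooled = cor(iris$Petal.Length, iris$Petal.Width)

 per_species
 pooled
 ```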

 Or a more comprehensive linear model:

 ```{r}
 summary(lm(Petal.Width ~ Petal.Length, setosa))
 ```
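
 Individual quantities can also be extracted from the model rather than read
 off the printed summary. A minimal sketch, which uses `subset` in place of the
 dplyr filtering step so that it runs on its own; `model` and `r_squared` are
 illustrative names:

 ```r
 # Base R equivalent of the filtering step, to keep the example self-contained.
 setosa = subset(iris, Species == 'setosa')
 model = lm(Petal.Width ~ Petal.Length, setosa)

 # Intercept and slope of the fitted line.
 coef(model)
 # Share of the variance in petal width explained by petal length.
 r_squared = summary(model)$r.squared
 # With a single predictor, this equals the squared correlation coefficient.
 all.equal(r_squared, cor(setosa$Petal.Length, setosa$Petal.Width) ^ 2)
 ```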