|
Differential expression analysis |
|
======================================================== |
|
|
|
```{r settings, include = FALSE} |
|
library("knitcitations") |
|
library("knitr") |
|
opts_chunk$set(tidy = FALSE, fig.width = 8, fig.height = 8, fig.pos = "center", |
|
cache = TRUE) |
|
``` |
|
|
|
John Blischak |
|
|
|
Genomics and Sytems Biology, University of Chicago |
|
|
|
Date: `r as.character(Sys.Date())` |
|
|
|
In this tutorial, we will perform a basic differential expression analysis with RNA sequencing data using R/Bioconductor. |
|
Before class, please download the data set and install the software as explained in the following section. |
|
|
|
## Download data and install software |
|
|
|
For this tutorial, we will use the data set generated by the Sequencing Quality Control ([SEQC][]) project. |
|
Please download the data by running the line of R code below. |
|
If this fails, you can try changing the `method` parameter, e.g. `method = "auto"`, `method = "wget"` or `method = "curl"`. |
|
And as a last resort, you can download it by clicking on this [link][geo]. |
|
|
|
[SEQC]: http://www.fda.gov/ScienceResearch/BioinformaticsTools/MicroarrayQualityControlProject/ |
|
[geo]: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE49712&format=file&file=GSE49712_HTSeq.txt.gz |
|
|
|
```{r download-data} |
|
fname <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE49712&format=file&file=GSE49712_HTSeq.txt.gz" |
|
download.file(fname, destfile = "GSE49712_HTSeq.txt.gz", method = "curl") |
|
``` |
|
|
|
Next, we need to download and install the two [Bioconductor][bioc] packages we will use to perform the analysis. |
|
|
|
[bioc]: http://www.bioconductor.org |
|
|
|
```{r install-packages, eval = FALSE} |
|
source("http://bioconductor.org/biocLite.R") |
|
biocLite(c("edgeR", "DESeq2")) |
|
``` |
|
|
|
## Introduction |
|
|
|
As you learned in class, RNA sequencing (RNA-seq) is an experimental method for measuring RNA expression levels via high-throughput sequencing of small adapter-ligated fragments (see figure below from `r citet(c(Oshlack2010="10.1038/nrg2484"))`). |
|
The number of reads that map to a gene is the proxy for its expression level. |
|
|
|
![RNA-seq experiment](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949280/bin/nihms229948f1.jpg) |
|
|
|
There are many steps between receiving the raw reads from the sequencer and gaining biological insight. |
|
In this tutorial, we will start with a "Table of counts" and end with a "List of differentially expressed genes", as diagrammed in the RNA-seq analysis pipeline below (from `r citet(c(Wang2009="10.1186/gb-2010-11-12-220"))`). |
|
|
|
![RNA-seq analysis pipeline](http://genomebiology.com/content/figures/gb-2010-11-12-220-1.jpg) |
|
|
|
Unlike the continuous data that is generated by microarrays, RNA-seq generates counts. |
|
While it is possible to transform the counts into a continuous distribution and perform an analysis similar to microarrays, e.g. linear regression, multiple approaches have been developed to directly model the data as counts (`r citet(c(Marioni2008="10.1101/gr.079558.108"))`). |
|
We will perform a simple analysis using two very popular Bioconductor packages for differential expression analysis, [edgeR][] and [DESeq2][]. |
|
|
|
[edgeR]: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html |
|
[DESeq2]: http://bioconductor.org/packages/release/bioc/html/DESeq2.html |
|
|
|
The SEQC data set will we analyze contains two different groups for comparison. |
|
Group A is five technical replicates of the [Stratagene Universal Human Reference RNA][a]. |
|
It is an equal mixture of RNA from ten different human cell lines. |
|
Group B is five techincal replicates of [Ambion’s Human Brain Reference RNA][b]. |
|
It is RNA that was pooled from several donors from several different regions of the brain. |
|
|
|
[a]: http://www.chem.agilent.com/Library/usermanuals/Public/740000.pdf |
|
[b]: http://www.lifetechnologies.com/order/catalog/product/AM6050 |
|
|
|
## edgeR |
|
|
|
edgeR (`r citet(c(Robinson2007="10.1093/bioinformatics/btm453", Robinson2010="10.1093/bioinformatics/btp616"))`) models the count data with a [negative binomial model][nb]. |
|
|
|
[nb]: http://en.wikipedia.org/wiki/Negative_binomial_distribution |
|
|
|
First, we need to load the package into R. |
|
|
|
```{r load-package} |
|
library("edgeR") |
|
``` |
|
|
|
Second, we need to load the data into R. |
|
In order to use the code below, we need to ensure the data file is in R's working directory. |
|
We can check our current working directory with the command `getwd()`, and we can change it using `setwd("path/to/file")` or using RStudio (type Ctrl-Shift-K). |
|
|
|
```{r import-data} |
|
data_raw <- read.table("GSE49712_HTSeq.txt.gz", header = TRUE) |
|
``` |
|
|
|
Let's quickly explore the data. |
|
Each column corresponds to a sample, and each row corresponds to a gene. |
|
|
|
```{r view} |
|
dim(data_raw) |
|
head(data_raw) |
|
tail(data_raw) |
|
``` |
|
|
|
Notice that the last five lines contain summary statistics and thus need to be removed prior to testing. |
|
|
|
```{r remove-sum-lines} |
|
data_clean <- data_raw[1:(nrow(data_raw) - 5), ] |
|
``` |
|
|
|
We should also remove genes that are unexpressed or very lowly expressed in the samples. |
|
One simple method to do this is to choose a cutoff based on the median log2-transformed counts per gene per million mapped reads (cpm). |
|
edgeR provides the function, `cpm`, to compute the counts-per-million. |
|
|
|
```{r expr-cutoff} |
|
median_log2_cpm <- apply(cpm(data_clean, log = TRUE), 1, median) |
|
hist(median_log2_cpm) |
|
expr_cutoff <- -1 |
|
abline(v = expr_cutoff, col = "red", lwd = 3) |
|
sum(median_log2_cpm > expr_cutoff) |
|
data_clean <- data_clean[median_log2_cpm > expr_cutoff, ] |
|
``` |
|
|
|
After removing all genes with a median log2 cpm below `r expr_cutoff`, we have `r sum(median_log2_cpm > expr_cutoff)` genes remaining. |
|
A good rule of thumb when analyzing RNA-seq data from a single cell type is to expect 9-12 thousand expressed genes. |
|
In this case, we have many more because the samples include RNA collected from many different cell types, and thus it is not surprising that many more genes are robustly expressed. |
|
|
|
**Question:** |
|
What are some ways that we could improve this simple cutoff? |
|
What information are we ignoring? |
|
|
|
We can also explore the relationships between the samples by visualizing a heatmap of the correlation matrix. |
|
The heatmap result corresponds to what we know about the data set. |
|
First, the samples in group A and B come from very different cell populations, so the two groups are very different. |
|
Second, since the samples in each group are technical replicates, the within group variance is very low. |
|
|
|
```{r heatmap} |
|
heatmap(cor(cpm(data_clean, log = TRUE))) |
|
``` |
|
|
|
Now we start preparing the data for the the test of differential expression. |
|
We create a vector called, `group`, that labels each of the columns as belonging to group A or B. |
|
We then use this vector and the gene counts to create a DGEList, which is the object that edgeR uses for storing the data from a differntial expression experiment. |
|
|
|
```{r make-groups-edgeR} |
|
group <- substr(colnames(data_clean), 1, 1) |
|
group |
|
y <- DGEList(counts = data_clean, group = group) |
|
y |
|
``` |
|
|
|
|
|
edgeR normalizes the genes counts using the method TMM (trimmed means of m values) developed by `r citet(c(Robinson2010="10.1186/gb-2010-11-3-r25"))`. |
|
Recall from lecture that the read counts for moderately to lowly expressed genes can be strongly influenced by small fluctuations in the expression level of highly expressed genes. |
|
In other words, small differences in expression of highly expressed genes between samples can give the appearance that many lower expressed genes are differentially expressed between conditions. |
|
TMM adjusts for this by removing the extremely lowly and highly expressed genes and also those genes that are very different across samples. |
|
It then compares the total counts for this subset of genes between the two samples to get the scaling factor (this is a simplification). |
|
Similar to normalization methods for microarray data, this method assumes the majority of genes are not differentially expressed between any two samples. |
|
|
|
```{r normalize-edgeR} |
|
y <- calcNormFactors(y) |
|
y$samples |
|
``` |
|
|
|
The next step is to model the variance of the read counts per gene. |
|
A natural method for modeling gene counts is the Poisson distribution. |
|
However, the Poisson assumes the mean and variance are identical, but it has been found emipirically that the variance in RNA-seq measurements of gene expression are larger than the mean (termed "overdispersion"). |
|
So instead the negative binomial distribution is used, which has a dispersion parameter for modeling the increase in variance from a Poisson process. |
|
edgeR treats the Poisson variance as simple sampling variance, and refers to the dispersion estimate as the "biolocial coefficient of variation." |
|
Though it should be mentioned that any technical biases are also included in this estimate. |
|
edgeR shares information across genes to determine a common dispersion. |
|
It then calculates a dispersion esitmate per gene and shrinks it towards the common dispersion. |
|
The gene-specific (referred to in edgeR as tagwise) dispersion estimates are used in the test for differential expression. |
|
|
|
**Note:** edgeR also has a trended dispersion to use in place of the common dispersion for more complicated experimental designs. |
|
|
|
```{r model-edgeR} |
|
y <- estimateCommonDisp(y, verbose = TRUE) |
|
y <- estimateTagwiseDisp(y) |
|
sqrt(y$common.dispersion) # biological coefficient of variation |
|
plotBCV(y) |
|
``` |
|
|
|
The biological coefficient of variation is lower than normally seen in human studies (~0.4) because the samples are technical replicates. |
|
|
|
edgeR tests for differential expression between two classes using a modified version of the Fisher's Exact Test. |
|
|
|
**Note:** It also provides a generalized linear model frameworks for more complicated experimental designs. |
|
|
|
```{r test-edgeR} |
|
et <- exactTest(y) |
|
results_edgeR <- topTags(et, n = nrow(data_clean), sort.by = "none") |
|
head(results_edgeR$table) |
|
``` |
|
|
|
How many genes are differentially expressed at an FDR of 10%? |
|
|
|
```{r count-de-edgeR} |
|
sum(results_edgeR$table$FDR < .1) |
|
plotSmear(et, de.tags = rownames(results_edgeR)[results_edgeR$table$FDR < .01]) |
|
abline(h = c(-2, 2), col = "blue") |
|
``` |
|
|
|
As expected from the description of the samples and the heatmap, there are many differentially expressed genes. |
|
The [MA plot][ma] above plots the log2 fold change on the y-axis versus the average log2 counts-per-million on the x-axis. |
|
The red dots are genes with an FDR less than 10%. |
|
The blue lines represent a four-fold change in expression. |
|
|
|
[ma]: https://en.wikipedia.org/wiki/MA_plot |
|
|
|
## DESeq2 |
|
|
|
DESeq2 (`r citet(c(Love2014="10.1101/002832"))`) also models the gene counts as arising from a [negative binomial distribution][nb]. |
|
|
|
```{r load-DESeq2} |
|
library("DESeq2") |
|
dds <- DESeqDataSetFromMatrix(data_clean, as.data.frame(group), ~ group) |
|
dds |
|
``` |
|
|
|
For data exploration, the DESeq2 package provides a more sophisticated version of edgeR's `cpm` function which shrinks the dispersion for lowly expressed genes. |
|
The function is called `rlog`, for **r**egularlized **log** transformation. |
|
|
|
**Note:** `rlog` is a newer feature and may not be available for older versions of R. |
|
|
|
```{r heatmap-DESeq2} |
|
rld <- rlog(dds) |
|
rlogMat <- assay(rld) |
|
colnames(rlogMat) <- group |
|
heatmap(cor(rlogMat)) |
|
``` |
|
|
|
Similar to edgeR, DESeq2 assumes that most genes are not differentially expressed. |
|
Each sample is assigned a scaling factor which is computed by comparing the gene counts of the sample to the median gene counts of all the samples combined (`r citet(c(Anders2010="10.1186/gb-2010-11-10-r106"))`). |
|
|
|
```{r normalize-DESeq2} |
|
dds <- estimateSizeFactors(dds) |
|
``` |
|
|
|
For the dispersion, DESeq2 fits gene-wise dispersion estimates and then shrinks them towards a trended dispersion estimated by sharing information across genes. |
|
We can see in the figure below that the dispersion estimates are shrunk towards the red trend line. |
|
|
|
```{r model-DESeq2} |
|
dds <- estimateDispersions(dds) |
|
plotDispEsts(dds) |
|
``` |
|
|
|
For testing for differentially expressed genes, DESeq2 uses a negative binomial generalized linear model (GLM). |
|
|
|
**Note:** This GLM allows for more complicated experimental designs. |
|
|
|
```{r test-DESeq2} |
|
dds <- nbinomWaldTest(dds) |
|
results_DESeq2 <- results(dds) |
|
results_DESeq2 |
|
``` |
|
|
|
How many genes are differentially expressed at an FDR of 10%? |
|
|
|
```{r count-de-DESeq2} |
|
sum(results_DESeq2@listData$padj < .1, na.rm = TRUE) |
|
plotMA(results_DESeq2) |
|
``` |
|
|
|
Similar to the edgeR results, many genes are differntially expressed. |
|
The red dots are differentially expressed genes. |
|
|
|
DESeq2 also provides a convenience function that will perform the normalization, modeling, and testing steps. |
|
|
|
```{r quick-DESeq2} |
|
dds <- DESeq(dds) |
|
res <- results(dds) |
|
res |
|
``` |
|
|
|
## Comparison |
|
|
|
How similar are the FDR values obtained with edgeR and DESeq2? |
|
|
|
```{r compare-fdr} |
|
fdr_edgeR <- results_edgeR$table$FDR |
|
fdr_DESeq2 <- results_DESeq2@listData$padj |
|
plot(fdr_edgeR, fdr_DESeq2) |
|
abline(lm(fdr_DESeq2 ~ fdr_edgeR), col = "red") |
|
``` |
|
|
|
For a more rigorous comparison of these and other methods, see `r citep(c(Rapaport2013="10.1186/gb-2013-14-9-r95", Soneson2013="10.1186/1471-2105-14-91"))`. |
|
|
|
## To learn more |
|
|
|
Check out the manuals for [edgeR][erman] and [DESeq2][dsman], search the Bioconductor mailing list [archives][], and consider joining the Bioconductor mailing [list][]. |
|
The [source code][gist] for this lesson is availabe as a GitHub Gist. |
|
|
|
[erman]: http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf |
|
[dsman]: http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf |
|
[archives]: http://dir.gmane.org/gmane.science.biology.informatics.conductor |
|
[list]: http://www.bioconductor.org/help/mailing-list/ |
|
[gist]: https://gist.github.com/jdblischak/11384914 |
|
|
|
## References |
|
|
|
```{r refs, results = 'asis', echo = FALSE} |
|
bibliography("html") |
|
``` |
|
|
|
## Session information |
|
|
|
```{r info} |
|
sessionInfo() |
|
``` |
Great tutorial, you are teaching your students some great techniques!
Under 'How many genes are differentially expressed at an FDR of 10%?', do you mean to plot it as:
plotSmear(et, de.tags = rownames(results_edgeR)[results_edgeR$table$FDR < .1]) # Switched from <0.01