Skip to content

Instantly share code, notes, and snippets.

@jdblischak
Last active November 5, 2024 13:34
Show Gist options
  • Save jdblischak/11384914 to your computer and use it in GitHub Desktop.
Save jdblischak/11384914 to your computer and use it in GitHub Desktop.
rnaseq-de-tutorial

This was a tutorial I presented for the class Genomics and Systems Biology at the University of Chicago on Tuesday, April 29, 2014. The students had been learning about study design, normalization, and statistical testing for genomic studies. This was meant to introduce them to how these ideas are implemented in practice. The specific example is a differential expression analysis with RNA-seq data for a two-class comparison.

To render this lesson, you'll need to first install the R package knitr and the R/Bioconductor packages edgeR and DESeq2. You can then create the lesson by running the following from the R console:

library("knitr")
knit2html("rnaseq-de-tutorial.Rmd", envir = new.env())

One known issue is that if you do not have the latest version of DESeq2 because you have an older version of R, the function rlog may not be available.

---
title: Differential expression analysis
author: John Blischak
date: 2015-04-21
output:
html_document:
toc: true
---
```{r settings, include = FALSE}
library("knitcitations")
library("knitr")
opts_chunk$set(tidy = FALSE, fig.width = 8, fig.height = 8, fig.pos = "center",
cache = FALSE)
```
In this tutorial, we will perform a basic differential expression analysis with RNA sequencing data using R/Bioconductor.
Before class, please download the data set and install the software as explained in the following section.
## Download data and install software
For this tutorial, we will use the data set generated by the Sequencing Quality Control ([SEQC][]) project.
Please download the data by running the line of R code below.
If this fails, you can try changing the `method` parameter, e.g. `method = "auto"`, `method = "wget"` or `method = "curl"`.
And as a last resort, you can download it by clicking on this [link][geo].
[SEQC]: http://www.fda.gov/ScienceResearch/BioinformaticsTools/MicroarrayQualityControlProject/
[geo]: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE49712&format=file&file=GSE49712_HTSeq.txt.gz
```{r download-data}
fname <- "http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE49712&format=file&file=GSE49712_HTSeq.txt.gz"
download.file(fname, destfile = "GSE49712_HTSeq.txt.gz")
```
Next, we need to download and install the two [Bioconductor][bioc] packages we will use to perform the analysis.
[bioc]: http://www.bioconductor.org
```{r install-packages, eval = FALSE}
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")
```
## Introduction
As you learned in class, RNA sequencing (RNA-seq) is an experimental method for measuring RNA expression levels via high-throughput sequencing of small adapter-ligated fragments (see figure below from `r citet(c(Oshlack2010="10.1038/nrg2484"))`).
The number of reads that map to a gene is the proxy for its expression level.
![RNA-seq experiment](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949280/bin/nihms229948f1.jpg)
There are many steps between receiving the raw reads from the sequencer and gaining biological insight.
In this tutorial, we will start with a "Table of counts" and end with a "List of differentially expressed genes", as diagrammed in the RNA-seq analysis pipeline below (from `r citet(c(Wang2009="10.1186/gb-2010-11-12-220"))`).
![RNA-seq analysis pipeline](http://genomebiology.com/content/figures/gb-2010-11-12-220-1.jpg)
Unlike the continuous data that is generated by microarrays, RNA-seq generates counts.
While it is possible to transform the counts into a continuous distribution and perform an analysis similar to microarrays, e.g. linear regression, multiple approaches have been developed to directly model the data as counts (`r citet(c(Marioni2008="10.1101/gr.079558.108"))`).
We will perform a simple analysis using one popular Bioconductor package for differential expression analysis, [edgeR][].
edgeR (`r citet(c(Robinson2007="10.1093/bioinformatics/btm453", Robinson2010="10.1093/bioinformatics/btp616"))`) models the count data with a [negative binomial model][nb].
[edgeR]: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
[nb]: http://en.wikipedia.org/wiki/Negative_binomial_distribution
The SEQC data set will we analyze contains two different groups for comparison.
Group A is five technical replicates of the [Stratagene Universal Human Reference RNA][a].
It is an equal mixture of RNA from ten different human cell lines.
Group B is five techincal replicates of [Ambion’s Human Brain Reference RNA][b].
It is RNA that was pooled from several donors from several different regions of the brain.
[a]: http://www.chem.agilent.com/Library/usermanuals/Public/740000.pdf
[b]: http://www.lifetechnologies.com/order/catalog/product/AM6050
## Setup
First, we need to load the package into R.
```{r load-package}
library("edgeR")
```
Second, we need to load the data into R.
In order to use the code below, we need to ensure the data file is in R's working directory.
We can check our current working directory with the command `getwd()`, and we can change it using `setwd("path/to/file")` or using RStudio (type Ctrl-Shift-K).
```{r import-data}
data_raw <- read.table("GSE49712_HTSeq.txt.gz", header = TRUE)
```
## Quality control
Let's quickly explore the data.
Each column corresponds to a sample, and each row corresponds to a gene.
```{r view}
dim(data_raw)
head(data_raw)
tail(data_raw)
```
Notice that the last five lines contain summary statistics and thus need to be removed prior to testing.
```{r remove-sum-lines}
data_clean <- data_raw[1:(nrow(data_raw) - 5), ]
```
We should also remove genes that are unexpressed or very lowly expressed in the samples.
One simple method to do this is to choose a cutoff based on the median log2-transformed counts per gene per million mapped reads (cpm).
edgeR provides the function, `cpm`, to compute the counts-per-million.
```{r expr-cutoff}
cpm_log <- cpm(data_clean, log = TRUE)
median_log2_cpm <- apply(cpm_log, 1, median)
hist(median_log2_cpm)
expr_cutoff <- -1
abline(v = expr_cutoff, col = "red", lwd = 3)
sum(median_log2_cpm > expr_cutoff)
data_clean <- data_clean[median_log2_cpm > expr_cutoff, ]
```
After removing all genes with a median log2 cpm below `r expr_cutoff`, we have `r sum(median_log2_cpm > expr_cutoff)` genes remaining.
A good rule of thumb when analyzing RNA-seq data from a single cell type is to expect 9-12 thousand expressed genes.
In this case, we have many more because the samples include RNA collected from many different cell types, and thus it is not surprising that many more genes are robustly expressed.
**Question:**
What are some ways that we could improve this simple cutoff?
What information are we ignoring?
After filtering lowly expressed genes, we recalculate the log cpm.
```{r}
cpm_log <- cpm(data_clean, log = TRUE)
```
We can also explore the relationships between the samples by visualizing a heatmap of the correlation matrix.
The heatmap result corresponds to what we know about the data set.
First, the samples in group A and B come from very different cell populations, so the two groups are very different.
Second, since the samples in each group are technical replicates, the within group variance is very low.
```{r heatmap}
heatmap(cor(cpm_log))
```
Another method to view the relationships between samples is principal components analysis (PCA).
```{r pca}
pca <- prcomp(t(cpm_log), scale. = TRUE)
plot(pca$x[, 1], pca$x[, 2], pch = ".", xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = colnames(cpm_log))
summary(pca)
```
## Two group comparison
Now we start preparing the data for the the test of differential expression.
We create a vector called, `group`, that labels each of the columns as belonging to group A or B.
We then use this vector and the gene counts to create a DGEList, which is the object that edgeR uses for storing the data from a differntial expression experiment.
```{r make-groups-edgeR}
group <- substr(colnames(data_clean), 1, 1)
group
y <- DGEList(counts = data_clean, group = group)
y
```
edgeR normalizes the genes counts using the method TMM (trimmed means of m values) developed by `r citet(c(Robinson2010="10.1186/gb-2010-11-3-r25"))`.
Recall from lecture that the read counts for moderately to lowly expressed genes can be strongly influenced by small fluctuations in the expression level of highly expressed genes.
In other words, small differences in expression of highly expressed genes between samples can give the appearance that many lower expressed genes are differentially expressed between conditions.
TMM adjusts for this by removing the extremely lowly and highly expressed genes and also those genes that are very different across samples.
It then compares the total counts for this subset of genes between the two samples to get the scaling factor (this is a simplification).
Similar to normalization methods for microarray data, this method assumes the majority of genes are not differentially expressed between any two samples.
```{r normalize-edgeR}
y <- calcNormFactors(y)
y$samples
```
The next step is to model the variance of the read counts per gene.
A natural method for modeling gene counts is the Poisson distribution.
However, the Poisson assumes the mean and variance are identical, but it has been found emipirically that the variance in RNA-seq measurements of gene expression are larger than the mean (termed "overdispersion").
So instead the negative binomial distribution is used, which has a dispersion parameter for modeling the increase in variance from a Poisson process.
edgeR treats the Poisson variance as simple sampling variance, and refers to the dispersion estimate as the "biolocial coefficient of variation."
Though it should be mentioned that any technical biases are also included in this estimate.
edgeR shares information across genes to determine a common dispersion.
It then calculates a dispersion esitmate per gene and shrinks it towards the common dispersion.
The gene-specific (referred to in edgeR as tagwise) dispersion estimates are used in the test for differential expression.
**Note:** edgeR also has a trended dispersion to use in place of the common dispersion for more complicated experimental designs.
```{r model-edgeR}
y <- estimateCommonDisp(y, verbose = TRUE)
y <- estimateTagwiseDisp(y)
sqrt(y$common.dispersion) # biological coefficient of variation
plotBCV(y)
```
The biological coefficient of variation is lower than normally seen in human studies (~0.4) because the samples are technical replicates.
edgeR tests for differential expression between two classes using a modified version of the Fisher's Exact Test.
**Note:** It also provides a generalized linear model frameworks for more complicated experimental designs.
```{r test-edgeR}
et <- exactTest(y)
results_edgeR <- topTags(et, n = nrow(data_clean), sort.by = "none")
head(results_edgeR$table)
```
How many genes are differentially expressed at an FDR of 10%?
```{r count-de-edgeR}
sum(results_edgeR$table$FDR < .1)
plotSmear(et, de.tags = rownames(results_edgeR)[results_edgeR$table$FDR < .01])
abline(h = c(-2, 2), col = "blue")
```
As expected from the description of the samples and the heatmap, there are many differentially expressed genes.
The [MA plot][ma] above plots the log2 fold change on the y-axis versus the average log2 counts-per-million on the x-axis.
The red dots are genes with an FDR less than 10%.
The blue lines represent a four-fold change in expression.
[ma]: https://en.wikipedia.org/wiki/MA_plot
## Adding covariates
The previous example was a two group comparison, but if you have additional covariates, you'll need to use a generalized linear model (GLM) framework.
Let's say we processed the samples in two batches, the samples were collected from males and females, and from a range of ages.
```{r}
set.seed(123)
batch <- sample(c("one", "two"), 10, replace = TRUE)
sex <- sample(c("M", "F"), 10, replace = TRUE)
age <- sample(18:50, 10, replace = TRUE)
```
We need to create a design matrix to describe the statistical model.
```{r}
y <- DGEList(data_clean)
y <- calcNormFactors(y)
design <- model.matrix(~group + batch + sex + age)
design
```
And now we test.
```{r}
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef = 2)
topTags(lrt)
```
And here is an example gene.
```{r}
boxplot(as.numeric(data_clean["HBB", ]) ~ group)
```
## To learn more
Check out the manual for [edgeR][erman], search the Bioconductor mailing list [archives][], and consider joining the Bioconductor mailing [list][].
For a rigorous comparison RNA-seq methods, see `r citep(c(Rapaport2013="10.1186/gb-2013-14-9-r95", Soneson2013="10.1186/1471-2105-14-91"))`.
The [source code][gist] for this lesson is availabe as a GitHub Gist.
[erman]: http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
[archives]: http://dir.gmane.org/gmane.science.biology.informatics.conductor
[list]: http://www.bioconductor.org/help/mailing-list/
[gist]: https://gist.github.com/jdblischak/11384914
## References
```{r refs, results = 'asis', echo = FALSE}
bibliography()
```
## Session information
```{r info}
sessionInfo()
```
@sarahgrogan
Copy link

Great tutorial, you are teaching your students some great techniques!

Under 'How many genes are differentially expressed at an FDR of 10%?', do you mean to plot it as:
plotSmear(et, de.tags = rownames(results_edgeR)[results_edgeR$table$FDR < .1]) # Switched from <0.01

@jdblischak
Copy link
Author

Great tutorial, you are teaching your students some great techniques!

Thanks, @sarahgrogan!

Under 'How many genes are differentially expressed at an FDR of 10%?', do you mean to plot it as:
plotSmear(et, de.tags = rownames(results_edgeR)[results_edgeR$table$FDR < .1]) # Switched from <0.01

Good catch! I've fixed the typo.

P.S. GitHub doesn't send notifications when someone comments on one of your Gists, so my apologies for the delayed response (and potential future delayed responses). The only reason I saw this was because someone asked me about this file today.

@nsamanta
Copy link

nsamanta commented Dec 2, 2021

On line 257, why did you choose to plot for gene "HBB"?

@jdblischak
Copy link
Author

On line 257, why did you choose to plot for gene "HBB"?

@nsamanta I don't recall why I specifically chose HBB. I wanted to display a differentially expressed gene, so presumably I randomly chose one of the top 10. Or more likely, I probably chose it because it was the shortest gene name of the top 10 (I used this for a live class demo, so the less typing the better).

> topTags(lrt)
Coefficient:  groupB 
              logFC    logCPM        LR PValue FDR
C1QB       14.56743  4.713163  3246.549      0   0
LINC00320  13.69255  3.885942  2177.743      0   0
SLC17A6    13.25069  3.418738  1798.250      0   0
MAGEA3    -12.89692  4.421769  2336.185      0   0
MAGEA6    -12.82022  4.344894  2256.837      0   0
CARTPT     12.72237  2.871257  1541.809      0   0
GFAP       12.67005 11.477296 11726.685      0   0
SPHKAP     12.66812  5.957807  5768.492      0   0
NEUROD6    12.43813  3.934289  2270.489      0   0
HBB        12.42367  5.415885  4283.239      0   0

@nsamanta
Copy link

nsamanta commented Dec 2, 2021

Thank you @jdblischak! Your tutorial has been very helpful for a starter like me. I was struggling to understand doing DE analysis using several covariates.

I don't see a makecontrasts() function in your multivariate model. Does coef =2 has a similar function?

@jdblischak
Copy link
Author

Your tutorial has been very helpful for a starter like me. I was struggling to understand doing DE analysis using several covariates.

Glad to hear you're finding it useful!

I don't see a makecontrasts() function in your multivariate model. Does coef =2 has a similar function?

I didn't need to use makeContrasts() here because I only had one contrast of interest (group). Since my model has an intercept term, the effect of group is captured by the second coefficient, and hence I tested coef = 2.

makeContrasts() is useful when you have many comparisons to make, and especially if you use the group-means parametrization (ie a model with no intercept).

Here are some additional resources you may find informative:

These all use limma. But edgeR was written by the same authors, so the modeling advice applies to both.

@nsamanta
Copy link

nsamanta commented Dec 2, 2021

Is it possible to see the output for lrt <- glmLRT(fit, coef = 2) (line 250) above. Thank you!

@jdblischak
Copy link
Author

I'm confused. This tutorial is meant to be reproducible. Are you running into errors running it yourself on your local computer?

@nsamanta
Copy link

nsamanta commented Dec 6, 2021

@jdblischak Yes, I was running into some errors but was finally able to go through the complete tutorial. Sorry for the confusion! I followed both this and the other tutorial you sent the link to, very helpful! and I have some qq-

My contrast group looks like so, I am trying to adjust for age and batch effect-

   Contrasts
Levels        mari_vs_cont tobacco_vs_cont mari_vs_tobacco
  control               -1              -1               0
  marijuana              1               0               1
  tobacco                0               1              -1
  batchbatch2            0               0               0
  age                    0               0               0

I wasn't sure if using the coef = operator would work for me, so I used the following-
lrt_mc <- glmLRT(fit, contrast=makeContrasts(mari_vs_cont = marijuana - control, levels = colnames(design)))
I ran the above three of my contrast groups.

I am still a beginner and wasn't sure if this is the correct syntax? When trying to specify the contrast group using the contrast = operator, I kept getting an error that the specified group could not be found

@nsamanta
Copy link

nsamanta commented Dec 6, 2021

@jdblischak Also the design matrix for my above issue looks like so-

design
   control marijuana tobacco batchbatch2 age
1        0         1       0           0  22
2        0         1       0           0  23
3        0         1       0           0  24
4        0         1       0           0  25
5        0         1       0           0  25
6        0         1       0           0  29
7        0         1       0           0  30
8        0         1       0           0  30
9        0         1       0           0  30
10       0         1       0           0  30
11       0         1       0           0  30
12       0         1       0           0  31
13       0         1       0           0  31
14       0         1       0           0  32
15       0         1       0           0  33
16       0         0       1           0  34
17       0         0       1           0  34
18       0         0       1           0  35
19       0         0       1           0  35
20       0         0       1           0  36
21       0         0       1           0  36
22       0         0       1           0  37
23       0         0       1           0  37
24       0         0       1           0  38
25       0         0       1           0  38
26       0         0       1           0  38
27       0         0       1           0  39
28       0         0       1           0  39
29       0         0       1           0  39
30       0         0       1           0  40
31       0         0       1           1  40
32       1         0       0           1  41
33       1         0       0           1  41
34       1         0       0           1  41
35       1         0       0           1  47
36       1         0       0           1  50
37       1         0       0           1  50
38       1         0       0           1  50
39       1         0       0           1  53
40       1         0       0           1  54
41       1         0       0           1  55
attr(,"assign")
[1] 1 1 1 2 3
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"

attr(,"contrasts")$batch
[1] "contr.treatment"

I used the likelihood ratio test (GLMlrt) for DE analysis

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment