Introduction/tl;dr: I wrote this post as a reference for a few new graduate students in my department who are getting started with RNA-seq data analysis. It begins with an informal, big-picture overview of RNA-seq, and the rest of the post outlines one standard RNA-seq workflow. A heads-up for general audiences: the post goes into quite a bit of nitty-gritty detail that's specific to our department's computing setup.
RNA-seq is a high-throughput technology used to measure gene expression in cell populations. For a super bare-bones picture of what gene expression is, please enjoy this ASCII art I made to illustrate the process:
[DNA] ACGTAGGT{CGTATTT}AGCGT{AGCGCCCGA}TTACA
|
transcription
|
V
[pre-RNA] UGCAUCCA{GCAUAAA}UCGCA{UCGCGGGCU}AAUGU
|
splicing
|
V
[RNA] {GCAUAAA}{UCGCGGGCU}
As shown in the awesome ASCII illustration, RNA molecules are basically strings of characters (G, C, U, or A; referred to as "nucleotides" or "bases"). RNA molecules are transcribed from DNA, and mature RNA consists only of the "expressed" parts of a gene (denoted in the illustration between curly brackets). Sequencing machines read off the base sequences of the mature RNA molecules in the cells in question. Base calling is kind of a hard problem, and current technologies can only read off about 100 nucleotides in a row, even though actual human RNA transcripts can be hundreds or thousands of bases long. These ~100-base reads are then written out into a text file in FASTQ format. These are what I consider "raw" RNA-seq data.
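For reference, each read takes up four lines in a FASTQ file: an ID line starting with @, the base sequence itself, a separator line starting with +, and a string of characters encoding the quality (confidence) of each base call, one character per base. Here's a made-up example record (the read name, sequence, and qualities are all invented for illustration):

@read_001
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**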
We can do lots of fun and scientifically interesting things with RNA-seq data. I am mostly in the business of developing statistical methods to determine whether specific RNA transcripts are expressed at different levels in different cell populations (differential expression analysis). You can measure a transcript's expression level, or abundance in the cell, by figuring out how many RNA-seq reads came from that particular transcript. You can also do a lot of other cool stuff with RNA-seq, like discover new genes/transcripts, compare genes' expression levels to each other, or study the expression profile of a gene or transcript over time (e.g., which genes are expressed while a cell is developing from a stem cell into a heart cell?). Since there is so much we can learn from this type of data, it's being used in research on complex diseases (e.g., cancer, psychiatric disease), evolutionary biology, organism development, and lots of other areas. If you'd like a more formal overview of RNA-seq, I highly recommend this review paper (open access!). A more formal writeup of the analysis process is available in this protocols paper (also open access).
Here's what you'll need for this workflow:
- A Linux cluster with the Sun Grid Engine (SGE) scheduling system. (Definitely not required for RNA-seq analysis, but the workflow described here assumes it.)
- TopHat
- Cufflinks
- A reference/index/set of annotations for the organism you're analyzing. I recommend downloading a reference from this page. Any of the choices for your organism are fine.
- A good text editor. I use Sublime on my laptop, and I transfer files to the cluster using sftp (sometimes with the Cyberduck GUI). To edit text files directly on the cluster, I use emacs, opening files with emacs -nw myfile.sh. When I'm done editing the file, I close it with ctrl-x ctrl-c, then hit "y" to save.
- Data. This pipeline will work if you're starting with either FASTQ files (raw reads) or .bam files (read alignments).
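As an example of the sftp transfer, a bare-bones session pushing raw data up to the cluster might look like this (the hostname is just the example cluster from this post; cd, put, and exit are standard sftp commands):

sftp afrazee@enigma2
sftp> cd /ProjectName/data
sftp> put sample1_1.fastq
sftp> exit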
I usually set up the directory for an RNA-seq analysis like this, if I'm starting with raw reads:
ProjectName/
├── data/
│   ├── sample1_1.fastq
│   ├── sample1_2.fastq
│   ├── sample2_1.fastq
│   ├── sample2_2.fastq
│   ├── ...
│   ├── sampleN_1.fastq
│   └── sampleN_2.fastq
├── alignments/
│   └── scripts/
├── assemblies/
│   ├── scripts/
│   └── merged/
├── DE_analysis/
└── reference/
    └── Homo_sapiens/
(For transcript assembly with Cufflinks, we usually use paired-end reads, which is why there are two FASTQ files for each sample.)
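By the way, if your shell supports brace expansion (bash does), you can create this whole skeleton in one command. A minimal sketch, run from the directory where you want the project to live:

mkdir -p ProjectName/{data,alignments/scripts,assemblies/{scripts,merged},DE_analysis,reference/Homo_sapiens}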
If I'm starting from read alignments (bam files) instead, my setup looks like this:
ProjectName/
├── alignments/
│   ├── sample1.bam
│   ├── sample2.bam
│   ├── ...
│   └── sampleN.bam
├── assemblies/
│   ├── scripts/
│   └── merged/
├── DE_analysis/
└── reference/
    └── Homo_sapiens/
You don't need new reference files for every experiment you analyze, but for this example, I'm just going to assume the reference is located in your project directory, so that the paths in the scripts make sense.
Step 1: Align reads with TopHat

input: RNA-seq reads, in FASTQ or FASTA format (extensions .fastq, .fq, .fasta, or .fa). They can be zipped/compressed.
output: read alignments, in .bam format. (BAM is the compressed form of SAM, i.e., "sequence alignment/map" format, described here.)
task: determine where on the genome (DNA) each RNA-seq read came from
run how many times: once for each sample in your experiment
move to step 2 if: you already have read alignments in .bam format
TopHat has to be run separately on every sample. For datasets with several samples, I like to run the TopHat jobs in parallel. I also like to use multiple threads/cores to run each job, to improve speed; this is defined by the -p argument to TopHat. So for each sample, I submit a bash script that looks like this (defining some variables before the actual TopHat command so the command isn't totally unreadable):
#!/bin/sh
SAMPLE_ID=1
TOPHAT_BINARY=/path/to/tophat/executable/tophat
GENE_REFERENCE=/ProjectName/reference/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf
BOWTIE_INDEX=/ProjectName/reference/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome
P=4 #use 4 threads
$TOPHAT_BINARY -G $GENE_REFERENCE -p $P -o /ProjectName/alignments/sample${SAMPLE_ID} $BOWTIE_INDEX /ProjectName/data/sample${SAMPLE_ID}_1.fastq /ProjectName/data/sample${SAMPLE_ID}_2.fastq
Note that there are a LOT of TopHat parameters you can set. I generally use the defaults for most of them. Above, I specified -G and a path to a genes.gtf file, which means that I first want to align reads to the transcriptome (i.e., known RNA), and then map any remaining unmapped reads back to the genome. (One case where a read would map to the genome but not the transcriptome is if it came from a retained intron, so its sequence wouldn't appear in the annotated transcriptome.) I also specified -p to tell TopHat to use multiple threads/cores (this is very helpful in terms of speed). The one parameter I didn't set above, but that I do want to mention, is the -r parameter, the mate inner distance: if you have paired-end reads, you should set the mate inner distance to be the fragment length minus twice the read length (if you don't know what this means, talk to the person who gave you the data). The default is 50.
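To make the -r arithmetic concrete, here's a tiny sketch with made-up library values (get the real numbers from whoever prepared your libraries):

FRAGMENT_LENGTH=300 # average cDNA fragment size (made-up value)
READ_LENGTH=100 # length of each mate (made-up value)
MATE_INNER_DIST=$((FRAGMENT_LENGTH - 2 * READ_LENGTH)) # = 100
# then add "-r $MATE_INNER_DIST" to the tophat command above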
I don't really want to make a script by hand for every single sample, nor do I want to manually submit them all with qsub; there could be hundreds of samples in my experiment. To get around this, I create a master shell script, tophat.sh, in the main project directory. The master script creates the sample-specific scripts automatically, using the heredoc syntax cat > filename.sh <<EOF, and submits them with qsub. For the experiment outlined here, my tophat.sh file would look like this:
#!/bin/sh
TOPHAT_BINARY=/path/to/tophat/executable/tophat
GENE_REFERENCE=/ProjectName/reference/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf
BOWTIE_INDEX=/ProjectName/reference/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome
P=4 #use 4 threads per job
for SAMPLE_ID in {1..N} #replace N with your number of samples
do
cat > /ProjectName/alignments/scripts/tophat_${SAMPLE_ID}.sh <<EOF
#!/bin/sh
$TOPHAT_BINARY -G $GENE_REFERENCE -p $P -o /ProjectName/alignments/sample${SAMPLE_ID} $BOWTIE_INDEX /ProjectName/data/sample${SAMPLE_ID}_1.fastq /ProjectName/data/sample${SAMPLE_ID}_2.fastq
mv /ProjectName/alignments/sample${SAMPLE_ID}/accepted_hits.bam /ProjectName/alignments/sample${SAMPLE_ID}.bam
EOF
qsub -l mf=20G,h_vmem=5G -pe local $P -m e -M [email protected] /ProjectName/alignments/scripts/tophat_${SAMPLE_ID}.sh
done
Then, from the ProjectName directory, just execute tophat.sh:
[afrazee@enigma2 ProjectName]$ sh tophat.sh
Your job tophat_sample1.sh has been submitted
Your job tophat_sample2.sh has been submitted
...
Your job tophat_sampleN.sh has been submitted
You'll get an email at [email protected] when each job finishes. They'll take a few hours each.
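In the meantime, you can keep an eye on the jobs from the login node with SGE's qstat command:

qstat -u $USER # one row per job; state qw = waiting in queue, r = running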
If you look in the scripts subdirectory of the alignments folder, you can see the sample-specific scripts that the master script created. When your jobs are finished, your project directory will look like this (note that the final alignments end up in the alignments directory due to the mv command at the end of each sample-specific script):
ProjectName/
├── tophat.sh
├── data/
│   ├── sample1_1.fastq
│   ├── sample1_2.fastq
│   ├── sample2_1.fastq
│   ├── sample2_2.fastq
│   ├── ...
│   ├── sampleN_1.fastq
│   └── sampleN_2.fastq
├── alignments/
│   ├── sample1.bam
│   ├── sample2.bam
│   ├── ...
│   ├── sampleN.bam
│   ├── sample1/
│   │   └── [...other TopHat output...]
│   ├── sample2/
│   │   └── [...other TopHat output...]
│   ├── ...
│   ├── sampleN/
│   │   └── [...other TopHat output...]
│   └── scripts/
│       ├── tophat_sample1.sh
│       ├── tophat_sample2.sh
│       ├── ...
│       └── tophat_sampleN.sh
├── assemblies/
│   ├── scripts/
│   └── merged/
├── DE_analysis/
└── reference/
    └── Homo_sapiens/
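One optional sanity check before assembling: if samtools happens to be installed on your cluster (it's not part of this pipeline, so this is just a suggestion), flagstat gives a quick summary of each alignment file:

samtools flagstat /ProjectName/alignments/sample1.bam # total reads, number mapped, properly paired, etc.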
Step 2: Assemble transcripts with Cufflinks

input: read alignments, in .bam format
output: a transcript assembly, in GTF format. (A GTF file is a plain-text file that specifies locations, strands, and structures of genomic features. Usually there is one row per exon/coding sequence, and the rightmost column specifies which exons belong to which transcripts/genes.)
task: reconstruct the set of RNA transcripts generated by each gene in each sample, based on the RNA-seq reads
run how many times: once for each sample in your experiment
After you align the RNA-seq reads to the genome, you are ready to use Cufflinks to reconstruct the transcripts present in your experiment based on those alignments. We need to assemble the transcriptome for each sample separately; the assemblies will be merged (in step 3) to create an overall transcriptome assembly for the experiment. Here, I'm assuming you want to do a totally de novo assembly (i.e., reconstruct the transcripts without using annotation to help). For instructions on doing a guided assembly, and to see the myriad other options available, check out the manual.
I use the same type of script structure for Cufflinks as I do for TopHat (see previous step): write a master script, cufflinks.sh, in the main project directory, which creates and submits the sample-specific scripts. My master cufflinks.sh script would look like this:
#!/bin/sh
CUFFLINKS_BINARY=/path/to/cufflinks/binary/cufflinks
P=4 #use 4 threads/cores
for SAMPLE_ID in {1..N} #replace N with your number of samples
do
cat > /ProjectName/assemblies/scripts/cufflinks_${SAMPLE_ID}.sh <<EOF
#!/bin/sh
$CUFFLINKS_BINARY -q -p $P -o /ProjectName/assemblies/sample${SAMPLE_ID} /ProjectName/alignments/sample${SAMPLE_ID}.bam
mv /ProjectName/assemblies/sample${SAMPLE_ID}/transcripts.gtf /ProjectName/assemblies/sample${SAMPLE_ID}_transcripts.gtf
EOF
qsub -l mf=20G,h_vmem=5G -m e -M [email protected] -pe local $P /ProjectName/assemblies/scripts/cufflinks_${SAMPLE_ID}.sh
done
The -q option specifies that you want "quiet" output (the non-quiet output is very, very dense), and the -p option again specifies how many cores/threads to use. In my experience, multithreading gives you more of a speedup with TopHat than with Cufflinks (alignment is more easily parallelized), but it's still worth running Cufflinks on more than one core if you can.
And then, from the ProjectName directory, I run:
[afrazee@enigma2 ProjectName]$ sh cufflinks.sh
Your job cufflinks_sample1.sh has been submitted
Your job cufflinks_sample2.sh has been submitted
...
Your job cufflinks_sampleN.sh has been submitted
Again, you'll get an email notification when each sample's assembly is finished. When all the jobs are done, your project directory will now look something like this:
ProjectName/
├── tophat.sh
├── cufflinks.sh
├── data/
│   └── [...all data...]
├── alignments/
│   ├── sample1.bam
│   ├── sample2.bam
│   ├── ...
│   ├── sampleN.bam
│   ├── [...all other TopHat output...]
│   └── scripts/
│       └── [...all sample-specific TopHat scripts...]
├── assemblies/
│   ├── sample1_transcripts.gtf
│   ├── sample2_transcripts.gtf
│   ├── ...
│   ├── sampleN_transcripts.gtf
│   ├── sample1/
│   │   └── [...other Cufflinks output...]
│   ├── sample2/
│   │   └── [...other Cufflinks output...]
│   ├── ...
│   ├── sampleN/
│   │   └── [...other Cufflinks output...]
│   ├── scripts/
│   │   ├── cufflinks_sample1.sh
│   │   ├── cufflinks_sample2.sh
│   │   ├── ...
│   │   └── cufflinks_sampleN.sh
│   └── merged/
├── DE_analysis/
└── reference/
    └── Homo_sapiens/
And the assembly step is done!
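If you want to see the GTF structure described in the step-2 overview for yourself, you can poke at one of the sample-specific assemblies right from the shell:

# first few rows: seqname, source, feature, start, end, score, strand, frame, attributes
head -n 5 /ProjectName/assemblies/sample1_transcripts.gtf
# rough count of the transcripts Cufflinks assembled for sample 1
awk '$3 == "transcript"' /ProjectName/assemblies/sample1_transcripts.gtf | wc -l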
Step 3: Merge the assemblies with Cuffmerge

input: the sample-specific assemblies, in .gtf format
output: one merged assembly for the entire experiment, in .gtf format
task: intelligently combine the sample-specific assemblies into one main transcriptome, which will represent our estimate of the transcript structures present in this particular experiment. Statistical analysis will be performed on the merged assembly.
run how many times: once for the entire project
Compared to the other steps, this one is pretty short! First, create a plain-text file listing the path of each sample-specific assembly, one per line. I always call this file assemblies.txt and stick it in the assemblies subdirectory. For this project, the assemblies.txt file would look like this:
/ProjectName/assemblies/sample1_transcripts.gtf
/ProjectName/assemblies/sample2_transcripts.gtf
...
/ProjectName/assemblies/sampleN_transcripts.gtf
List each file out explicitly - the ... is just for illustrative purposes here :)
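That said, if your files really are named this uniformly, you can generate the list with a quick loop instead of typing every path (same caveat as the master scripts: replace N with your number of samples):

> /ProjectName/assemblies/assemblies.txt # start with an empty file
for SAMPLE_ID in {1..N}
do
echo "/ProjectName/assemblies/sample${SAMPLE_ID}_transcripts.gtf" >> /ProjectName/assemblies/assemblies.txt
done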
Once you make assemblies.txt, you just need to run cuffmerge on that assembly list, with a few other parameters. You'll use a file from your reference. Here is my cuffmerge.sh script, which I put in the main project directory:
#!/bin/sh
#$ -cwd -l mf=20G,h_vmem=5G -pe local 4 -m e -M [email protected]
REFERENCE_SEQ=/ProjectName/reference/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa
CUFFMERGE_BINARY=/path/to/cuffmerge/binary/cuffmerge
P=4
$CUFFMERGE_BINARY -s $REFERENCE_SEQ -p $P -o /ProjectName/assemblies/merged /ProjectName/assemblies/assemblies.txt
Then I just do:
[afrazee@enigma2 ProjectName]$ qsub cuffmerge.sh
Your job "cuffmerge.sh" has been submitted
A fun trick: when a line in a bash script is prefixed with #$, qsub will parse the rest of that line as command-line arguments. So the second line of the cuffmerge.sh script takes care of all my memory requests and whatnot, and I don't have to worry about them when I actually run my qsub command.
The output, merged.gtf, will be written to the merged folder specified by -o. Again, there are several other options, including one to do a guided merge (merging the assembled transcripts with annotated transcripts), detailed in the manual -- but this is my general template.
The project directory now looks like:
ProjectName/
├── tophat.sh
├── cufflinks.sh
├── cuffmerge.sh
├── data/
│   └── [...all data...]
├── alignments/
│   ├── sample1.bam
│   ├── sample2.bam
│   ├── ...
│   ├── sampleN.bam
│   ├── [...all other TopHat output...]
│   └── scripts/
│       └── [...all sample-specific TopHat scripts...]
├── assemblies/
│   ├── sample1_transcripts.gtf
│   ├── sample2_transcripts.gtf
│   ├── ...
│   ├── sampleN_transcripts.gtf
│   ├── [...all other Cufflinks output...]
│   ├── scripts/
│   │   └── [...all sample-specific Cufflinks scripts...]
│   └── merged/
│       ├── merged.gtf
│       └── logs/
├── DE_analysis/
└── reference/
    └── Homo_sapiens/
Congratulations! You now have a transcriptome assembly (merged.gtf) for your experiment, so you can do differential expression analysis (I mean, we already made a DE_analysis folder...it's just sitting there, waiting to be filled with statistics!). Or you can visualize your assembly. Or you can discover new isoforms. Or any number of other awesome things! The world of transcript analysis is your oyster. Have fun :)