Expression Quantification

Intro

We have now generated, assessed and annotated our de-novo transcriptome assembly. Yay! So let's start using it.

The trinity wiki describes many of the downstream analyses that can be conducted and provides tools for doing so.

But, presumably, one of the first steps will be to use it to quantify the expression of the different samples we used to build our assembly.

For that we will use a pseuso-aligner, namely salmon.

Salmon is a pseudo-aligner, a quantification tool of the latest generation. It allows for the probabilistic calculation of transcripts (splicing isoforms) abundance. It is fast and accurate, but a drawback is it relies on an existing transcriptome reference.

Aim

You should be used to it by now, check the webpage as well as salmon -h to devise the command line.

Hints

we did not perform an aligment
the index was generated for you, you will find it in /data/raw_data/reference/indices/salmon/zygotic-embryogenesis-expression-filtered-transcript_salmon-v1-8-0
use the trimmed data you generated
be curious and look at any options that might be relevant in the help page
our data is strand specific, inward facing and used the classical Illumina stranded protocol that generates reads on the reverse strand.

Solution

cd
mkdir salmon
salmon quant -l A \
-i /data/raw_data/reference/indices/salmon/\
zygotic-embryogenesis-expression-filtered-transcript_salmon-v1-8-0 \
-1 trimmomatic/P11562_111_S10_L001_subset_sortmerna_trimmomatic_1.fq.gz \
-2 trimmomatic/P11562_111_S10_L001_subset_sortmerna_trimmomatic_2.fq.gz \
-o salmon/P11562_111_S10_L001_subset_sortmerna_trimmomatic --seqBias --gcBias \
--posBias -p 8 --numGibbsSamples 30

Appendix

Index generation

Obviously, creating the index from the small assembly we generated would not have been very helpful. Rather, we created it from our real assembly.

For this workshop

I actually took a transcriptome I have built from a set of 1930 RNA-Seq samples. That set was normalised for kmer content (using jellyfish and KAT) prior to being submitted to trinity. I selected only the transcripts that aligned uniquely to the genome using GMAP (this is arbitrary and one should keep the transcripts mapping in multiple locations. Here the rationale was to bring down the number of genes. The multimapping transcripts are more likely to be transposable elements (LTR - Long Terminal Repeats). I further reducted the number of transcripts in the assembly by selecting those that are expressed in the zygotic embryogenesis study (I ran salmon against the transcriptome above on the diginorm data from the study and filtered away all transcripts without expression). That final set of transcripts (about 260,000 coming from about 120,000 genes) I then used to create the salmon index using: salmon index -t zygotic-embryogenesis-expression-filtered-transcript.fa -i zygotic-embryogenesis-expression-filtered-transcript_salmon-v1-8-0

In reality

We used an ab-initio set of tools (Augustus and Eugene) on the genome using multiple evidences (proteins, RNA-Seq, etc.) to predict gene models. We then ran trinity in genome guided mode on our RNA-Seq data to create the transcripts. We added some repeat elements as these are also expressed (LTR - Long Terminal Repeats) and that become our transcriptome: Pabies1.0-all-phase.gff3.CDSandLTR-TE.fasta.

Taking that transcript file, we simply ran salmon index -t Pabies1.0-all-phase.gff3.CDSandLTR-TE.fasta -i Pabies1.0-all-phase.gff3.CDSandLTR-TE_salmon-version-1dot9dot0 on it.

Salmon can actually use the genomic sequence as a decoy, check out how to there. We did not for spruce in the currently public assembly as it has > 10,000,000 scaffolds and salmon fails using that (memory issues). It is actually not any better with our new reference that has the 12 chromosomes, but a highly repetitive genome (80% transposable elements) and is 6 times the size of the human genome (19.6 Gbp).

Trimmed data

If you do not have the trimmed data, you can use the raw data located in /data/raw_data/illumina_pe/

cd
mkdir trimmomatic
cd trimommatic
ln -s /data/raw_data/illumina_pe/P11562_111_S10_L001_subset_1,2].fq.gz .
cd

Then proceed with the command above (edit the -1 and -2 arguments.

nicolasDelhomme/salmon.md