PS3.md

GENE 760 – Problem Set 3

The purpose of this problem set is to familiarize students with the analysis of mRNA-seq data. By the end of this problem set, you will have learned how to use:

STAR to map spliced RNA-seq reads to a genome
HTSeq to quantify gene expression as read counts
DESeq2 to perform differential expression analysis
DAVID and EnrichR to perform gene ontology analysis on a list of genes

Students are to submit a gzipped tarball called <Your_NetID>_PS3.tar.gz containing:

<Your_NetID>_PS3_answers.md - Answers to the questions below
<Your_NetID>_filter_counts.py - Commented Python script for filtering read counts
<Your_NetID>_filtered_genes.txt - list of genes from Python filtering step
<Your_NetID>_DEseq2.R - Commented R script used to perform differential expression analysis
<Your_NetID>_2wk_heatmap.png
<Your_NetID>_6wk_heatmap.png
<Your_NetID>_parseGTF.py - Commented Python script for mapping gene IDs to transcriptIDs

The tarball should be submitted to /ysm-gpfs/project/gene760/gene760_2019/DROPBOX/PS3/ by 9AM on Tuesday, March26

Remember to run commands and process all data in your project folder.

To save space, delete or compress (gzip) any files you no longer need, or move gzipped files/directories onto scratch.

For this problem set, you will be working with data from Gjoneska et al., Nature 518:365-369 (2015).

Datasets are located in /ysm-gpfs/project/gene760/gene760_2019/DATA/PS3

In this directory, you will find 12 gzipped FASTQ sequence files:

Control samples(2 week timepoint):
- CK_2wk_1_R1.fastq.gz, CK_2wk_1_R2.fastq.gz
- CK_2wk_2_R1.fastq.gz, CK_2wk_2_R2.fastq.gz
- CK_2wk_3_R1.fastq.gz, CK_2wk_3_R2.fastq.gz
Treatment samples(2 week timepoint):
- CKp25_2wk_1_R1.fastq.gz, CKp25_2wk_1_R2.fastq.gz
- CKp25_2wk_2_R1.fastq.gz, CKp25_2wk_2_R2.fastq.gz
- CKp25_2wk_3_R1.fastq.gz, CKp25_2wk_3_R2.fastq.gz

Pre-built genome index files for mm10 are located in:

/ysm-gpfs/project/gene760/gene760_2019/REFERENCE/mm10/STAR_index/

A transcript annotation file in GTF format is located at:

/ysm-gpfs/project/gene760/gene760_2019/ANNOTATION/gencode.vM20.annotation.gtf

STAR and HTSeq

Note: Gjoneska et al. generate strand-specific RNA-seq reads using Illumina’s TruSeq Stranded Total RNA prep kit (see Methods)—i.e. only the strand from first strand synthesis is sequenced.

Question 1

Use STAR to align RNA-seq reads for the treatment and control samples from the 2-week time point. Report alignments with no more than 2 mismatches. Use the genome directory provided in the REFERENCE directory and output a sorted BAM file.

1a: Write the commands you used and a description of what each command does, including the meaning of any arguments.


Please write all answers (code and comments) within code fences

Note: Code fences are a pair of three backticks: "```"

1b: What are the contents of the genome directory, and what does STAR use them for?


Write answer within code fences

Question 2

Use HTSeq to quantify gene expression for each sample. HTSeq uses alignments generated by STAR to calculate transcript abundances. A tutorial on how to use the htseq-count script is here: http://htseq.readthedocs.io/en/master/count.html. Use the transcript annotation file provided.

** Note: For paired-end data, the alignment must first be sorted either by read name or by alignment position. By default, htseq-count expects alignments to be sorted by name.**

2a: Write the commands you used and a description of what each command does.


Write answer (code and comments) within code fences

2b: In your own words, describe the logic HTSeq uses to assign reads to genes (in union mode).


Write answer within code fences

2c: What are the advantages of stranded RNA-seq reads for quantifying gene expression?


Write answer within code fences

For the following questions, use reads counts from the 2-week time point (generated above) as well as counts from the 6-week time point. Read counts for control and treatment samples from the 6-week time point are located in /ysm-gpfs/project/gene760/gene760_2019/DATA/PS3:

CK_6wk_1_gencodeM20_count.txt
CK_6wk_2_gencodeM20_count.txt
CK_6wk_3_gencodeM20_count.txt
CKp25_6wk_1_gencodeM20_count.txt
CKp25_6wk_2_gencodeM20_count.txt
CKp25_6wk_3_gencodeM20_count.txt

Question 3

Filter read counts with Python

When performing differential expression analysis, we are only interested in expressed genes. Write a python script that will output a directory containing the following (13 in total): a list of genes with at least 20 reads in at least one of the twelve samples (2wk and 6wk time points) and the read counts of the filtered gene list for each sample.

You should use argparse to take the following command line arguments:

<Your_NetID>_filter_counts.py -I <input_file_directory> -O <output_file_directory>

3a: Save your script as <Your_NetID>_ filter_counts.py

3b: How many genes passed this filtration step? Upload your gene list as <Your_NetID>_filtered_genes.txt.


Write answer within code fences

Question 4

**DEseq2 utilizes read counts from HTSeq to build statistical models of gene expression and identify differential expressed genes. An extensive description of how DEseq2 works can be found here: http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html. **

Perform differential expression analysis for the 2-week and 6-week time points independently using the filtered read counts generated in question 3.

4a: Save the R script containing all the commands you used to perform differential expression analysis as <Your_NetID>_DEseq2.R.

4b: How many differentially expressed genes with an adjusted p-value less than 0.01 are identified at 2 weeks? And at 6 weeks?


Write answer within code fences

4c: Use R to generate heatmaps showing the top differentially expressed genes for each time point. Save the plots as <Your_NetID>_2wk_heatmap.png and <Your_NetID>_6wk_heatmap.png

Question 5

Using the DEseq2 output you generated above, determine the number and identity of genes specifically upregulated in each time point (upregulated at one time point but not the other).

5a: How many genes are significantly upregulated at a log2(FC) > 1? Save these to a text file for each timepoint (for use in question 7).


Write answer within code fences

5b: Write the commands you used to determine this and a description of what the command does.


Write answer within code fences

5c: How many exclusive upregulated genes are identified at 2 weeks? And at 6 weeks?


Write answer within code fences

5d: DESeq reports the probability that a gene is differentially expressed based on fold-change and level of expression. Why is incorporating the level of expression into the model important?


Write answer within code fences

Question 6.

** Write a Python script to map the Gene IDs/names to transcript IDs by parsing the provided GTF file. Your script must use argparse to take inputs from the command line and output a list of transcript IDs to the output file:**

--gtf <input_GTF_file> --genes <input_gene_file> -O <output_file_name>

6a: Save your script as _<Your_NetID>parseGTF.py

6b: What is the benefit of translating GeneIDs to transcript IDs?


Write answer within code fences

Question 7.

Functional annotation and gene ontology analysis with DAVID and EnrichR

Use DAVID (http://david.ncifcrf.gov) and EnrichR (http://amp.pharm.mssm.edu/Enrichr/) to identify GO Biological Process enrichments for genes upregulated at each timepoint.

7a: What are the top 10 GO_BP_FAT terms from DAVID for 2 and 6-week time points?


Write answer within code fences

7b: What are the top 10 GO Biological Process 2018 terms from EnrichR? How do the results from these methods compare?


Write answer within code fences

7c: What biological processes do your results implicate in Alzheimer’s disease? (5-6 sentences)


Write answer within code fences

7d: How do your results compare to those found in Gjoneska et al.? (1-2 sentences)


Write answer within code fences

dburkhardt/PS3.md