Immanuel Purushothaman ipurusho

###Download STAR### Obtain STAR source from https://github.com/alexdobin/STAR

Add the following to your .bashrc file and source it: export PATH=/path/to/STAR/bin/:$PATH

###Generate Reference Genome

In order to perform a pairwise splicing analysis, sorted bam files per sample are needed as well as reference annotation in the form of a GTF file

###Download rMATS Download rMATS at http://rnaseq-mats.sourceforge.net/download.html

###Running rMATS Via python, you can run rMATs using the following example command:

python /path/to/rMATS/directory/RNASeq-MATS.py -b1 conditon1_replicate_1.bam,condition_1_replicate_2.bam -b2 condition_2_replicate_1.bam,condition_2_replicate_2.bam -gtf /path/to/gtf/file -o /path/to/results/directory -t single -len {read length}

###Download R scripts

Download the following R script which contains necessary functions: https://gist.github.com/ipurusho/64d6dbf1a32aa7c83f665785155e494b

Then, load the covariate_contribution.R script into your R environment.

###Performing Covariate Analysis Once you have obtained a variance stabilized count matrix and organized your meta data (meta data rows should correspond to VST columns), you can use the top.var function (now in your R environment) to filter for the top 500 (or more) variable genes for input into PCA.

###Download modified RRHO scripts Obtain the modified R scripts from https://gist.github.com/ipurusho/8e4beef4fb51c2485c9f and source it in your R environment.

###Prepare the data Create two separate matrices, one for P value and one for log fold change from your differential analysis. The rows should be genes and the columns should be the comparisons (column order should be consistent between the two matrices). For example:

P.vals = as.data.frame(cbind(Comparison1 = Comparison1[intersect.genes,"P.Value"],
                             Comparison2 = Comparison2[intersect.genes,"P.Value"],
 Comparison3 = Comparison3[intersect.genes,"P.Value"]))

tSNE

Download the following script: https://gist.github.com/ipurusho/44e06d43aab0a7dd2641589a4fd3351c

In R, write the variance stabilized values per sample, subsetting for the top 500 variable genes, to a file #without# row and column labels. You can then use the tsne.py script as follows:

python /path/to/tSNE.py /path/to/tsne_input_vsd.csv 30 /path/to/output_file.csv

Where 30 is the perplexity value, which is dependent on sample size for optimum output (see documentation). The output file will contain two columns, dimension 1, dimension 2. The rows correspond to the Samples which will be in the same order as the input vsd columns. You can load this data back into R and visualize it using your preferred method.

	import sys
	import os
	import re
	import glob


	#sys.argv[1] = action
	#sys.argv[2] = top level folder
	#sys.argv[3] = read
	def subdirectories(args):

	top.var<-function (x, ntop = 500)
	{
	require("genefilter")
	rv = rowVars(x)
	select = order(rv, decreasing = TRUE)[seq_len(ntop)]
	topvar = x[select, ]

	return(topvar)
	}

	### "transcript" parameter is the original matrix of normalized counts
	### "signifcant" parameter is the filtered matrix of significant genes (using desired cutoff)
	### "email" is an email address needed to run the web service

	go.david<-function(transcript,significant,email){
	require("RDAVIDWebService")
	if(dim(significant)[1] !=0){
	david<-DAVIDWebService$new(email=email)
	background<-addList(david, rownames(transcript), idType="ENSEMBL_GENE_ID",listName="full_transcript", listType="Background")
	sig.genes<-addList(david, rownames(significant), idType="ENSEMBL_GENE_ID",listName="significant", listType="Gene")

	import sys
	import numpy as np
	from sklearn.manifold import TSNE


	my_data = np.genfromtxt(str(sys.argv[1]), delimiter=',')

	model = TSNE(n_components=2, random_state=0,perplexity=int(sys.argv[2]))
	tsne = model.fit_transform(np.transpose(my_data))
	np.savetxt(str(sys.argv[3]), tsne, delimiter=",")