cbare · October 11, 2012 12:08
diff --git a/predictive_modeling_ccle.Rmd b/predictive_modeling_ccle.Rmd
 Predictive Modeling: Drug Response in Cancer Cell Lines
 ======================================================= 

 This is a demo of **Knitr**, an R package for authoring executable documents, documents that mix formatted text, source code and graphical output. I've used In Sock's demos of drug response in [CCLE][1] data, but I've probably gotten most of the analysis mixed up. Apologies to In Sock.

 Analysis of cancer cell lines for drug sensitity using:
  * [Cancer Cell Line Encyclopedia][1]
  * [Gene Set Enrichment Analysis][2]

 Workflow
 --------
 The predictive modeling workflow is composed of three consecutive steps:
  1. Applying predictive models
  2. Bootstrapping feature selection
  3. Analysis with selected features

 Several predictive models (Lasso, Elastic Net, Ridge, SVM, PCR, PLS, SPLS, etc.) are trained and cross-validated. Bootstrapping selects features that are biologically relevant with respect to prior knowledge such as pathways. Gold standard predictors and gene-set enrichment analysis confirm selected features against KEGG, BIOCARTA, REACTOME and GO term enrichment.

 ![predictive modeling workflow](figures/workflow.png)

 Load data from Synapse
 ----------------------

 First, we load the [synapseClient][3] package and log in.

 ```{r results='hide', message=FALSE, warning=FALSE}
 library(RJSONIO)
 library(synapseClient)
 ```

 ```{r synapseLogin}
 synapseLogin('[email protected]','san juan island')
 ```

 Next, we load several datasets from Synapse from the Cancer Cell Line Project - (syn170143). This project holds datasets for gene expression, copy number variation, mutation and drug sensitivity.

 ```{r download-data-from-synapse}
 # copy number variation
 cnv_entity <- loadEntity('syn269019')
 cnv <- cnv_entity$objects$eSet_copy

 # mutation data
 oncomap_entity <- loadEntity('syn269021')
 oncomap <- oncomap_entity$objects$eSet_oncomap

 # gene expression
 expr_entity <- loadEntity('syn269056')
 expr <- expr_entity$objects$eSet_expr

 # drug
 drug_entity <- loadEntity('syn269024')
 adf_drug <- drug_entity$objects$adf_drug
 ```
 Preprocessing
 -------------

 ```{r results='hide', message=FALSE}
 require(predictiveModeling)
 ```

 ```{r source-insock-code, echo=FALSE, results='hide', message=FALSE}
 source("~/COMPBIO/trunk/users/jang/R5/myEnetModel.R")
 source("~/COMPBIO/trunk/users/jang/R5/bootstrapPredictiveModel.R")
 ```

 Let's concatenate the molecular feature dataset(exp, cnv, and mut), filter out 'NA's and match the sample names with drug data. The we'll filter out redundant and low variance data and scale the remainder.

 ```{r preprocess, tidy=FALSE, eval=FALSE}
 # combine all features into one matrix
 featureData <- createAggregateFeatureDataSet(list(expr=expr, copy=cnv, mut=oncomap))

 # filter rows with NAs
 featureData_filtered <- filterNasFromMatrix(featureData, filterBy = "rows")
 dataSets_ccle <- createFeatureAndResponseDataList(t(featureData_filtered),adf_drug)

 # pick just the first drug
 kk=1

 # to make computation fast, we prefilter observation matrix with following thresholds
 filteredData <- filterPredictiveModelData(dataSets_ccle$featureData,
                                          dataSets_ccle$responseData[,kk,drop=FALSE],
                                          featureVarianceThreshold = 0.01,
                                          corPValThresh = 0.1)

 # In CNV, there might be redundant features thanks to sharing the same position
 filteredFeatureData  <- filteredData$featureData
 filteredFeatureData  <- unique(filteredFeatureData, MARGIN=2)
 filteredResponseData <- filteredData$responseData

 ## scale these data   
 filteredFeatureDataScaled <- scale(filteredFeatureData)
 filteredResponseDataScaled <- scale(filteredResponseData) 
 ```

 Predictive Modeling
 -------------------
 Here, we'll make use of [foreach and doMC][4] to spread the boostrapping runs over multiple cores.

 ```{r results='hide', message=FALSE}
 require(foreach)
 require(doMC)
 registerDoMC()
 ```

 We'll do 100 runs of elastic net (which takes a long time!). Elastic net works like this:

 $latex \beta = \underset{\beta}{\operatorname{argmax}}(\left \| y-X\beta \right \|^2 + \lambda (\alpha \left \| \beta \right \|_2^2 + (1-\alpha) \left \| \beta \right \|_1))$


 ```{r bootstrap, eval=FALSE}
 # For Elastic Net, we set two parameters: alpha and lambda
 alphas <- unique(createENetTuneGrid()[,1])
 lambdas <- createENetTuneGrid(alphas = 1)[,2]

 # perform 100 bootstrapping runs
 resultsModel <-bootstrapPredictiveModel(filteredFeatureDataScaled,
                                        filteredResponseDataScaled,
                                        model = myEnetModel$new(),
                                        numBootstrap=10,
                                        alpha=alphas, lambda = lambdas)
 ```

 Select predictors that are consistently highly ranked over the bootstrapping runs.

 ```{r totally-cheating, echo=FALSE, results='hide', message=FALSE}
 e <- loadEntity('syn1437787')
 resultsModel <- e$objects$resultsModel
 ```

 ```{r rank-predictors}
 # Rank predictors and combine average over bootstrapping runs
 RankTotals<-c()
 for(k in 1:length(resultsModel)){    
  RankTotals<-cbind(RankTotals,rank(abs(as.numeric(resultsModel[[k]][-1])),ties.method="min")/length(resultsModel[[k]]))
 }

 # select top influences
 rownames(RankTotals)<-rownames(resultsModel[[k]])[-1]  
 reference <- apply(RankTotals,1,sum)        
 X<-sort(reference, decreasing =T, index.return =T)
 sigCoefName<-names(reference)[X$ix]
 ```
 Let's look at the distribution of rankings among the predictors.

 ```{r histogram}
 hist(reference, main="Ranking predictors", xlab="predictor rank averaged over 100 bootstrapping runs", col="light grey")
 ```

 Every analysis in which gene expression plays a part must contain at least one heatmap. This is a very important rule which we don't want to break. So, here it is:

 ```{r heatmap}
 genes <- unique(gsub(pattern="[_expr|_copy|_mut]",replacement="", x=sigCoefName[1:100]))
 genes <- intersect(genes, rownames(assayData(expr)$exprs))
 m <- (assayData(expr)$exprs)[genes,]
 heatmap(m, col=topo.colors(100), cexRow=0.4)
 ```

 Conclusion
 ----------

 Knitr is cool!

  * mixes formatted text, code and graphics
  * outputs HTML or PDF.
  * produces documents programmatically, which means they can be shared, re-run and versioned.
  * works with markdown, which I'm hoping will mesh nicely with the wiki features coming to Synapse.
  * still has some kinks. I had issues with Latex equation rendering and debugging is hard.
  * long running analyses should be pre-computed and stored in Synapse.


 [1]: http://www.broadinstitute.org/software/cprg/?q=node/11 "CCLE"
 [2]: http://www.broadinstitute.org/gsea/ "GSEA"
 [3]: https://sagebionetworks.jira.com/wiki/display/SYNR/Home
 [4]: http://cran.r-project.org/web/packages/doMC/vignettes/gettingstartedMC.pdf
	Predictive Modeling: Drug Response in Cancer Cell Lines
	=======================================================

	This is a demo of Knitr, an R package for authoring executable documents, documents that mix formatted text, source code and graphical output. I've used In Sock's demos of drug response in [CCLE][1] data, but I've probably gotten most of the analysis mixed up. Apologies to In Sock.

	Analysis of cancer cell lines for drug sensitity using:
	* [Cancer Cell Line Encyclopedia][1]
	* [Gene Set Enrichment Analysis][2]

	Workflow
	--------
	The predictive modeling workflow is composed of three consecutive steps:
	1. Applying predictive models
	2. Bootstrapping feature selection
	3. Analysis with selected features

	Several predictive models (Lasso, Elastic Net, Ridge, SVM, PCR, PLS, SPLS, etc.) are trained and cross-validated. Bootstrapping selects features that are biologically relevant with respect to prior knowledge such as pathways. Gold standard predictors and gene-set enrichment analysis confirm selected features against KEGG, BIOCARTA, REACTOME and GO term enrichment.

	![predictive modeling workflow](figures/workflow.png)

	Load data from Synapse
	----------------------

	First, we load the [synapseClient][3] package and log in.

	```{r results='hide', message=FALSE, warning=FALSE}
	library(RJSONIO)
	library(synapseClient)
	```

	```{r synapseLogin}
	synapseLogin('[email protected]','san juan island')
	```

	Next, we load several datasets from Synapse from the Cancer Cell Line Project - (syn170143). This project holds datasets for gene expression, copy number variation, mutation and drug sensitivity.

	```{r download-data-from-synapse}
	# copy number variation
	cnv_entity <- loadEntity('syn269019')
	cnv <- cnv_entity$objects$eSet_copy

	# mutation data
	oncomap_entity <- loadEntity('syn269021')
	oncomap <- oncomap_entity$objects$eSet_oncomap

	# gene expression
	expr_entity <- loadEntity('syn269056')
	expr <- expr_entity$objects$eSet_expr

	# drug
	drug_entity <- loadEntity('syn269024')
	adf_drug <- drug_entity$objects$adf_drug
	```
	Preprocessing
	-------------

	```{r results='hide', message=FALSE}
	require(predictiveModeling)
	```

	```{r source-insock-code, echo=FALSE, results='hide', message=FALSE}
	source("~/COMPBIO/trunk/users/jang/R5/myEnetModel.R")
	source("~/COMPBIO/trunk/users/jang/R5/bootstrapPredictiveModel.R")
	```

	Let's concatenate the molecular feature dataset(exp, cnv, and mut), filter out 'NA's and match the sample names with drug data. The we'll filter out redundant and low variance data and scale the remainder.

	```{r preprocess, tidy=FALSE, eval=FALSE}
	# combine all features into one matrix
	featureData <- createAggregateFeatureDataSet(list(expr=expr, copy=cnv, mut=oncomap))

	# filter rows with NAs
	featureData_filtered <- filterNasFromMatrix(featureData, filterBy = "rows")
	dataSets_ccle <- createFeatureAndResponseDataList(t(featureData_filtered),adf_drug)

	# pick just the first drug
	kk=1

	# to make computation fast, we prefilter observation matrix with following thresholds
	filteredData <- filterPredictiveModelData(dataSets_ccle$featureData,
	dataSets_ccle$responseData[,kk,drop=FALSE],
	featureVarianceThreshold = 0.01,
	corPValThresh = 0.1)

	# In CNV, there might be redundant features thanks to sharing the same position
	filteredFeatureData <- filteredData$featureData
	filteredFeatureData <- unique(filteredFeatureData, MARGIN=2)
	filteredResponseData <- filteredData$responseData

	## scale these data
	filteredFeatureDataScaled <- scale(filteredFeatureData)
	filteredResponseDataScaled <- scale(filteredResponseData)
	```

	Predictive Modeling
	-------------------
	Here, we'll make use of [foreach and doMC][4] to spread the boostrapping runs over multiple cores.

	```{r results='hide', message=FALSE}
	require(foreach)
	require(doMC)
	registerDoMC()
	```

	We'll do 100 runs of elastic net (which takes a long time!). Elastic net works like this:

	$latex \beta = \underset{\beta}{\operatorname{argmax}}(\left \\| y-X\beta \right \\|^2 + \lambda (\alpha \left \\| \beta \right \\|_2^2 + (1-\alpha) \left \\| \beta \right \\|_1))$


	```{r bootstrap, eval=FALSE}
	# For Elastic Net, we set two parameters: alpha and lambda
	alphas <- unique(createENetTuneGrid()[,1])
	lambdas <- createENetTuneGrid(alphas = 1)[,2]

	# perform 100 bootstrapping runs
	resultsModel <-bootstrapPredictiveModel(filteredFeatureDataScaled,
	filteredResponseDataScaled,
	model = myEnetModel$new(),
	numBootstrap=10,
	alpha=alphas, lambda = lambdas)
	```

	Select predictors that are consistently highly ranked over the bootstrapping runs.

	```{r totally-cheating, echo=FALSE, results='hide', message=FALSE}
	e <- loadEntity('syn1437787')
	resultsModel <- e$objects$resultsModel
	```

	```{r rank-predictors}
	# Rank predictors and combine average over bootstrapping runs
	RankTotals<-c()
	for(k in 1:length(resultsModel)){
	RankTotals<-cbind(RankTotals,rank(abs(as.numeric(resultsModel[[k]][-1])),ties.method="min")/length(resultsModel[[k]]))
	}

	# select top influences
	rownames(RankTotals)<-rownames(resultsModel[[k]])[-1]
	reference <- apply(RankTotals,1,sum)
	X<-sort(reference, decreasing =T, index.return =T)
	sigCoefName<-names(reference)[X$ix]
	```
	Let's look at the distribution of rankings among the predictors.

	```{r histogram}
	hist(reference, main="Ranking predictors", xlab="predictor rank averaged over 100 bootstrapping runs", col="light grey")
	```

	Every analysis in which gene expression plays a part must contain at least one heatmap. This is a very important rule which we don't want to break. So, here it is:

	```{r heatmap}
	genes <- unique(gsub(pattern="[_expr\|_copy\|_mut]",replacement="", x=sigCoefName[1:100]))
	genes <- intersect(genes, rownames(assayData(expr)$exprs))
	m <- (assayData(expr)$exprs)[genes,]
	heatmap(m, col=topo.colors(100), cexRow=0.4)
	```

	Conclusion
	----------

	Knitr is cool!

	* mixes formatted text, code and graphics
	* outputs HTML or PDF.
	* produces documents programmatically, which means they can be shared, re-run and versioned.
	* works with markdown, which I'm hoping will mesh nicely with the wiki features coming to Synapse.
	* still has some kinks. I had issues with Latex equation rendering and debugging is hard.
	* long running analyses should be pre-computed and stored in Synapse.


	[1]: http://www.broadinstitute.org/software/cprg/?q=node/11 "CCLE"
	[2]: http://www.broadinstitute.org/gsea/ "GSEA"
	[3]: https://sagebionetworks.jira.com/wiki/display/SYNR/Home
	[4]: http://cran.r-project.org/web/packages/doMC/vignettes/gettingstartedMC.pdf