Created
October 11, 2012 12:08
-
-
Save cbare/3871900 to your computer and use it in GitHub Desktop.
Predictive modeling demo Knitr-ized
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Predictive Modeling: Drug Response in Cancer Cell Lines | |
======================================================= | |
This is a demo of **Knitr**, an R package for authoring executable documents, documents that mix formatted text, source code and graphical output. I've used In Sock's demos of drug response in [CCLE][1] data, but I've probably gotten most of the analysis mixed up. Apologies to In Sock. | |
Analysis of cancer cell lines for drug sensitity using: | |
* [Cancer Cell Line Encyclopedia][1] | |
* [Gene Set Enrichment Analysis][2] | |
Workflow | |
-------- | |
The predictive modeling workflow is composed of three consecutive steps: | |
1. Applying predictive models | |
2. Bootstrapping feature selection | |
3. Analysis with selected features | |
Several predictive models (Lasso, Elastic Net, Ridge, SVM, PCR, PLS, SPLS, etc.) are trained and cross-validated. Bootstrapping selects features that are biologically relevant with respect to prior knowledge such as pathways. Gold standard predictors and gene-set enrichment analysis confirm selected features against KEGG, BIOCARTA, REACTOME and GO term enrichment. | |
 | |
Load data from Synapse | |
---------------------- | |
First, we load the [synapseClient][3] package and log in. | |
```{r results='hide', message=FALSE, warning=FALSE} | |
library(RJSONIO) | |
library(synapseClient) | |
``` | |
```{r synapseLogin} | |
synapseLogin('[email protected]','san juan island') | |
``` | |
Next, we load several datasets from Synapse from the Cancer Cell Line Project - (syn170143). This project holds datasets for gene expression, copy number variation, mutation and drug sensitivity. | |
```{r download-data-from-synapse} | |
# copy number variation | |
cnv_entity <- loadEntity('syn269019') | |
cnv <- cnv_entity$objects$eSet_copy | |
# mutation data | |
oncomap_entity <- loadEntity('syn269021') | |
oncomap <- oncomap_entity$objects$eSet_oncomap | |
# gene expression | |
expr_entity <- loadEntity('syn269056') | |
expr <- expr_entity$objects$eSet_expr | |
# drug | |
drug_entity <- loadEntity('syn269024') | |
adf_drug <- drug_entity$objects$adf_drug | |
``` | |
Preprocessing | |
------------- | |
```{r results='hide', message=FALSE} | |
require(predictiveModeling) | |
``` | |
```{r source-insock-code, echo=FALSE, results='hide', message=FALSE} | |
source("~/COMPBIO/trunk/users/jang/R5/myEnetModel.R") | |
source("~/COMPBIO/trunk/users/jang/R5/bootstrapPredictiveModel.R") | |
``` | |
Let's concatenate the molecular feature dataset(exp, cnv, and mut), filter out 'NA's and match the sample names with drug data. The we'll filter out redundant and low variance data and scale the remainder. | |
```{r preprocess, tidy=FALSE, eval=FALSE} | |
# combine all features into one matrix | |
featureData <- createAggregateFeatureDataSet(list(expr=expr, copy=cnv, mut=oncomap)) | |
# filter rows with NAs | |
featureData_filtered <- filterNasFromMatrix(featureData, filterBy = "rows") | |
dataSets_ccle <- createFeatureAndResponseDataList(t(featureData_filtered),adf_drug) | |
# pick just the first drug | |
kk=1 | |
# to make computation fast, we prefilter observation matrix with following thresholds | |
filteredData <- filterPredictiveModelData(dataSets_ccle$featureData, | |
dataSets_ccle$responseData[,kk,drop=FALSE], | |
featureVarianceThreshold = 0.01, | |
corPValThresh = 0.1) | |
# In CNV, there might be redundant features thanks to sharing the same position | |
filteredFeatureData <- filteredData$featureData | |
filteredFeatureData <- unique(filteredFeatureData, MARGIN=2) | |
filteredResponseData <- filteredData$responseData | |
## scale these data | |
filteredFeatureDataScaled <- scale(filteredFeatureData) | |
filteredResponseDataScaled <- scale(filteredResponseData) | |
``` | |
Predictive Modeling | |
------------------- | |
Here, we'll make use of [foreach and doMC][4] to spread the boostrapping runs over multiple cores. | |
```{r results='hide', message=FALSE} | |
require(foreach) | |
require(doMC) | |
registerDoMC() | |
``` | |
We'll do 100 runs of elastic net (which takes a long time!). Elastic net works like this: | |
$latex \beta = \underset{\beta}{\operatorname{argmax}}(\left \| y-X\beta \right \|^2 + \lambda (\alpha \left \| \beta \right \|_2^2 + (1-\alpha) \left \| \beta \right \|_1))$ | |
```{r bootstrap, eval=FALSE} | |
# For Elastic Net, we set two parameters: alpha and lambda | |
alphas <- unique(createENetTuneGrid()[,1]) | |
lambdas <- createENetTuneGrid(alphas = 1)[,2] | |
# perform 100 bootstrapping runs | |
resultsModel <-bootstrapPredictiveModel(filteredFeatureDataScaled, | |
filteredResponseDataScaled, | |
model = myEnetModel$new(), | |
numBootstrap=10, | |
alpha=alphas, lambda = lambdas) | |
``` | |
Select predictors that are consistently highly ranked over the bootstrapping runs. | |
```{r totally-cheating, echo=FALSE, results='hide', message=FALSE} | |
e <- loadEntity('syn1437787') | |
resultsModel <- e$objects$resultsModel | |
``` | |
```{r rank-predictors} | |
# Rank predictors and combine average over bootstrapping runs | |
RankTotals<-c() | |
for(k in 1:length(resultsModel)){ | |
RankTotals<-cbind(RankTotals,rank(abs(as.numeric(resultsModel[[k]][-1])),ties.method="min")/length(resultsModel[[k]])) | |
} | |
# select top influences | |
rownames(RankTotals)<-rownames(resultsModel[[k]])[-1] | |
reference <- apply(RankTotals,1,sum) | |
X<-sort(reference, decreasing =T, index.return =T) | |
sigCoefName<-names(reference)[X$ix] | |
``` | |
Let's look at the distribution of rankings among the predictors. | |
```{r histogram} | |
hist(reference, main="Ranking predictors", xlab="predictor rank averaged over 100 bootstrapping runs", col="light grey") | |
``` | |
Every analysis in which gene expression plays a part must contain at least one heatmap. This is a very important rule which we don't want to break. So, here it is: | |
```{r heatmap} | |
genes <- unique(gsub(pattern="[_expr|_copy|_mut]",replacement="", x=sigCoefName[1:100])) | |
genes <- intersect(genes, rownames(assayData(expr)$exprs)) | |
m <- (assayData(expr)$exprs)[genes,] | |
heatmap(m, col=topo.colors(100), cexRow=0.4) | |
``` | |
Conclusion | |
---------- | |
Knitr is cool! | |
* mixes formatted text, code and graphics | |
* outputs HTML or PDF. | |
* produces documents programmatically, which means they can be shared, re-run and versioned. | |
* works with markdown, which I'm hoping will mesh nicely with the wiki features coming to Synapse. | |
* still has some kinks. I had issues with Latex equation rendering and debugging is hard. | |
* long running analyses should be pre-computed and stored in Synapse. | |
[1]: http://www.broadinstitute.org/software/cprg/?q=node/11 "CCLE" | |
[2]: http://www.broadinstitute.org/gsea/ "GSEA" | |
[3]: https://sagebionetworks.jira.com/wiki/display/SYNR/Home | |
[4]: http://cran.r-project.org/web/packages/doMC/vignettes/gettingstartedMC.pdf | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment