@jamesmcm
Created July 3, 2014 12:33
{
"metadata": {
"celltoolbar": "Slideshow",
"name": "",
"signature": "sha256:68f7d12dcf5ee7656a300d7009496fbd40f486e1dcb06c46c9d48f015daa4dea"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"<script language=\"javascript\">\n",
"\n",
" function MouseRollover(MyImage) {\n",
"\n",
" MyImage.src = \"rfrplot.png\";\n",
" \n",
" }\n",
" \n",
" function MouseOut(MyImage) {\n",
" MyImage.src = \"gprplot.png\";\n",
" }\n",
"</script>\n",
"\n",
"\n",
"<center>\n",
"\n",
"<h2>DREAM9</h2>\n",
"<h2>Acute Myeloid Leukemia Outcome Prediction Challenge</h2>\n",
"<br>\n",
"<h3>01/07/2014</h3>\n",
"\n",
"\n",
"<h3> James McMurray </h3>\n",
"PhD Student<br>\n",
"MPI Intelligent Systems, T\u00fcbingen, Germany\n",
"\n",
"</center>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The DREAM Challenges\n",
"------------------------\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* __D__ialogue for __R__everse __E__ngineering __A__ssessments and __M__ethods"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Challenges focus on Systems Biology"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Examples of previous challenges include inferring gene regulatory networks, and predicting breast cancer survival."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Sponsors:\n",
" * Columbia University Center for Multiscale Analysis of Genomic and Cellular Networks\n",
" * IBM Computational Biology Center\n",
" * The New York Academy of Sciences \n",
" * NIH Roadmap Initiative\n",
" * Sage Bionetworks"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"DREAM9 Challenges\n",
"-------------------\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Three DREAM9 Challenges \n",
"\n",
" * Alzheimer\u2019s Disease Big Data DREAM Challenge \\#1 \n",
" \n",
" * The Broad-DREAM Gene Essentiality Prediction Challenge \n",
" \n",
" * The DREAM9 __Acute Myeloid Leukemia (AML) Outcome Prediction Challenge__ \n",
" \n",
" * Predict the outcome of treatment of AML patients (resistant or remission), their remission duration and overall survival based on clinical cytogenetics, known genetic markers and phosphoproteomic data.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Chosen because its tasks seemed more intuitive (they don't require knowledge of medical imaging, etc.)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* No data access restrictions (unlike Alzheimer's disease challenge)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Acute Myeloid Leukemia\n",
"------------------------\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Acute Myeloid Leukemia is a particularly lethal type of leukemia."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Affects the myeloid cells."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* In 2014, there are projected to be ~18,000 new cases of AML and ~10,000 deaths from the disease.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Only approximately a quarter of the patients diagnosed with AML survive beyond 5 years."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Acute Myeloid Leukemia Outcome Prediction Challenge\n",
"-----------------------------------------------------\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Participants are given data on AML patients including 40 clinical correlates and the expression level of 231 proteins and phosphoproteins probed by reverse phase protein array analysis.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Note that the expression levels include some missing data.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Challenge consists of three sub-challenges:\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* __Subchallenge 1__: Determine the best model to predict which AML patients will have Complete Remission or will be Primary Resistant\n",
" * Classification"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* __Subchallenge 2__: For patients who have Complete Remission, predict remission duration.\n",
" * Regression"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* __Subchallenge 3__: Predict the overall survival time for each patient\n",
" * Regression"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Random Forests\n",
"------------------------\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* An ensemble of decision trees, each trained on a bootstrap sample of the data - can be used for both classification and regression"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"<img src=\"./dtree.gif\" style=\"width: 500px;\"/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Decision trees are trained by greedily choosing splits that maximise information gain (classification) or minimise a loss function such as squared error (regression)"
]
},
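{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* A minimal sketch of how a single candidate split could be scored, assuming entropy as the impurity measure (the criterion actually used is not stated here). The helper names `entropy` and `information_gain` and the toy data are illustrative only."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"\n",
"def entropy(labels):\n",
"    # Shannon entropy (in bits) of a 1-D array of class labels.\n",
"    _, counts = np.unique(labels, return_counts=True)\n",
"    p = counts / float(counts.sum())\n",
"    return -np.sum(p * np.log2(p))\n",
"\n",
"def information_gain(feature, labels, threshold):\n",
"    # Entropy reduction obtained by splitting on feature <= threshold.\n",
"    left = labels[feature <= threshold]\n",
"    right = labels[feature > threshold]\n",
"    if len(left) == 0 or len(right) == 0:\n",
"        return 0.0  # a split that separates nothing gains nothing\n",
"    n = float(len(labels))\n",
"    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)\n",
"    return entropy(labels) - children\n",
"\n",
"# Toy data (made up): one feature, two classes.\n",
"x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])\n",
"y = np.array([0, 0, 0, 1, 1, 1])\n",
"\n",
"# A tree learner would keep the candidate threshold with the largest gain.\n",
"for t in [1.5, 3.5, 5.5]:\n",
"    print(t, information_gain(x, y, t))\n"
],
"language": "python",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": []
},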
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Random Forest Regression example\n",
"---------------------------------\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Can we just use this and be finished?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"<img src=\"./rfrplot.png\" style=\"width: 500px;\"/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Reasonable within the observed data, but cannot extrapolate: predictions outside the training data range stay flat"
]
},
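{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* A minimal sketch of the extrapolation problem, assuming scikit-learn is available. The noisy sine-wave data and all variable names are made up for illustration and are not the data behind the plot above."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"\n",
"# Toy periodic data (assumption: noisy sine wave observed on [0, 10]).\n",
"rng = np.random.RandomState(0)\n",
"X_train = np.linspace(0, 10, 200).reshape(-1, 1)\n",
"y_train = np.sin(X_train).ravel() + 0.1 * rng.randn(200)\n",
"\n",
"forest = RandomForestRegressor(n_estimators=100, random_state=0)\n",
"forest.fit(X_train, y_train)\n",
"\n",
"# Every input beyond 10 falls into the same leaves as the largest training\n",
"# inputs, so the forest returns one constant value instead of oscillating.\n",
"X_outside = np.array([[12.0], [15.0], [20.0]])\n",
"print(forest.predict(X_outside))\n"
],
"language": "python",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": []
},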
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Gaussian Process Regression\n",
"----------------------------\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"<img id='hovim' src=\"./gprplot.png\" style=\"width: 500px;\" onMouseOver=\"MouseRollover(this)\" \n",
"onMouseOut=\"MouseOut(this)\" />\n",
"\n",
"\n",
"<!-- \n",
"<img src=\"./gprplot.png\" style=\"width: 500px;\"/>\n",
"-->"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* A suitable choice of assumptions allows prediction outside the range of the input data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Also provides an uncertainty estimate"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* How does it work?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Gaussian Processes\n",
"----------------------------\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* A Bayesian non-parametric model"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Assumptions are encoded in the choice of _covariance function_\n",
" * The previous example used a periodic covariance function"
]
},
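{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* A minimal NumPy sketch of GP regression with a periodic covariance function, under stated assumptions: the exp-sine-squared form of the kernel, hand-fixed hyperparameters (period, length scale, noise variance) rather than learned ones, and made-up toy data. `periodic_kernel` and `gp_predict` are illustrative names, not a library API."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"\n",
"def periodic_kernel(a, b, period=2 * np.pi, length_scale=1.0, variance=1.0):\n",
"    # k(a, b) = variance * exp(-2 * sin^2(pi * |a - b| / period) / length_scale^2)\n",
"    d = np.abs(a[:, None] - b[None, :])\n",
"    return variance * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length_scale ** 2)\n",
"\n",
"def gp_predict(x_train, y_train, x_test, noise=0.01, **kernel_args):\n",
"    # Posterior mean and variance of a zero-mean GP; noise is the noise variance.\n",
"    K = periodic_kernel(x_train, x_train, **kernel_args) + noise * np.eye(len(x_train))\n",
"    K_s = periodic_kernel(x_train, x_test, **kernel_args)\n",
"    K_ss = periodic_kernel(x_test, x_test, **kernel_args)\n",
"    mean = K_s.T.dot(np.linalg.solve(K, y_train))\n",
"    cov = K_ss - K_s.T.dot(np.linalg.solve(K, K_s))\n",
"    return mean, np.diag(cov)\n",
"\n",
"# Toy data (assumption: noisy sine wave observed on [0, 10]).\n",
"rng = np.random.RandomState(0)\n",
"x_train = np.linspace(0, 10, 50)\n",
"y_train = np.sin(x_train) + 0.1 * rng.randn(50)\n",
"x_test = np.array([12.0, 15.0, 20.0])  # outside the training range\n",
"\n",
"mean, var = gp_predict(x_train, y_train, x_test)\n",
"print(mean)  # compare with np.sin(x_test): the kernel repeats every 2*pi\n",
"print(var)   # predictive variance, i.e. the uncertainty estimate\n"
],
"language": "python",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": []
},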
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Returns a distribution over functions - the family of functions is specified by the covariance function and its hyperparameters"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* The trick is in choosing the right kernel (covariance) function"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Can also be used to impute missing values - conceptually, by fitting a GP that predicts each input dimension with missing values from the other dimensions"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Conclusion\n",
"-------------\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Currently participating in the DREAM9 AML Outcome Prediction Challenge\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* There are three sub-challenges, including classification and regression tasks\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* For the first submission, used Random Forests for all tasks\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Gaussian Process approach seems promising:\n",
" * Can impute missing data reasonably\n",
" * Provides uncertainty estimates\n",
" * With a good choice of covariance function, should be able to provide good predictions over much of the data space"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Main task lies in correctly choosing the covariance functions\n",
" * Dealing with a mix of categorical and continuous data, etc.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"<center>\n",
"<strong>Thanks for your time!</strong>\n",
"</center>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"TODO\n",
"======\n",
"\n",
"* Why a simple autoencoder can't be used\n",
"* Add motivation section at start - dimensionality reduction\n",
"* Use of RBMs for pretraining\n",
"* Visualisation\n",
"* Data whitening, etc.\n",
"* Actual example - pre-training makes data linearly separable\n",
"* Font size\n",
"\n",
"ipython nbconvert pres.ipynb --to slides --post serve\n",
"\n",
"\n"
]
}
],
"metadata": {}
}
]
}