Created
July 3, 2014 12:33
| { | |
| "metadata": { | |
| "celltoolbar": "Slideshow", | |
| "name": "", | |
| "signature": "sha256:68f7d12dcf5ee7656a300d7009496fbd40f486e1dcb06c46c9d48f015daa4dea" | |
| }, | |
| "nbformat": 3, | |
| "nbformat_minor": 0, | |
| "worksheets": [ | |
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "slide" | |
| } | |
| }, | |
| "source": [ | |
| "<script language=\"javascript\">\n", | |
| "\n", | |
| " function MouseRollover(MyImage) {\n", | |
| "\n", | |
| " MyImage.src = \"rfrplot.png\";\n", | |
| " \n", | |
| " }\n", | |
| " \n", | |
| " function MouseOut(MyImage) {\n", | |
| " MyImage.src = \"gprplot.png\";\n", | |
| " }\n", | |
| "</script>\n", | |
| "\n", | |
| "\n", | |
| "<center>\n", | |
| "\n", | |
| "<h2>DREAM9</h2>\n", | |
| "<h2>Acute Myeloid Leukemia Outcome Prediction Challenge</h2>\n", | |
| "<br>\n", | |
| "<h3>01/07/2014</h3>\n", | |
| "\n", | |
| "\n", | |
| "<h3> James McMurray </h3>\n", | |
| "PhD Student<br>\n", | |
| "MPI Intelligent Systems, T\u00fcbingen, Germany\n", | |
| "\n", | |
| "</center>\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "slide" | |
| } | |
| }, | |
| "source": [ | |
| "The DREAM Challenges\n", | |
| "------------------------\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* __D__ialogue for __R__everse __E__ngineering __A__ssessments and __M__ethods" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Challenges focus on Systems Biology" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Examples of previous challenges include inferring gene regulatory networks, and predicting breast cancer survival." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Sponsors:\n", | |
| " * Columbia University Center for Multiscale Analysis of Genomic and Cellular Networks\n", | |
| " * IBM Computational Biology Center\n", | |
| " * The New York Academy of Sciences \n", | |
| " * NIH Roadmap Initiative\n", | |
| " * Sage Bionetworks" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "slide" | |
| } | |
| }, | |
| "source": [ | |
| "DREAM9 Challenges\n", | |
| "-------------------\n", | |
| "\n", | |
| "\n", | |
| "\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Three DREAM9 Challenges \n", | |
| "\n", | |
| " * Alzheimer\u2019s Disease Big Data DREAM Challenge \\#1 \n", | |
| " \n", | |
| " * The Broad-DREAM Gene Essentiality Prediction Challenge \n", | |
| " \n", | |
| " * The DREAM9 __Acute Myeloid Leukemia (AML) Outcome Prediction Challenge__ \n", | |
| " \n", | |
| " * Predict the outcome of treatment of AML patients (resistant or remission), their remission duration and overall survival based on clinical cytogenetics, known genetic markers and phosphoproteomic data.\n", | |
| "\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Chosen because the tasks seemed more intuitive (they don't require knowledge of medical imaging, etc.)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* No data access restrictions (unlike Alzheimer's disease challenge)\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "slide" | |
| } | |
| }, | |
| "source": [ | |
| "Acute Myeloid Leukemia\n", | |
| "------------------------\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Acute Myeloid Leukemia is a particularly lethal type of leukemia." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Affects the myeloid cells." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* In 2014, there are projected to be ~18,000 new cases of AML, and ~10,000 deaths from the disease.\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Only approximately a quarter of the patients diagnosed with AML survive beyond 5 years." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "slide" | |
| } | |
| }, | |
| "source": [ | |
| "Acute Myeloid Leukemia Outcome Prediction Challenge\n", | |
| "-----------------------------------------------------\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Participants are given data on AML patients including 40 clinical correlates and the expression level of 231 proteins and phosphoproteins probed by reverse phase protein array analysis.\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Note that the expression levels include some missing data.\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Challenge consists of three sub-challenges:\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* __Subchallenge 1__: Determine the best model to predict which AML patients will have Complete Remission or will be Primary Resistant\n", | |
| " * Classification" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* __Subchallenge 2__: For patients who have Complete Remission, predict remission duration.\n", | |
| " * Regression" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* __Subchallenge 3__: Predict the overall survival time for each patient\n", | |
| " * Regression" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "slide" | |
| } | |
| }, | |
| "source": [ | |
| "Random Forests\n", | |
| "------------------------\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* An ensemble of decision trees trained on bootstrapped samples - can be used for classification and regression" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "<img src=\"./dtree.gif\" style=\"width: 500px;\"/>" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Decision trees are trained by greedily choosing the splits that maximise information gain (classification) or minimise a loss such as squared error (regression)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "slide" | |
| } | |
| }, | |
| "source": [ | |
| "Random Forest Regression example\n", | |
| "---------------------------------\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Can we just use this and be finished?" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "<img src=\"./rfrplot.png\" style=\"width: 500px;\"/>" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Reasonable on observed data, but cannot extrapolate: predictions outside the range of the training data stay flat" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "slide" | |
| } | |
| }, | |
| "source": [ | |
| "Gaussian Process Regression\n", | |
| "----------------------------\n", | |
| "\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "<img id='hovim' src=\"./gprplot.png\" style=\"width: 500px;\" onMouseOver=\"MouseRollover(this)\" \n", | |
| "onMouseOut=\"MouseOut(this)\" />\n", | |
| "\n", | |
| "\n", | |
| "<!-- \n", | |
| "<img src=\"./gprplot.png\" style=\"width: 500px;\"/>\n", | |
| "-->" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Correct choice of assumptions allows prediction outside of input data range" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Provides uncertainty estimate too" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* How does it work?" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "slide" | |
| } | |
| }, | |
| "source": [ | |
| "Gaussian Processes\n", | |
| "----------------------------\n", | |
| "\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* A Bayesian non-parametric model" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Assumptions are encoded in the choice of _covariance function_\n", | |
| " * In the previous example, a periodic covariance function was chosen" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Returns a distribution over functions - the family of functions is specified by the covariance function and its hyperparameters" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* The trick is in the correct choice of kernel function" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Can also be used to impute missing values - conceptually, by fitting a GP that predicts the input dimensions with missing values from the others" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "slide" | |
| } | |
| }, | |
| "source": [ | |
| "Conclusion\n", | |
| "-------------\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Currently participating in the DREAM9 AML Outcome Prediction Challenge\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* There are three sub-challenges, including classification and regression tasks\n", | |
| "\n", | |
| "\n", | |
| "\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* For first submission used Random Forests for all tasks\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Gaussian Process approach seems promising:\n", | |
| " * Can impute missing data reasonably\n", | |
| " * Provides uncertainty estimates\n", | |
| " * With a good choice of covariance function, it should be able to provide good predictions over much of the data space" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "* Main task lies in correctly choosing the covariance functions\n", | |
| " * Dealing with mix of categorical data, etc.\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "fragment" | |
| } | |
| }, | |
| "source": [ | |
| "<center>\n", | |
| "<strong>Thanks for your time!</strong>\n", | |
| "</center>\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "slideshow": { | |
| "slide_type": "skip" | |
| } | |
| }, | |
| "source": [ | |
| "TODO\n", | |
| "======\n", | |
| "\n", | |
| "* Why a simple autoencoder can't be used\n", | |
| "* Add motivation section at start - dimensionality reduction\n", | |
| "* Use of RBMs for pretraining\n", | |
| "* Visualisation\n", | |
| "* Data whitening, etc.\n", | |
| "* Actual example - pre-training makes linearly separable\n", | |
| "* Font size\n", | |
| "\n", | |
| "ipython nbconvert pres.ipynb --to slides --post serve\n", | |
| "\n", | |
| "\n" | |
| ] | |
| } | |
| ], | |
| "metadata": {} | |
| } | |
| ] | |
| } |
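
The Gaussian Process slides say that a well-chosen covariance function allows prediction outside the input data range, with an uncertainty estimate. The sketch below illustrates that idea in plain Python. It is not the code behind `gprplot.png`. The exp-sine-squared form of the periodic kernel, the hyperparameter values (`period`, `length_scale`, `variance`, `noise`), the sine-wave toy data and all function names are assumptions made for illustration.

```python
import math


def periodic_kernel(x1, x2, period=2.0 * math.pi, length_scale=1.0, variance=1.0):
    """Exp-sine-squared (periodic) covariance between two scalar inputs."""
    s = math.sin(math.pi * abs(x1 - x2) / period)
    return variance * math.exp(-2.0 * s * s / length_scale ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def gp_predict(x_train, y_train, x_new, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP at one test input."""
    # Covariance of the training inputs, with noise variance on the diagonal.
    K = [[periodic_kernel(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(x_train)] for i, a in enumerate(x_train)]
    # Covariance between each training input and the test input.
    k_star = [periodic_kernel(a, x_new) for a in x_train]
    alpha = solve(K, y_train)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)
    var = periodic_kernel(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, var


# Toy data: one period of a sine wave, observed only on [0, 2*pi).
x_train = [2.0 * math.pi * i / 12 for i in range(12)]
y_train = [math.sin(x) for x in x_train]

# A test input two full periods beyond the observed range.
x_far = 4.0 * math.pi + 1.0
mean, var = gp_predict(x_train, y_train, x_far)
print(f"GP mean {mean:.3f}, variance {var:.4f}, true value {math.sin(x_far):.3f}")
```

Because the kernel depends on the inputs only through their distance modulo the period, the far-away test point is treated like the matching point inside the observed range. That is the sense in which the assumption of periodicity carries the prediction beyond the data. With a non-periodic kernel, the same code would revert towards the zero prior mean far from the data.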