Created
April 15, 2019 22:04
-
-
Save embg/53951be5aa4ce84926ce0df34f8ea029 to your computer and use it in GitHub Desktop.
MAT 494 Final Project.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"name": "MAT 494 Final Project.ipynb", | |
"version": "0.3.2", | |
"provenance": [], | |
"collapsed_sections": [], | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/embg/53951be5aa4ce84926ce0df34f8ea029/mat-494-final-project.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "_8lQxzaV6-3Y", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"# Setup\n", | |
"First, let's install and import some relevant packages:" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "52yVubfKpNJ0", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"# Uncomment if you need any of these\n", | |
"# !pip install -U seaborn\n", | |
"# !pip install -U allensdk\n", | |
"# !pip install -U sklearn\n", | |
"# !pip install -U numpy\n", | |
"# !pip install -U scipy" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"id": "m2zsoTy3PbMD", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"%matplotlib inline\n", | |
"import matplotlib.pyplot as plt\n", | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"import seaborn as sns\n", | |
"import h5py\n", | |
"sns.set(font_scale=1.5)\n", | |
"from allensdk.core.brain_observatory_cache import BrainObservatoryCache" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"id": "giMFI44g7UFM", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"Next, we'll need to obtain the data via the Allen Institute's SDK. We chose experiment `510938357` because it was used in [this example from the Allen Institute](http://alleninstitute.github.io/AllenSDK/_static/examples/nb/brain_observatory_analysis.html)." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "3uxzsPWcEMd0", | |
"colab_type": "code", | |
"outputId": "043ec81f-c4bd-49c8-938d-1827603c4171", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 329 | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"# This should run fast unless you accidentally reset the runtime\n", | |
"boc = BrainObservatoryCache()\n", | |
"data_set = boc.get_ophys_experiment_data(510938357)\n", | |
"data_set.get_metadata()" | |
], | |
"execution_count": 58, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"{'age_days': 79,\n", | |
" 'cre_line': 'Rbp4-Cre_KL100/wt',\n", | |
" 'device': 'Nikon A1R-MP multiphoton microscope',\n", | |
" 'device_name': 'CAM2P.1',\n", | |
" 'excitation_lambda': '910 nanometers',\n", | |
" 'experiment_container_id': 511499656,\n", | |
" 'fov': '400x400 microns (512 x 512 pixels)',\n", | |
" 'genotype': 'Rbp4-Cre_KL100/wt;Camk2a-tTA/wt;Ai93(TITL-GCaMP6f)/Ai93(TITL-GCaMP6f)',\n", | |
" 'imaging_depth_um': 375,\n", | |
" 'indicator': 'GCaMP6f',\n", | |
" 'ophys_experiment_id': 510938357,\n", | |
" 'pipeline_version': '3.0',\n", | |
" 'session_start_time': datetime.datetime(2016, 3, 29, 13, 12, 6),\n", | |
" 'session_type': 'three_session_B',\n", | |
" 'sex': 'male',\n", | |
" 'specimen_name': 'Rbp4-Cre;Camk2a-tTA;Ai93-233442',\n", | |
" 'targeted_structure': 'VISal'}" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 58 | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "MwBx0bUM7-dI", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"Our X variable will be df/F traces measured from 70 cells in the VISal brain region of a mouse while it was presented with various visual stimuli." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "my_-y3MzGjbp", | |
"colab_type": "code", | |
"outputId": "f8711026-bcf3-49af-b1bb-5017bdd0e388", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 164 | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"_, response = data_set.get_dff_traces()\n", | |
"response = response.T # transpose so rows correspond to observations\n", | |
"print(response.shape)\n", | |
"print(response)" | |
], | |
"execution_count": 60, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"(114109, 70)\n", | |
"[[0.21781403 0.07487525 0.03365191 ... 0.29062796 0.06665727 0.20252168]\n", | |
" [0.20366953 0.12993708 0.1819539 ... 0.2583535 0.17013855 0.06673725]\n", | |
" [0.301449 0.11263548 0.14588039 ... 0.32233968 0.05344658 0.09116933]\n", | |
" ...\n", | |
" [0.14926389 0.10777875 0.09995854 ... 0.2433608 0.12114807 0.12035463]\n", | |
" [0.05776263 0.05191515 0.11781938 ... 0.27302468 0.12902136 0.15682107]\n", | |
" [0.12361121 0.13451324 0.09773071 ... 0.29064617 0.10175972 0.1981748 ]]\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "S0esIud38Z9_", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"Our y variable will be the stimulus corresponding to each df/F trace. There are a few null values in the data corresponding to \"breaks\" where the mouse was shown a blank grey screen; we drop these, as our goal is to predict the static grating orientation." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "uYj2yJ-PE4YR", | |
"colab_type": "code", | |
"outputId": "344ba876-3b29-4297-ea49-529d0f6ea3d6", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 225 | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"# Our y variable will be the static grating shown at a given time\n", | |
"stimulus = data_set.get_stimulus_table('static_gratings').dropna()\n", | |
"print(stimulus.shape)\n", | |
"stimulus.head()" | |
], | |
"execution_count": 61, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"(5808, 5)\n" | |
], | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>orientation</th>\n", | |
" <th>spatial_frequency</th>\n", | |
" <th>phase</th>\n", | |
" <th>start</th>\n", | |
" <th>end</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>150.0</td>\n", | |
" <td>0.32</td>\n", | |
" <td>0.00</td>\n", | |
" <td>735</td>\n", | |
" <td>742</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>60.0</td>\n", | |
" <td>0.02</td>\n", | |
" <td>0.75</td>\n", | |
" <td>743</td>\n", | |
" <td>750</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>150.0</td>\n", | |
" <td>0.32</td>\n", | |
" <td>0.75</td>\n", | |
" <td>750</td>\n", | |
" <td>757</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>30.0</td>\n", | |
" <td>0.08</td>\n", | |
" <td>0.75</td>\n", | |
" <td>758</td>\n", | |
" <td>765</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>120.0</td>\n", | |
" <td>0.04</td>\n", | |
" <td>0.50</td>\n", | |
" <td>765</td>\n", | |
" <td>772</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" orientation spatial_frequency phase start end\n", | |
"0 150.0 0.32 0.00 735 742\n", | |
"1 60.0 0.02 0.75 743 750\n", | |
"2 150.0 0.32 0.75 750 757\n", | |
"3 30.0 0.08 0.75 758 765\n", | |
"4 120.0 0.04 0.50 765 772" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 61 | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "4z9e1lDP80iP", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"The `start` and `end` columns above give slices in the `response` matrix for the traces corresponding to each static grating stimulus. Our first step in preparing the data will therefore be to group rows of the `response` matrix by their `[start : end]` slices." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "Curf0Ic7ta6A", | |
"colab_type": "code", | |
"outputId": "935ae809-4b12-4e5f-a30d-9a60e2756d6c", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 35 | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"# Utility function to compute the slice for a given row of the stimulus DataFrame.\n", | |
"def time_slice(stim_idx):\n", | |
" row = stimulus.loc[stim_idx]\n", | |
" start, end = map(int, (row['start'], row['end']))\n", | |
" return range(start, end + 1)\n", | |
"\n", | |
"assert list(time_slice(4)) == list(range(765, 772 + 1))\n", | |
"assert response[time_slice(4)].shape == (772 - 765 + 1, 70)\n", | |
"\n", | |
"slices = [time_slice(idx) for idx in stimulus.index]\n", | |
"\n", | |
"# Utility function to apply a function of a slice to all response slices\n", | |
"def apply_to_slices(func):\n", | |
" return np.array([\n", | |
" func(response[sl], axis=0) for sl in slices\n", | |
" ])\n", | |
"\n", | |
"slices[:4]" | |
], | |
"execution_count": 62, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"[range(735, 743), range(743, 751), range(750, 758), range(758, 766)]" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 62 | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "yjlQz6UH9nkz", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"With the above infrastructure in place, we can compute the min- and max-values of the each neuron's response across each slice of time. This will allow the model to capture spike-features corresponding to large max values (or small min values after a spike). Note that `response` contains df/F values; i.e. we're looking at the min and max of the derivative of the signal, so it makes sense to ignore median values, which don't give us spike information.\n", | |
"\n", | |
"We build `X` by concatenating the min- and max-values over each response slice. This gives us a single vector for each stimulus, as opposed to a 2d matrix for each stimulus (the original data format)." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "ZnndMQJE5Tvz", | |
"colab_type": "code", | |
"outputId": "f23784f1-405f-4db4-aeba-0b8cc063e3ac", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 35 | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"resp_min = apply_to_slices(np.min)\n", | |
"resp_max = apply_to_slices(np.max)\n", | |
"\n", | |
"X = np.concatenate([resp_min, resp_max], axis=1)\n", | |
"X.shape" | |
], | |
"execution_count": 64, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"(5808, 140)" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 64 | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "TYpcWruj-Viv", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"Finally, we convert the numerical `orientation` column of the stimulus into 6 categorical variables, corresponding to 60-degree orientation increments starting at 0 degrees." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "4lsR7pbu6Tn1", | |
"colab_type": "code", | |
"outputId": "14a78372-85a4-4403-d74f-ca4a2487df39", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 54 | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"from sklearn.preprocessing import LabelEncoder\n", | |
"le = LabelEncoder()\n", | |
"y = le.fit_transform(stimulus['orientation'].values)\n", | |
"print(np.unique(y))\n", | |
"y.shape" | |
], | |
"execution_count": 65, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"[0 1 2 3 4 5]\n" | |
], | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"(5808,)" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 65 | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "t1t282jpA2Wt", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"Note that each category in `y` corresponds to several different stimuli, with varying spatial frequencies and phases. The only thing the stimuli for each category have in common is the orientation of their static gratings! This will make it easier for us to conclude that the neurons are in fact responding to the `orientation` of lines in the gratings, and not to specific clusters of pixels in the image." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "q6e-cwRj-kvN", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"# Analysis\n", | |
"Let's import some data analysis packages." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "xeqAYH1O_R9p", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"from sklearn.model_selection import train_test_split\n", | |
"from sklearn.ensemble import RandomForestClassifier\n", | |
"from sklearn.model_selection import ShuffleSplit\n", | |
"from sklearn.metrics import accuracy_score\n", | |
"from scipy.stats import binom_test" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"id": "Dl176fOA-qQl", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"Our goal is to show that `X` depends on `y`; i.e., the firing patterns of neurons in the mouse's visual cortex depends on the orientation of the static grating it is looking at. \n", | |
"\n", | |
"Our null hypothesis is therefore that `X` and `y` are independent, which is equivalent to hypothesizing that if a classifier is fit to predict `X` from `y` on a random sample of input data, the variable L=\"is the accuracy better than 1/num_classes = 1/6\" can be modelled as being drawn from a Bernoulli distribution with `p=0.5`. \n", | |
"\n", | |
"To reject the null hypothesis, we will repeatedly fit a classifier predicting `X` from `y` on random train-test splits and compute the accuracy of the fit (an approximation to randomly sampling a value for L). We will then perform a Bernoulli test on our experimental values for L; if we find that L can't be modelled as a fair coin, we will have shown that `X` and `y` are dependent!" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "RqXUxTXZ4t2x", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"# Random Forest is a versatile model which we can use without worrying to much about tuning hyper-parameters\n", | |
"clf = RandomForestClassifier(n_estimators=100, \n", | |
" max_depth=3, \n", | |
" random_state=0)\n", | |
"\n", | |
"# Parameters for the experiment\n", | |
"max_num_trials = 100\n", | |
"rs = ShuffleSplit(n_splits = max_num_trials, test_size=.2, random_state=0)\n", | |
"num_successes = num_trials = 0" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"id": "LXWeQ2VgA0Jm", | |
"colab_type": "code", | |
"outputId": "921b811e-5412-4595-f686-8ed6e0be6c32", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 733 | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"# Run the experiment\n", | |
"for train_index, test_index in rs.split(X):\n", | |
" num_trials += 1\n", | |
" print(f'Trial {num_trials} of at most {max_num_trials}.')\n", | |
"\n", | |
" # Split the data\n", | |
" X_train, X_test = X[train_index], X[test_index]\n", | |
" y_train, y_test = y[train_index], y[test_index]\n", | |
"\n", | |
" # Fit model on the train data\n", | |
" clf.fit(X_train, y_train)\n", | |
" \n", | |
" # Compute classification accuracy on the test data\n", | |
" y_pred = clf.predict(X_test)\n", | |
" accuracy = accuracy_score(y_test, y_pred) \n", | |
" print(f'Accuracy of the model on this split: {accuracy}')\n", | |
" num_successes += accuracy_score(y_test, y_pred) > 1/6\n", | |
" \n", | |
" # Compute p value for the null hypothesis: if X and y are independent, we expect to beat 1/6 accuracy about 50% of the time\n", | |
" null_hypothesis_probability = binom_test(num_successes, num_trials, p=0.5, alternative='greater')\n", | |
" print(f'Probability of the null hypothesis given the evidence thus far: {null_hypothesis_probability}')\n", | |
" if null_hypothesis_probability < 0.001:\n", | |
" break\n", | |
" print()\n" | |
], | |
"execution_count": 69, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Trial 1 of at most 100.\n", | |
"Accuracy of the model on this split: 0.21600688468158347\n", | |
"Probability of the null hypothesis given the evidence thus far: 0.5\n", | |
"\n", | |
"Trial 2 of at most 100.\n", | |
"Accuracy of the model on this split: 0.21514629948364888\n", | |
"Probability of the null hypothesis given the evidence thus far: 0.25\n", | |
"\n", | |
"Trial 3 of at most 100.\n", | |
"Accuracy of the model on this split: 0.20223752151462995\n", | |
"Probability of the null hypothesis given the evidence thus far: 0.125\n", | |
"\n", | |
"Trial 4 of at most 100.\n", | |
"Accuracy of the model on this split: 0.18674698795180722\n", | |
"Probability of the null hypothesis given the evidence thus far: 0.0625\n", | |
"\n", | |
"Trial 5 of at most 100.\n", | |
"Accuracy of the model on this split: 0.20654044750430292\n", | |
"Probability of the null hypothesis given the evidence thus far: 0.03125\n", | |
"\n", | |
"Trial 6 of at most 100.\n", | |
"Accuracy of the model on this split: 0.21084337349397592\n", | |
"Probability of the null hypothesis given the evidence thus far: 0.015625\n", | |
"\n", | |
"Trial 7 of at most 100.\n", | |
"Accuracy of the model on this split: 0.20740103270223753\n", | |
"Probability of the null hypothesis given the evidence thus far: 0.0078125\n", | |
"\n", | |
"Trial 8 of at most 100.\n", | |
"Accuracy of the model on this split: 0.20395869191049915\n", | |
"Probability of the null hypothesis given the evidence thus far: 0.00390625\n", | |
"\n", | |
"Trial 9 of at most 100.\n", | |
"Accuracy of the model on this split: 0.2306368330464716\n", | |
"Probability of the null hypothesis given the evidence thus far: 0.001953125\n", | |
"\n", | |
"Trial 10 of at most 100.\n", | |
"Accuracy of the model on this split: 0.19707401032702238\n", | |
"Probability of the null hypothesis given the evidence thus far: 0.0009765625\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "jhhXRdVcB2xa", | |
"colab_type": "code", | |
"outputId": "65bf6b63-37c5-4a9e-bad4-a528b49fa00b", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 35 | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"print(f'Probability of the null hypothesis given evidence from {num_trials} trials: {null_hypothesis_probability}') " | |
], | |
"execution_count": 70, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Probability of the null hypothesis given evidence from 10 trials: 0.0009765625\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "v51W1zFXAsOD", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"Since the probability of the null hypothesis is less than `0.001`, we conclude with extremely high confidence that `X` depends on `y`!" | |
] | |
} | |
] | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment