Last active
May 27, 2016 21:24
-
-
Save jonathan-taylor/2b3ea263bbaf4e9f2b615838fde1ea97 to your computer and use it in GitHub Desktop.
Square-root LASSO with highdim knockoff data generating mechanism
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| { | |
| "cells": [ | |
| { | |
| "cell_type": "code", | |
| "execution_count": 1, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "from __future__ import division, print_function\n", | |
| "from IPython.display import HTML\n", | |
| "\n", | |
| "import numpy as np\n", | |
| "import statsmodels.api as sm\n", | |
| "%matplotlib inline\n", | |
| "import matplotlib.pyplot as plt\n", | |
| "\n", | |
| "# this code below is available at https://github.com/jonathan-taylor/selective-inference\n", | |
| "from selection.algorithms.lasso import lasso\n", | |
| "from selection.algorithms.sqrt_lasso import choose_lambda\n", | |
| "\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Data generating mechanism\n", | |
| "\n", | |
| "We use the same data generating mechanism as in [Barber and Candes (2016)](http://arxiv.org/abs/1602.03574)." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 2, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "3.765534496582512e-15" | |
| ] | |
| }, | |
| "execution_count": 2, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "def cov(p, rho=0.25):\n", | |
| " idx = np.arange(p)\n", | |
| " return rho**np.fabs(np.subtract.outer(idx, idx))\n", | |
| "\n", | |
| "def sqrt_cov(p, rho=0.25):\n", | |
| " idx = np.arange(p)\n", | |
| " C = rho**np.fabs(np.subtract.outer(idx, idx))\n", | |
| " return np.linalg.cholesky(C)\n", | |
| "\n", | |
| "# Testing we have the square-root correct\n", | |
| "p = 2500\n", | |
| "A = cov(p, rho=0.3)\n", | |
| "B = sqrt_cov(p, rho=0.3)\n", | |
| "np.linalg.norm(B.dot(B.T) - A)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 3, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "cholesky_factors = {}\n", | |
| "for rho in [0, 0.25, 0.5, 0.75]:\n", | |
| " cholesky_factors[(rho, 2500)] = sqrt_cov(2500, rho=rho)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 4, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "def instance(n=2000, k=30, p=2500, signal=4.5, rho=0.25, sigma=1.234): # sigma there just to convince you \n", | |
| " # we don't need to know noise level\n", | |
| " if (rho, p) in cholesky_factors.keys():\n", | |
| " _sqrt_cov = cholesky_factors[(rho, p)]\n", | |
| " else:\n", | |
| " cholesky_factors[(rho, p)] = sqrt_cov(p, rho=rho)\n", | |
| " _sqrt_cov = cholesky_factors[(rho, p)]\n", | |
| " X = np.random.standard_normal((n, p)).dot(_sqrt_cov.T)\n", | |
| "\n", | |
| " X /= (np.sqrt((X**2).sum(0))) # like normc\n", | |
| " beta = np.zeros(p)\n", | |
| " beta[:k] = signal * (2 * np.random.binomial(1, 0.5, size=(k,)) - 1) \n", | |
| " np.random.shuffle(beta)\n", | |
| " Y = (X.dot(beta) + np.random.standard_normal(n)) * sigma\n", | |
| " true_active = np.nonzero(beta != 0)[0]\n", | |
| " return X, Y, true_active, beta\n", | |
| "\n", | |
| "X, Y, true_active, beta = instance()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Selective inference with square-root LASSO\n", | |
| "\n", | |
| "We will use the approach of [Tian, Loftus, Taylor (2015)](http://arxiv.org/abs/1504.08031) that does not\n", | |
| "require noise level to choose $\\lambda$. The p-values used are the ones given by the Gaussian approximation suggested in that paper. There are exact p-values but we have not computed them below. So, our null ones will not be \n", | |
| "exactly uniform, but the size of the 5% test is pretty close to 5%." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "We use a choice of $\\lambda$ as guided by theory. The function `choose_lambda` computes a Monte\n", | |
| "Carlo estimate of \n", | |
| "$$\n", | |
| "\\lambda = E\\left(\\frac{\\|X^T\\epsilon\\|_{\\infty}}{\\|\\epsilon\\|_2}\\right)\n", | |
| "$$\n", | |
| "(Actually it chooses the 95% quantile of the random variable by default, rather than the expectation.)\n", | |
| "\n", | |
| "In the paper on selective inference with the square-root LASSO, we see that for a given $\\lambda$ parameter, \n", | |
| "solving the square-root LASSO is equivalent to solving a LASSO with a rough estimate of $\\sigma$ (the estimate used by square-root LASSO) as well as an inflation \n", | |
| "factor depending on the design matrix. So using $\\kappa \\cdot \\lambda$ results in solving a LASSO\n", | |
| "with some other effective $\\lambda$.\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Let's let $\\alpha$ be the multiple of \n", | |
| "`choose_lambda(X)` we use. Where, roughly speaking\n", | |
| "`choose_lambda(X)` is\n", | |
| "$$\n", | |
| "\\lambda = E\\left(\\frac{\\|X^T\\epsilon\\|_{\\infty}}{\\|\\epsilon\\|_2}\\right)\n", | |
| "$$\n", | |
| "\n", | |
| "One reasonable way to choose this might be so that the expected size of the active set in a null model is some size. This is computationally expensive. We can easily get an idea of how many \"strong rules\" violators we have as a function of $\\kappa$ under a null model, by looking at \n", | |
| "$$\n", | |
| "\\# \\left\\{j: |X_j^T\\epsilon| / \\|\\epsilon\\|_2 > \\kappa \\lambda \\right\\}\n", | |
| "$$\n", | |
| "\n", | |
| "\n", | |
| "Or, we might just take some fixed multiple. In the square-root LASSO paper, I have been generally taking $\\kappa=0.7$ for various scenarios. It seems reasonable. No real justification for it." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 12, | |
| "metadata": { | |
| "collapsed": true | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "def simulate(rho, q=0.2, nsim=100, mult=0.5):\n", | |
| " P0, PA, active_size = [], [], []\n", | |
| " full_model_FDP, full_model_power, directional_FDP = {}, {}, {}\n", | |
| "\n", | |
| " for _ in range(nsim):\n", | |
| " X, Y, true_active, beta = instance(rho=rho)\n", | |
| " lam = choose_lambda(X)\n", | |
| " L = lasso.sqrt_lasso(X, Y, mult * lam)\n", | |
| " L.fit()\n", | |
| " S = L.summary('onesided')\n", | |
| " active_size.append(len(L.active))\n", | |
| " # keep p-values when screening \n", | |
| " \n", | |
| " if set(L.active).issuperset(true_active):\n", | |
| " P0.extend([p for p, v in zip(S['pval'], S['variable']) if v not in true_active])\n", | |
| " PA.extend([p for p, v in zip(S['pval'], S['variable']) if v in true_active])\n", | |
| " \n", | |
| " # what to do with the p-values\n", | |
| " \n", | |
| " methods = {'BH': lambda p: sm.stats.multipletests(S['pval'], q, 'fdr_bh')[0],\n", | |
| " 'BY': lambda p: sm.stats.multipletests(S['pval'], q, 'fdr_by')[0],\n", | |
| " 'thresh': lambda p: p < q}\n", | |
| " \n", | |
| " for key, method in methods.items():\n", | |
| " \n", | |
| " selected = methods[key](S['pval'])\n", | |
| " type1_errors = selected * np.array([v not in true_active for v in S['variable']])\n", | |
| " sign_errors = selected * np.array([(s != np.sign(beta[v])) * (v in true_active) for v, s in \n", | |
| " zip(S['variable'], L.active_signs)])\n", | |
| "\n", | |
| " full_model_FDP.setdefault(key, []).append(np.sum(type1_errors) / max(np.sum(selected), 1))\n", | |
| " directional_FDP.setdefault(key, []).append(np.sum(type1_errors + sign_errors) / max(np.sum(selected), 1))\n", | |
| " full_model_power.setdefault(key, []).append(np.sum(selected * np.array([v in true_active for v in S['variable']])) / len(true_active))\n", | |
| " \n", | |
| " return P0, PA, full_model_FDP, directional_FDP, full_model_power, active_size" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Size of active sets" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 17, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "[(0.4, 242.33333333333334), (0.7, 35.666666666666664), (1.0, 12.333333333333334)]\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "print([(m, np.mean(simulate(0.25, nsim=3, mult=m)[-1])) for m in [0.4,0.7,1.]])" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 15, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "def make_table(nsim, q, mult):\n", | |
| "\n", | |
| " rows = r\"\"\"\n", | |
| " <tr><td>$\\rho$</td><td>Model FDP</td><td>Directional FDP</td><td>Model power</td><td>Active set</td></tr>\n", | |
| " \"\"\"\n", | |
| "\n", | |
| " for rho in [0., 0.25, 0.5, 0.75]:\n", | |
| " P0, PA, full_model_FDP, directional_FDP, full_model_power, active_size = simulate(rho=rho, q=q, nsim=nsim, mult=mult)\n", | |
| " rows += \"\"\"\n", | |
| " <tr><td>%0.2f</td><td>%0.3f</td><td>%0.3f</td><td>%0.3f</td><td>%0.1f</td></tr>\n", | |
| " \"\"\" % (rho, np.mean(full_model_FDP['BH']), np.mean(directional_FDP['BH']), np.mean(full_model_power['BH']), np.mean(active_size))\n", | |
| "\n", | |
| " table = \"\"\"\n", | |
| " <table>\n", | |
| " %s\n", | |
| " </table>\"\"\" % rows\n", | |
| "\n", | |
| " return table" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## $\\kappa=0.7$" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 7, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "\n", | |
| " <table>\n", | |
| " \n", | |
| " <tr><td>$\\rho$</td><td>Model FDP</td><td>Directional FDP</td><td>Model power</td></tr>\n", | |
| " \n", | |
| " <tr><td>0.00</td><td>0.042</td><td>0.042</td><td>0.666</td></tr>\n", | |
| " \n", | |
| " <tr><td>0.25</td><td>0.041</td><td>0.041</td><td>0.684</td></tr>\n", | |
| " \n", | |
| " <tr><td>0.50</td><td>0.050</td><td>0.050</td><td>0.619</td></tr>\n", | |
| " \n", | |
| " <tr><td>0.75</td><td>0.108</td><td>0.108</td><td>0.449</td></tr>\n", | |
| " \n", | |
| " </table>" | |
| ], | |
| "text/plain": [ | |
| "<IPython.core.display.HTML object>" | |
| ] | |
| }, | |
| "execution_count": 7, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "HTML(make_table(100, 0.2, 0.7))" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## $\\kappa=0.4$\n", | |
| "\n", | |
| "You can of course choose $\\alpha$ so small so that power goes down. A multiple of 0.4 here seems pretty low and one can imagine that there are many more coefficients at this point to test." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 8, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "\n", | |
| " <table>\n", | |
| " \n", | |
| " <tr><td>$\\rho$</td><td>Model FDP</td><td>Directional FDP</td><td>Model power</td></tr>\n", | |
| " \n", | |
| " <tr><td>0.00</td><td>0.083</td><td>0.083</td><td>0.038</td></tr>\n", | |
| " \n", | |
| " <tr><td>0.25</td><td>0.046</td><td>0.046</td><td>0.021</td></tr>\n", | |
| " \n", | |
| " <tr><td>0.50</td><td>0.087</td><td>0.087</td><td>0.045</td></tr>\n", | |
| " \n", | |
| " <tr><td>0.75</td><td>0.112</td><td>0.112</td><td>0.051</td></tr>\n", | |
| " \n", | |
| " </table>" | |
| ], | |
| "text/plain": [ | |
| "<IPython.core.display.HTML object>" | |
| ] | |
| }, | |
| "execution_count": 8, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "HTML(make_table(100, 0.2, 0.4))" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## $\\kappa=1.$\n", | |
| "\n", | |
| "If $\\kappa$ is too large, then too few variables will be selected. Power would be expected to go down because we have\n", | |
| "not included enough. Though FDP seems under control.\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 9, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "\n", | |
| " <table>\n", | |
| " \n", | |
| " <tr><td>$\\rho$</td><td>Model FDP</td><td>Directional FDP</td><td>Model power</td></tr>\n", | |
| " \n", | |
| " <tr><td>0.00</td><td>0.001</td><td>0.001</td><td>0.240</td></tr>\n", | |
| " \n", | |
| " <tr><td>0.25</td><td>0.000</td><td>0.000</td><td>0.248</td></tr>\n", | |
| " \n", | |
| " <tr><td>0.50</td><td>0.013</td><td>0.013</td><td>0.254</td></tr>\n", | |
| " \n", | |
| " <tr><td>0.75</td><td>0.081</td><td>0.081</td><td>0.206</td></tr>\n", | |
| " \n", | |
| " </table>" | |
| ], | |
| "text/plain": [ | |
| "<IPython.core.display.HTML object>" | |
| ] | |
| }, | |
| "execution_count": 9, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "HTML(make_table(100, 0.2, 1.))" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": true | |
| }, | |
| "outputs": [], | |
| "source": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": true | |
| }, | |
| "outputs": [], | |
| "source": [] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python 2", | |
| "language": "python", | |
| "name": "python2" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 2 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython2", | |
| "version": "2.7.11" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 0 | |
| } |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment