Skip to content

Instantly share code, notes, and snippets.

@jseabold
Created October 31, 2012 14:41
Show Gist options
  • Select an option

  • Save jseabold/3987408 to your computer and use it in GitHub Desktop.

Select an option

Save jseabold/3987408 to your computer and use it in GitHub Desktop.
Regression Diagnostics - Plotting
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "example_regression_plots"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas\n",
"import matplotlib.pyplot as plt\n",
"import statsmodels.api as sm\n",
"from statsmodels.formula.api import ols"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Duncan's Prestige Dataset"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Load the Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you have rpy installed and the car package for R, Pandas can load this data from R. Otherwise, we grab it from the net or the great <a href=\"http://vincentarelbundock.github.com/Rdatasets/\">Rdatasets package</a>."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"try:\n",
" from pandas.rpy import load_data\n",
" prestige = load_data('Duncan', 'car')\n",
"except:\n",
" url = 'http://eagle1.american.edu/~js2796a/Duncan.csv'\n",
" prestige = pandas.read_csv(url, index_col=0)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"prestige.head()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"prestige_model = ols(\"prestige ~ income + education\", data=prestige).fit()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print prestige_model.summary()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Influence plots"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Influence plots show the (externally) studentized residuals vs. the leverage of each observation as measured by the hat matrix.\n",
"\n",
"Externally studentized residuals are residuals that are scaled by their standard deviation where \n",
"\n",
"$$var(\\\\hat{\\epsilon}_i)=\\hat{\\sigma}^2_i(1-h_{ii})$$\n",
"\n",
"with\n",
"\n",
"$$\\hat{\\sigma}^2_i=\\frac{1}{n - p - 1 \\;\\;}\\sum_{j}^{n}\\;\\;\\;\\forall \\;\\;\\; j \\neq i$$\n",
"\n",
"$n$ is the number of observations and $p$ is the number of regressors. $h_{ii}$ is the $i$-th diagonal element of the hat matrix\n",
"\n",
"$$H=X(X^{\\;\\prime}X)^{-1}X^{\\;\\prime}$$\n",
"\n",
"The influence of each point can be visualized by the criterion keyword argument. Options are Cook's distance and DFFITS, two measures of influence."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fig, ax = plt.subplots(figsize=(12,8))\n",
"fig = sm.graphics.influence_plot(prestige_model, ax=ax, criterion=\"cooks\")"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see there are a few worrisome observations. Both contractor and reporter have low leverage but a large residual. <br />\n",
"RR.engineer has small residual and large leverage. Conductor and minister have both high leverage and large residuals, and, <br />\n",
"therefore, large influence."
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Partial Regression Plots"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we are doing multivariate regressions, we cannot just look at individual bivariate plots to discern relationships. <br />\n",
"Instead, we want to look at the relationship of the dependent variable and independent variables conditional on the other <br />\n",
"independent variables. We can do this through using partial regression plots, otherwise known as added variable plots. <br />\n",
"\n",
"In a partial regression plot, to discern the relationship between the response variable and the $k$-th variabe, we compute <br />\n",
"the residuals by regressing the response variable versus the independent variables excluding $X_k$. We can denote this by <br />\n",
"$X_{\\sim k}$. We then compute the residuals by regressing $X_k$ on $X_{\\sim k}$. The partial regression plot is the plot <br />\n",
"of the former versus the latter residuals. <br />\n",
"\n",
"The notable points of this plot are that the fitted line has slope $\\beta_k$ and intercept zero. The residuals of this plot <br />\n",
"are the same as those of the least squares fit of the original model with full $X$. You can discern the effects of the <br />\n",
"individual data values on the estimation of a coefficient easily. If obs_labels is True, then these points are annotated <br />\n",
"with their observation label. You can also see the violation of underlying assumptions such as homooskedasticity and <br />\n",
"linearity."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#TODO: This gives a poor error message from patsy, fix?\n",
"#sm.graphics.plot_partregress(\"prestige\", \"income\", [\"income\", \"educ\"], data=prestige)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fix, ax = plt.subplots(figsize=(12,14))\n",
"fig = sm.graphics.plot_partregress(\"prestige\", \"income\", [\"education\"], data=prestige, ax=ax)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see the partial regression plot confirms the influence of conductor, minister, and RR.engineer on the partial relationship between income and prestige. The cases greatly decrease the effect of income on prestige. Dropping these cases confirms this."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"subset = ~prestige.index.isin([\"conductor\", \"RR.engineer\", \"minister\"])\n",
"prestige_model2 = ols(\"prestige ~ income + education\", data=prestige, subset=subset).fit()\n",
"print prestige_model2.summary()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a quick check of all the regressors, you can use plot_partregress_grid. These plots will not label the <br />\n",
"points, but you can use them to identify problems and then use plot_partregress to get more information."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fig = plt.figure(figsize=(12,8))\n",
"fig = sm.graphics.plot_partregress_grid(prestige_model, fig=fig)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Component-Component plus Residual (CCPR) Plots"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The CCPR plot provides a way to judge the effect of one regressor on the <br />\n",
"response variable by taking into account the effects of the other <br />\n",
"independent variables. The partial residuals plot is defined as <br />\n",
"$\\text{Residuals} + B_iX_i \\text{ }\\text{ }$ versus $X_i$. The component adds $B_iX_i$ versus <br />\n",
"$X_i$ to show where the fitted line would lie. Care should be taken if $X_i$ <br />\n",
"is highly correlated with any of the other independent variables. If this <br />\n",
"is the case, the variance evident in the plot will be an underestimate of <br />\n",
"the true variance."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fig, ax = plt.subplots(figsize=(12, 8))\n",
"fig = sm.graphics.plot_ccpr(prestige_model, \"education\", ax=ax)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see the relationship between the variation in prestige explained by education conditional on income seems to be linear, though you can see there are some observations that are exerting considerable influence on the relationship. We can quickly look at more than one variable by using plot_ccpr_grid."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fig = plt.figure(figsize=(12, 8))\n",
"fig = sm.graphics.plot_ccpr_grid(prestige_model, fig=fig)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Regression Plots"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot_regress_exog function is a convenience function that gives a 2x2 plot containing, the dependent variable vs. the independent variable chosen, the residuals of the model vs. the chosen independent variable, the fitted values vs. the chosen independent variable, and the fitted values plus the unexplained variation versus the independent variabel. This function can be used for quickly checking modeling assumptions."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fig = plt.figure(figsize=(12,8))\n",
"fig = sm.graphics.plot_regress_exog(prestige_model, \"education\", fig=fig)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Fit Plot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot_fit function plots the fitted values versus a chosen independent variable. It includes prediction confidence intervals and optionally plots the true dependent variable."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fig, ax = plt.subplots(figsize=(12, 8))\n",
"fig = sm.graphics.plot_fit(prestige_model, \"education\", ax=ax)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Statewide Crime 2009 Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compare the following to http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter4/statareg_self_assessment_answers4.htm\n",
"\n",
"Though the data here is not the same as in that example. You could run that example by uncommenting the necessary below cells."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#dta = pandas.read_csv(\"http://www.stat.ufl.edu/~aa/social/csv_files/statewide-crime-2.csv\")\n",
"#dta = dta.set_index(\"State\", inplace=True).dropna()\n",
"#dta.rename(columns={\"VR\" : \"crime\",\n",
"# \"MR\" : \"murder\",\n",
"# \"M\" : \"pctmetro\",\n",
"# \"W\" : \"pctwhite\",\n",
"# \"H\" : \"pcths\",\n",
"# \"P\" : \"poverty\",\n",
"# \"S\" : \"single\"\n",
"# }, inplace=True)\n",
"#\n",
"#crime_model = ols(\"murder ~ pctmetro + poverty + pcths + single\", data=dta).fit()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"dta = sm.datasets.statecrime.load_pandas().data"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"crime_model = ols(\"murder ~ urban + poverty + hs_grad + single\", data=dta).fit()\n",
"print crime_model.summary()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Partial Regression Plots"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fig = plt.figure(figsize=(12,8))\n",
"fig = sm.graphics.plot_partregress_grid(crime_model, fig=fig)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fig, ax = plt.subplots(figsize=(12,8))\n",
"fig = sm.graphics.plot_partregress(\"murder\", \"hs_grad\", [\"urban\", \"poverty\", \"single\"], ax=ax, data=dta)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Leverage-Resid<sup>2</sup> Plot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Closely related to the influence_plot is the leverage-resid<sup>2</sup> plot."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fig, ax = plt.subplots(figsize=(8,6))\n",
"fig = sm.graphics.plot_leverage_resid2(crime_model, ax=ax)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Influence Plot"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fig, ax = plt.subplots(figsize=(8,6))\n",
"fig = sm.graphics.influence_plot(crime_model, ax=ax)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Using robust regression to correct for outliers."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Part of the problem here in recreating the Stata results is that M-estimators are not robust to leverage points. MM-estimators should do better with this examples."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from statsmodels.formula.api import rlm"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"rob_crime_model = rlm(\"murder ~ urban + poverty + hs_grad + single\", data=dta, \n",
" M=sm.robust.norms.TukeyBiweight(3)).fit(conv=\"weights\")\n",
"print rob_crime_model.summary()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#rob_crime_model = rlm(\"murder ~ pctmetro + poverty + pcths + single\", data=dta, M=sm.robust.norms.TukeyBiweight()).fit(conv=\"weights\")\n",
"#print rob_crime_model.summary()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TODO: Add Influence to RLM cf http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter4/statareg_self_assessment_answers4.htm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There aren't yet an influence diagnostics as part of RLM, but we can recreate them."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"weights = rob_crime_model.weights\n",
"idx = weights > 0\n",
"X = rob_crime_model.model.exog[idx]\n",
"ww = weights[idx] / weights[idx].mean()\n",
"hat_matrix_diag = ww*(X*np.linalg.pinv(X).T).sum(1)\n",
"resid = rob_crime_model.resid\n",
"resid2 = resid**2\n",
"resid2 /= resid2.sum()\n",
"nobs = int(idx.sum())\n",
"hm = hat_matrix_diag.mean()\n",
"rm = resid2.mean()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from statsmodels.graphics import utils\n",
"fig, ax = plt.subplots(figsize=(12,8))\n",
"ax.plot(resid2[idx], hat_matrix_diag, 'o')\n",
"ax = utils.annotate_axes(range(nobs), labels=rob_crime_model.model.data.row_labels[idx], \n",
" points=zip(resid2[idx], hat_matrix_diag), offset_points=[(-5,5)]*nobs,\n",
" size=\"large\", ax=ax)\n",
"ax.set_xlabel(\"resid2\")\n",
"ax.set_ylabel(\"leverage\")\n",
"ylim = ax.get_ylim()\n",
"ax.vlines(rm, *ylim)\n",
"xlim = ax.get_xlim()\n",
"ax.hlines(hm, *xlim)\n",
"ax.margins(0,0)"
],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment