{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Derivation of the Normal Equations \n",
"\n",
"The Normal Equations, represented in matrix form as\n",
"\n",
"\n",
"$$\n",
"(X^{T}X)\\hat{\\beta} = X^{T}y\n",
"$$\n",
"\n",
"are utilized in determining coefficent values associated with multiple linear regression models. The matrix representation is a compact form of of the full model specification, which is commonly represented as\n",
"\n",
"$$\n",
"y = \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{2} + \\cdots + \\beta_{k}x_{k} + \\varepsilon\n",
"$$\n",
"\n",
"where $\\varepsilon$ represents the error term, and \n",
"\n",
"$$\\sum_{i=1}^{n} \\varepsilon_{i} = 0.$$\n",
"\n",
"For a dataset with $n$ records by $k$ explanatory variables per record, the components of the Normal Equations are:\n",
"\n",
"* $\\hat{\\beta} = (\\hat{\\beta}_{0}, \\hat{\\beta}_{1},...,\\hat{\\beta}_{k})^{T}$, a vector of $(k+1)$ coefficents (one for each of the k explanatory variables plus one for the intercept term) \n",
"\n",
"* ${X}$, an $n$ by $(k+1)$-dimensional matrix of explanatory variables, with the first column consisting entirely of 1's \n",
"\n",
"* ${y} = (y_{1}, y_{2},...,y_{n})$, the response variable\n",
"\n",
"\n",
"The task is to solve for the $(k+1)$ $\\beta_{j}$'s such that $\\hat{\\beta}_{0}, \\hat{\\beta}_{1},...,\\hat{\\beta}_{k}$ minimize\n",
"\n",
"$$\n",
"\\sum_{i=1}^{n} \\hat{\\varepsilon}^{2}_{i} = \\sum_{i=1}^{n} (y_{i} - \\hat{\\beta}_{0} - \\hat{\\beta}_{1}x_{i1} - \\hat{\\beta}_{2}x_{i2} - \\cdots - \\hat{\\beta}_{k}x_{ik})^2.\n",
"$$\n",
"\n",
"\n",
"The Normal Equations can be derived using both Least-Squares and Maximum likelihood Estimation. We'll demonstrate both approaches.\n",
"\n",
"\n",
"### Least-Squares Derivation\n",
"\n",
"An advantage of the Least-Squares approach is that no distributional assumption is necessary (unlike Maximum Likelihood Estimation). For $\\hat{\\beta}_{0}, \\hat{\\beta}_{1},...,\\hat{\\beta}_{k}$, we seek estimators that minimize the sum of squared deviations between the $n$ response variables and the predicted values, $\\hat{y}$. The objective is to minimize\n",
"\n",
"\n",
"$$\n",
"\\sum_{i=1}^{n} \\hat{\\varepsilon}^{2}_{i} = \\sum_{i=1}^{n} (y_{i} - \\hat{\\beta}_{0} - \\hat{\\beta}_{1}x_{i1} - \\hat{\\beta}_{2}x_{i2} - \\cdots - \\hat{\\beta}_{k}x_{ik})^2.\n",
"$$\n",
"\n",
"\n",
"Using the more-compact matrix notation, our model can be represented as $y = X^{T}\\beta + \\varepsilon$. Isolating and squaring the error term yields\n",
"\n",
"$$\n",
"\\hat \\varepsilon^T \\hat \\varepsilon = \\sum_{i=1}^{n} (y - X\\hat{\\beta})^{T}(y - X\\hat{\\beta}).\n",
"$$\n",
"\n",
"Expanding the right-hand side and combining terms results in\n",
"\n",
"$$\n",
"\\hat \\varepsilon^T \\hat \\varepsilon = y^{T}y - 2y^{T}X\\hat{\\beta} + \\hat{\\beta}X^{T}X\\hat{\\beta}\n",
"$$\n",
"\n",
"\n",
"To find the value of $\\hat{\\beta}$ that minimizes $\\hat \\varepsilon^T \\hat \\varepsilon$, we differentiate $\\hat \\varepsilon^T \\hat \\varepsilon$ with respect to \n",
"$\\hat{\\beta}$, and set the result to zero:\n",
"\n",
"\n",
"$$\n",
"\\frac{\\partial \\hat{\\varepsilon}^{T}\\hat{\\varepsilon}}{\\partial \\hat{\\beta}} = -2X^{T}y + 2X^{T}X\\hat{\\beta} = 0\n",
"$$\n",
"\n",
"Which can then be solved for $\\hat{\\beta}$:\n",
"\n",
"\n",
"$$\n",
"\\hat{\\beta} = {(X^{T}X)}^{-1}{X}^{T}y\n",
"$$\n",
"\n",
"Since $\\hat{\\beta}$ minimizes the sum of squares, $\\hat{\\beta}$ is called the *Least-Squares Estimator.* \n",
" \n",
" \n",
" \n",
"### Maximum Likelihood Derivation\n",
"\n",
"For the Maximum Likelihood derivation, $X$, $y$ and $\\hat{\\beta}$ are the same as described in the Least-Squares derivation, and the model still follows the form\n",
"\n",
"$$\n",
"y = X^{T}\\beta + \\varepsilon\n",
"$$ \n",
"\n",
"but here we assume the $\\varepsilon_{i}$ are $iid$ and follow a zero-mean normal distribution:\n",
"\n",
"$$\n",
"N(\\varepsilon_{i}; 0, \\sigma^{2}) = \\frac{1}{\\sqrt{2\\pi\\sigma^{2}}} e^{- \\frac{(y_{i}-X^{T}\\hat{\\beta})^{2}}{2\\sigma^{2}}}.\n",
"$$\n",
"\n",
"In addition, the responses, $y_{i}$, are each assumed to follow a normal distribution. For $n$ observations, the likelihood function is\n",
"\n",
"$$\n",
"L(\\beta) = \\Big(\\frac{1}{\\sqrt{2\\pi\\sigma^{2}}}\\Big)^{n} e^{-(y-X\\beta)^{T}(y-X\\beta)/2\\sigma^{2}}.\n",
"$$\n",
"\n",
"\n",
"$Ln(L(\\beta))$, the Log-Likelihood, is therefore\n",
"\n",
"\n",
"$$\n",
"Ln(L(\\beta)) = -\\frac{n}{2}Ln(2\\pi) -\\frac{n}{2}Ln(\\sigma^{2})-\\frac{1}{2\\sigma^{2}}(y-X\\beta)^{T}(y-X\\beta).\n",
"$$\n",
"\n",
"Taking derivatives with respect to $\\beta$ and setting the result equal to zero results in\n",
"\n",
"\n",
"$$\n",
"\\frac{\\partial Ln(L(\\beta))}{\\partial \\beta} = -2X^{T}y -2X^{T}X\\beta = 0.\n",
"$$\n",
"\n",
"Upon rearranging and solving for $\\beta$, we obtain\n",
"\n",
"$$\n",
"\\hat{\\beta} = {(X^{T}X)}^{-1}{X}^{T}y,\n",
"$$\n",
"\n",
"which is identical to the result obtained from the Least-Squares approach. \n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}