{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Derivation of the Normal Equations \n",
"\n",
"The Normal Equations, represented in matrix form as\n",
"\n",
"\n",
"$$\n",
"(X^{T}X)\\hat{\\beta} = X^{T}y\n",
"$$\n",
"\n",
"are utilized in determining coefficent values associated with multiple linear regression models. The matrix representation is a compact form of of the full model specification, which is commonly represented as\n",
"\n",
"$$\n",
"y = \\beta_{0} + \\beta_{1}x_{1} + \\beta_{2}x_{2} + \\cdots + \\beta_{k}x_{k} + \\varepsilon\n",
"$$\n",
"\n",
"where $\\varepsilon$ represents the error term, and \n",
"\n",
"$$\\sum_{i=1}^{n} \\varepsilon_{i} = 0.$$\n",
"\n",
"For a dataset with $n$ records by $k$ explanatory variables per record, the components of the Normal Equations are:\n",
"\n",
"* $\\hat{\\beta} = (\\hat{\\beta}_{0}, \\hat{\\beta}_{1},...,\\hat{\\beta}_{k})^{T}$, a vector of $(k+1)$ coefficents (one for each of the k explanatory variables plus one for the intercept term) \n",
"\n",
"* ${X}$, an $n$ by $(k+1)$-dimensional matrix of explanatory variables, with the first column consisting entirely of 1's \n",
"\n",
"* ${y} = (y_{1}, y_{2},...,y_{n})$, the response variable\n",
"\n",
"\n",
"The task is to solve for the $(k+1)$ $\\beta_{j}$'s such that $\\hat{\\beta}_{0}, \\hat{\\beta}_{1},...,\\hat{\\beta}_{k}$ minimize\n",
"\n",
"$$\n",
"\\sum_{i=1}^{n} \\hat{\\varepsilon}^{2}_{i} = \\sum_{i=1}^{n} (y_{i} - \\hat{\\beta}_{0} - \\hat{\\beta}_{1}x_{i1} - \\hat{\\beta}_{2}x_{i2} - \\cdots - \\hat{\\beta}_{k}x_{ik})^2.\n",
"$$\n",
"\n",
"\n",
"The Normal Equations can be derived using both Least-Squares and Maximum likelihood Estimation. We'll demonstrate both approaches.\n",
"\n",
"\n",
"### Least-Squares Derivation\n",
"\n",
"An advantage of the Least-Squares approach is that no distributional assumption is necessary (unlike Maximum Likelihood Estimation). For $\\hat{\\beta}_{0}, \\hat{\\beta}_{1},...,\\hat{\\beta}_{k}$, we seek estimators that minimize the sum of squared deviations between the $n$ response variables and the predicted values, $\\hat{y}$. The objective is to minimize\n",
"\n",
"\n",
"$$\n",
"\\sum_{i=1}^{n} \\hat{\\varepsilon}^{2}_{i} = \\sum_{i=1}^{n} (y_{i} - \\hat{\\beta}_{0} - \\hat{\\beta}_{1}x_{i1} - \\hat{\\beta}_{2}x_{i2} - \\cdots - \\hat{\\beta}_{k}x_{ik})^2.\n",
"$$\n",
"\n",
"\n",
"Using the more-compact matrix notation, our model can be represented as $y = X^{T}\\beta + \\varepsilon$. Isolating and squaring the error term yields\n",
"\n",
"$$\n",
"\\hat \\varepsilon^T \\hat \\varepsilon = \\sum_{i=1}^{n} (y - X\\hat{\\beta})^{T}(y - X\\hat{\\beta}).\n",
"$$\n",
"\n",
"Expanding the right-hand side and combining terms results in\n",
"\n",
"$$\n",
"\\hat \\varepsilon^T \\hat \\varepsilon = y^{T}y - 2y^{T}X\\hat{\\beta} + \\hat{\\beta}X^{T}X\\hat{\\beta}\n",
"$$\n",
"\n",
"\n",
"To find the value of $\\hat{\\beta}$ that minimizes $\\hat \\varepsilon^T \\hat \\varepsilon$, we differentiate $\\hat \\varepsilon^T \\hat \\varepsilon$ with respect to \n",
"$\\hat{\\beta}$, and set the result to zero:\n",
"\n",
"\n",
"$$\n",
"\\frac{\\partial \\hat{\\varepsilon}^{T}\\hat{\\varepsilon}}{\\partial \\hat{\\beta}} = -2X^{T}y + 2X^{T}X\\hat{\\beta} = 0\n",
"$$\n",
"\n",
"Which can then be solved for $\\hat{\\beta}$:\n",
"\n",
"\n",
"$$\n",
"\\hat{\\beta} = {(X^{T}X)}^{-1}{X}^{T}y\n",
"$$\n",
"\n",
"Since $\\hat{\\beta}$ minimizes the sum of squares, $\\hat{\\beta}$ is called the *Least-Squares Estimator.* \n",
" \n",
" \n",
" \n",
"### Maximum Likelihood Derivation\n",
"\n",
"For the Maximum Likelihood derivation, $X$, $y$ and $\\hat{\\beta}$ are the same as described in the Least-Squares derivation, and the model still follows the form\n",
"\n",
"$$\n",
"y = X^{T}\\beta + \\varepsilon\n",
"$$ \n",
"\n",
"but here we assume the $\\varepsilon_{i}$ are $iid$ and follow a zero-mean normal distribution:\n",
"\n",
"$$\n",
"N(\\varepsilon_{i}; 0, \\sigma^{2}) = \\frac{1}{\\sqrt{2\\pi\\sigma^{2}}} e^{- \\frac{(y_{i}-X^{T}\\hat{\\beta})^{2}}{2\\sigma^{2}}}.\n",
"$$\n",
"\n",
"In addition, the responses, $y_{i}$, are each assumed to follow a normal distribution. For $n$ observations, the likelihood function is\n",
"\n",
"$$\n",
"L(\\beta) = \\Big(\\frac{1}{\\sqrt{2\\pi\\sigma^{2}}}\\Big)^{n} e^{-(y-X\\beta)^{T}(y-X\\beta)/2\\sigma^{2}}.\n",
"$$\n",
"\n",
"\n",
"$Ln(L(\\beta))$, the Log-Likelihood, is therefore\n",
"\n",
"\n",
"$$\n",
"Ln(L(\\beta)) = -\\frac{n}{2}Ln(2\\pi) -\\frac{n}{2}Ln(\\sigma^{2})-\\frac{1}{2\\sigma^{2}}(y-X\\beta)^{T}(y-X\\beta).\n",
"$$\n",
"\n",
"Taking derivatives with respect to $\\beta$ and setting the result equal to zero results in\n",
"\n",
"\n",
"$$\n",
"\\frac{\\partial Ln(L(\\beta))}{\\partial \\beta} = -2X^{T}y -2X^{T}X\\beta = 0.\n",
"$$\n",
"\n",
"Upon rearranging and solving for $\\beta$, we obtain\n",
"\n",
"$$\n",
"\\hat{\\beta} = {(X^{T}X)}^{-1}{X}^{T}y,\n",
"$$\n",
"\n",
"which is identical to the result obtained from the Least-Squares approach. \n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}