{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Estimating Logistic Regression Coefficents Manually in R\n",
"\n",
" \n",
"In this post, we'll highlight the parameter estimation routines that are called behind the scences upon invocation of R's `glm` function. Specifically, we'll focus on how parameters of a Logistic Regression model are estimated when fit to data with a dicotomous response. \n",
"\n",
"\n",
"R's `glm` function is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution [1](#Footnotes:). This function conceals a good deal of the complexity behind a simple interface, making it easy to overlook the calculations that determine a model's coefficents. The goal of this post is to shed some light on the setup and execution of those calcuations. \n",
" \n",
"\n",
"\n",
"## Background\n",
"\n",
"In a Generalized Linear Model, the response may have any distribution from the exponential family, and rather than assuming the mean is a linear function of the explnatory variables, we assume that a function of the mean, or the *link function*, is a linear function of the explnatory variables.\n",
" \n",
"Logistic Regression is used for modeling data with a categorical response. Although it's possible to model multinomial data using Logistic Regression, in this post we'll limit our analysis to models having a dichotomous response, where the outcome can be classified as 'Yes/No', 'True/False', '1/0', 'Good/Bad', etc...\n",
" \n",
"The Logistic Regression model is a Generalized Linear Model whose canonical link is the logit, or log-odds:\n",
"\n",
"$$\n",
"Ln \\Big(\\frac{\\pi_{i}}{1 - \\pi_{i}} \\Big) = \\beta_{0} + \\beta_{1}{x}_{i1} + \\cdots + \\beta_{p}{x}_{ip}\n",
"$$\n",
"\n",
"for $i = (1, \\cdots , n)$.\n",
"\n",
"Solving the logit for $\\pi_{i}$, which is a stand-in for the predicted probability associated with $x_{i}$, yields\n",
"\n",
"$$\n",
"\\pi_{i} = \\frac {e^{\\beta_{0} + \\beta_{1}{x}_{i1} + \\cdots + \\beta_{p}{x}_{ip}}}{1 + e^{\\beta_{0} + \\beta_{1}{x}_{i1} + \\cdots + \\beta_{p}{x}_{ip}}} = \\frac {1}{1 + e^{-(\\beta_{0} + \\beta_{1}{x}_{i1} + \\cdots + \\beta_{p}{x}_{ip})}},\n",
"$$\n",
"\n",
"where $-\\infty<x_{i}<+\\infty$ and $0<\\pi_{i}<1$.\n",
"\n",
"In other words, the expression for $\\pi_{i}$ maps any real-valued $x_{i}$ to a positive probability between 0 and 1.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"## Parameter Estimation\n",
"\n",
"Maximum Likelihood Estimation can be used to determine the parameters of a Logistic Regression model, which entails finding the set of parameters for which the probability of the observed data is greatest. The objective is to estimate the $(p+1)$ unknown $\\beta_{0}, \\cdots ,\\beta_{p}$.\n",
"\n",
"Let $Y_{i}$ represent independent, dicotomous response values for each of $n$ observations, where $Y_{i}=1$ denotes a success and $Y_{i}=0$ denotes a failure. The density function of a single observation $Y_{i}$ is given by\n",
"\n",
"$$\n",
"p(y_{i}) = \\pi_{i}^{y_{i}}(1-\\pi_{i})^{1-y_{i}}\n",
",$$\n",
" \n",
"and the corresponding Likelihood function is\n",
"\n",
"$$\n",
"L(\\beta) = \\prod_{i=1}^{n} \\pi_{i}^{y_{i}}(1-\\pi_{i})^{1-y_{i}}.\n",
"$$\n",
"\n",
"\n",
"Taking the natural log of the Maximum Likelihood Estimate results in the Log-Likelihood function:\n",
"\n",
"$$\n",
"l(\\beta) = Ln(L(\\beta)) = Ln \\Big(\\prod_{i=1}^{n} \\pi_{i}^{y_{i}}(1-\\pi_{i})^{1-y_{i}} \\Big)\n",
"= \\sum_{i=1}^{n} y_{i} Ln(\\pi_{i}) + (1-y_{i})Ln(1-\\pi_{i})\n",
"$$\n",
"\n",
"$$\n",
"= \\sum_{i=1}^{n} y_{i} Ln \\Big(\\frac {e^{\\beta_{0} + \\beta_{1}{x}_{i1} + \\cdots + \\beta_{p}{x}_{ip}}}{1 + e^{\\beta_{0} + \\beta_{1}{x}_{i1} + \\cdots + \\beta_{p}{x}_{ip}}} \\Big) + (1 - y_{i}) Ln \\Big(\\frac {1}{1 + e^{\\beta_{0} + \\beta_{1}{x}_{i1} + \\cdots + \\beta_{p}{x}_{ip}}} \\Big)\n",
"$$\n",
"\n",
"$$\n",
"= \\sum_{i=1}^{n} y_{i}(\\beta_{0} + \\beta_{1}{x}_{i1} + \\cdots + \\beta_{p}{x}_{ip}) - Ln(1 + e^{\\beta_{0} + \\beta_{1}{x}_{i1} + \\cdots + \\beta_{p}{x}_{ip}}) \\hspace{1.5cm}(*)\n",
"$$\n",
"\n",
"The first-order partial derivatives of the Log-Likelihood are calculated and set to zero for each $\\beta_{k}$, $k = 0, 1, \\cdots, p$\n",
"\n",
"$$\n",
"\\frac {\\partial l(\\beta)}{\\partial \\beta_{k}} = \\sum_{i=1}^{n} y_{i}x_{ik} - \\pi_{i}x_{ik} = \\sum_{i=1}^{n} x_{ik}(y_{i} - \\pi_{i}) = 0,\n",
"$$\n",
"\n",
"which can be represented in matrix notation as\n",
"\n",
"$$\n",
"\\frac {\\partial l(\\beta)}{\\partial \\beta} = X^{T}(y - \\pi),\n",
"$$\n",
"\n",
"where $X^{T}$ is a $(p+1)$-by-$n$ matrix, and $(y - \\pi)$ is a $n$-by-$1$ vector.\n",
"\n",
"The vector of first-order partial derivatives of the Log-Likelihood function is referred to as the *score function* in statistical literature, and is typically represented as $U$.\n",
" \n",
"These $(p+1)$ equations are solved simultaneously to obtain the parameter estimates $\\hat\\beta_{0}, \\cdots ,\\hat\\beta_{p}$. Each solution specifies a critical-point which will be either a maximum or a minimum. The critical point will be a maximum if the matrix of second partial derivatives is negative definite (which means every element on the diagonal of the matrix is less than zero). \n",
" \n",
"The matrix of second partial derivatives is given by\n",
"\n",
"$$\n",
"\\frac{\\partial^{2} l(\\beta)}{{\\partial \\beta_{k}}{\\partial \\beta_{k}}^{T}} = - \\sum_{i=1}^{n} x_{ik}\\pi_{i}(1-\\pi_{i}){x_{ik}}^{T},\n",
"$$\n",
"\n",
"represented in matrix form as:\n",
"\n",
"$$\n",
"\\frac{\\partial^{2} l(\\beta)}{{\\partial \\beta}{\\partial \\beta}^{T}} = -X^{T}WX,\n",
"$$\n",
"\n",
"where $W$ is an $n$-by-$n$ diagonal matrix of weights with each element equal to $\\pi_{i}(1 - \\pi_{i})$ for Logistic Regression models (in general, the weights matrix $W$ will have entries inversely proportional to the variance of the response). \n",
"\n",
"\n",
"Since no closed-form solution exists for determining Logistic Regression model coefficents (as exists for Linear Regression models), iterative techniques must be employed. \n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"## Fitting the Model\n",
"\n",
"Two distinct but related iterative methods can be utilized in determining model coefficents: the *Newton-Raphson method* and *Fisher Scoring*. The Newton-Raphson method relies on the matrix of second partial derivatives, also known as the [Hessian](https://en.wikipedia.org/wiki/Hessian_matrix). The Newton-Raphson update formula is:\n",
"\n",
"\n",
"$$\n",
"\\beta^{(t+1)} = \\beta^{(t)} - (H^{(t)})^{-1}U^{(t)}\n",
"$$\n",
"\n",
"\n",
"where:\n",
"\n",
"* $\\beta^{(t+1)}$ = the vector of updated coefficent estimates\n",
"* $\\beta^{(t)}$ = the vector of coefficent estimates from the previous iteration\n",
"* $(H^{(t)})^{-1}$ = the inverse of the Hessian, $\\Big(\\frac{\\partial^{2} l(\\beta)}{{\\partial \\beta}{\\partial \\beta}^{T}}\\Big)^{-1}$\n",
"* $U^{(t)}$ = the vector of first-order partial derivatives of the Log-Likelihood function, $\\frac {\\partial l(\\beta)}{\\partial \\beta}$ = $X^{T}(y - \\pi)$\n",
"\n",
"\n",
"The Newton-Raphson method starts with an initial guess for the solution, and obtains a second guess by approximating the function to be maximized in a neighborhood of the initial guess by a second-degree polynomial, and then finding the location of that polynomial's maximum value. This process continues until it converges to the actual solution. The convergence of $\\beta^{t}$ to $\\hat{\\beta}$ is usually fast, with adequate convergence realized after 4-5 iterations [2](#Footnotes:).\n",
" \n",
"*Fisher Scoring* utilizes the *expected information*, $-E\\Big(\\frac{\\partial^{2} l(\\beta)}{{\\partial \\beta}{\\partial \\beta}^{T}}\\Big)$. Let $\\mathcal{I}$ serve as a stand-in for the expected value of the information:\n",
"\n",
"$$\n",
"\\mathcal{I} = -E\\Big(\\frac{\\partial^{2} l(\\beta)}{{\\partial \\beta}{\\partial \\beta}^{T}}\\Big).\n",
"$$\n",
" \n",
" \n",
"Then, the Fisher Scoring update step replaces $-H^{(t)}$ from Newton-Raphson with $\\mathcal{I}^{(t)}$:\n",
"\n",
"\n",
"$$\n",
"\\beta^{(t+1)} = \\beta^{(t)} + (\\mathcal{I}^{(t)})^{-1}U^{(t)},\n",
"$$\n",
"\n",
"$$\n",
"\\hspace{2.3cm} = \\beta^{(t)} + (X^{T}WX)^{-1}X^{T}(y - \\pi),\n",
"$$\n",
"\n",
"\n",
"where:\n",
"\n",
"* $\\beta^{(t+1)}$ = the vector of updated coefficent estimates\n",
"* $\\beta^{(t)}$ = the vector of coefficent estimates from the previous iteration\n",
"* $(\\mathcal{I}^{(t)})^{-1}$ = the inverse of the expected information matrix, $-E \\Big(\\frac{\\partial^{2} l(\\beta)}{{\\partial \\beta}{\\partial \\beta}^{T}}\\Big)^{-1}$\n",
"* $U^{(t)}$ = the vector of first-order partial derivatives of the Log-Likelihood function, $\\frac {\\partial l(\\beta)}{\\partial \\beta}$ = $X^{T}(y - \\pi)$\n",
"\n",
"\n",
"Iteration continues until $\\beta^{(t)}$ stabilizes.\n",
" \n",
" \n",
"For GLM's with a canonical link (of which employing the logit for Logistic Regression is an example), **the observed and expected information are the same**.\n",
"When the response follows an exponential family distribution, and the canonical link function is employed, observed and expected Information coincide so that Fisher Scoring is the same as Newton-Raphson. \n",
"\n",
"\n",
"When the canonical link is used, the second partial derivatives of the Log-Likelihood do not depend on the observation $y_{i}$, and therefore\n",
"\n",
"$$\n",
"\\frac{\\partial^{2} l(\\beta)}{{\\partial \\beta}{\\partial \\beta}^{T}} = E \\Big(\\frac{\\partial^{2} l(\\beta)}{{\\partial \\beta}{\\partial \\beta}^{T}} \\Big).\n",
"$$\n",
"\n",
"Fisher scoring has the advantage that it produces the asymptotic covariance matrix as a by-product.\n",
" \n",
"\n",
"To clarify:\n",
"\n",
"* The *Hessian* is the matrix of second partial derivatives of the Log-Likelihood with respect to the parameters, or $H = \\frac{\\partial^{2} l(\\beta)}{{\\partial \\beta}{\\partial \\beta}^{T}}$.\n",
"\n",
"* The *observed information* is $-\\frac{\\partial^{2} l(\\beta)}{{\\partial \\beta}{\\partial \\beta}^{T}}$.\n",
"\n",
"* The *expected information* is $\\mathcal{I} = E\\Big(-\\frac{\\partial^{2} l(\\beta)}{{\\partial \\beta}{\\partial \\beta}^{T}}\\Big)$.\n",
"\n",
"* The *asymptotic covariance matrix* is $Var(\\hat{\\beta}) = E\\Big(-\\frac{\\partial^{2} l(\\beta)}{{\\partial \\beta}{\\partial \\beta}^{T}}\\Big)^{-1} = (X^{T}WX)^{-1}$.\n",
" \n",
" \n",
"For models employing a canonical link function:\n",
"\n",
"* The observed and expected information are the same, $\\frac{\\partial^{2} l(\\beta)}{{\\partial \\beta}{\\partial \\beta}^{T}} = E\\Big(\\frac{\\partial^{2} l(\\beta)}{{\\partial \\beta}{\\partial \\beta}^{T}}\\Big)$.\n",
"\n",
"* $H = -\\mathcal{I}$, or $\\frac{\\partial^{2} l(\\beta)}{{\\partial \\beta}{\\partial \\beta}^{T}} = E\\Big(-\\frac{\\partial^{2} l(\\beta)}{{\\partial \\beta}{\\partial \\beta}^{T}}\\Big)$.\n",
"\n",
"* The Newton-Raphson and Fisher Scoring algorithms yield identical results.\n",
"\n",
"\n",
"\n",
"\n",
"## Fisher Scoring Implementation in R\n",
"\n",
"The data used for our sample calculation can be obtained [here](https://gist.github.com/jtrive84/835514a76f7afd552c999e4d9134baa8). This data represents O-Ring failures in the 23 pre-*Challenger* space shuttle missions. In this dataset, `TEMPERATURE` will serve as the single explnatory variable which will be used to predict `O_RING_FAILURE`, which is `1` if a failure occurred, `0` otherwise.\n",
"\n",
"Once the parameters have been determined, the model estimate of the probability of success for a given observation can be calculated with:\n",
"\n",
"$$\n",
"\\hat\\pi_{i} = \\frac {e^{\\hat\\beta_{0} + \\hat\\beta_{1}{x}_{i1} + \\cdots + \\hat\\beta_{p}{x}_{ip}}}{1 + e^{\\hat\\beta_{0} + \\hat\\beta_{1}{x}_{i1} + \\cdots + \\hat\\beta_{p}{x}_{ip}}}\n",
"$$\n",
"\n",
" \n",
"In the following code segment, we define a single function, `getCoefficients`, which returns the estimated model coefficents as a $(p+1)$-by-$1$ matrix. In addition, the function returns the number of scoring iterations, fitted values and the variance-covariance matrix for the estimated coefficients:\n",
"\n",
"```R\n",
"getCoefficients <- function(design_matrix, response_vector, epsilon=.0001) {\n",
" # =========================================================================\n",
" # design_matrix `X` => n-by-(p+1) |\n",
" # response_vector `y` => n-by-1 |\n",
" # probability_vector `p` => n-by-1 |\n",
" # weights_matrix `W` => n-by-n |\n",
" # epsilon => threshold above which iteration continues |\n",
" # =========================================================================\n",
" # n => # of observations |\n",
" # (p + 1) => # of parameterss, +1 for intercept term |\n",
" # =========================================================================\n",
" # U => First derivative of Log-Likelihood with respect to |\n",
" # each beta_i, i.e. `Score Function`: X_transpose * (y - p) |\n",
" # |\n",
" # I => Second derivative of Log-Likelihood with respect to |\n",
" # each beta_i. The `Information Matrix`: (X_transpose * W * X) |\n",
" # |\n",
" # X^T*W*X results in a (p+1)-by-(p+1) matrix |\n",
" # X^T(y - p) results in a (p+1)-by-1 matrix |\n",
" # (X^T*W*X)^-1 * X^T(y - p) results in a (p+1)-by-1 matrix |\n",
" # ========================================================================|\n",
" X <- as.matrix(design_matrix)\n",
" y <- as.matrix(response_vector)\n",
" \n",
" # initialize logistic function used for Scoring calculations =>\n",
" pi_i <- function(v) return(exp(v)/(1 + exp(v)))\n",
"\n",
" # initialize beta_0, p_0, W_0, I_0 & U_0 =>\n",
" beta_0 <- matrix(rep(0, ncol(X)), nrow=ncol(X), ncol=1, byrow=FALSE, dimnames=NULL)\n",
" p_0 <- pi_i(X %*% beta_0)\n",
" W_0 <- diag(as.vector(p_0*(1-p_0)))\n",
" I_0 <- t(X) %*% W_0 %*% X\n",
" U_0 <- t(X) %*% (y - p_0)\n",
"\n",
" # initialize variables for iteration =>\n",
" beta_old <- beta_0\n",
" iter_I <- I_0\n",
" iter_U <- U_0\n",
" iter_p <- p_0\n",
" iter_W <- W_0\n",
" fisher_scoring_iterations <- 0\n",
" \n",
" # iterate until difference between abs(beta_new - beta_old) < epsilon =>\n",
" while(TRUE) {\n",
" \n",
" # Fisher Scoring Update Step =>\n",
" fisher_scoring_iterations <- fisher_scoring_iterations + 1\n",
" beta_new <- beta_old + solve(iter_I) %*% iter_U\n",
" \n",
" if (all(abs(beta_new - beta_old) < epsilon)) {\n",
" model_parameters <- beta_new\n",
" fitted_values <- pi_i(X %*% model_parameters)\n",
" covariance_matrix <- solve(iter_I)\n",
" break\n",
" \n",
" } else {\n",
" iter_p <- pi_i(X %*% beta_new)\n",
" iter_W <- diag(as.vector(iter_p*(1-iter_p)))\n",
" iter_I <- t(X) %*% iter_W %*% X\n",
" iter_U <- t(X) %*% (y - iter_p)\n",
" beta_old <- beta_new\n",
" }\n",
" }\n",
" \n",
" summaryList <- list(\n",
" 'model_parameters'=model_parameters,\n",
" 'covariance_matrix'=covariance_matrix,\n",
" 'fitted_values'=fitted_values,\n",
" 'number_iterations'=fisher_scoring_iterations\n",
" )\n",
" return(summaryList)\n",
"}\n",
"```\n",
"\n",
"A quick summary of R's matrix symbols and operators:\n",
"\n",
"* `%*%` is a stand-in for matrix multiplication\n",
"* `diag` returns a matrix with the provided vector as the diagonal and zero off-diagonal entries\n",
"* `t` returns the transpose of the provided matrix\n",
"* `solve` returns the inverse of the provided matrix, if applicable\n",
"\n",
"We read the *Challenger* dataset into R and partition it into the design matrix and the response, which will then be passed to `getCoefficients`:\n",
"\n",
"```R\n",
"df <- read.table(\n",
" file=\"Challenger.csv\", \n",
" header=TRUE, \n",
" sep=\",\", \n",
" stringsAsFactors=FALSE\n",
" )\n",
" \n",
"X <- as.matrix(cbind(1, df['TEMPERATURE'])) # design matrix\n",
"y <- as.matrix(df['O_RING_FAILURE']) # response vector\n",
"\n",
"colnames(X) <- NULL\n",
"colnames(y) <- NULL\n",
"\n",
"# call `getCoefficients`, keeping epsilon at .0001 =>\n",
"mySummary <- getCoefficients(design_matrix=X, response_vector=y, epsilon=.0001)\n",
"```\n",
"\n",
"Printing `mySummary` displays the model's estimated coefficents (*model_parameters*), the variance-covariance matrix of the coefficent estimates (*covariance_matrix*), the fitted values (*fitted_values*) and the number of Fisher Scoring iterations (*number_iterations*): \n",
"\n",
"```R\n",
"> print(mySummary)\n",
"\n",
"$model_parameters\n",
" [,1]\n",
"[1,] 15.0429016\n",
"[2,] -0.2321627\n",
"\n",
"$covariance_matrix\n",
" [,1] [,2]\n",
"[1,] 54.4442748 -0.79638682\n",
"[2,] -0.7963868 0.01171514\n",
"\n",
"$fitted_values\n",
" [,1]\n",
" [1,] 0.43049313\n",
" [2,] 0.22996826\n",
" [3,] 0.27362105\n",
" [4,] 0.32209405\n",
" [5,] 0.37472428\n",
" [6,] 0.15804910\n",
" [7,] 0.12954602\n",
" [8,] 0.22996826\n",
" [9,] 0.85931657\n",
"[10,] 0.60268105\n",
"[11,] 0.22996826\n",
"[12,] 0.04454055\n",
"[13,] 0.37472428\n",
"[14,] 0.93924781\n",
"[15,] 0.37472428\n",
"[16,] 0.08554356\n",
"[17,] 0.22996826\n",
"[18,] 0.02270329\n",
"[19,] 0.06904407\n",
"[20,] 0.03564141\n",
"[21,] 0.08554356\n",
"[22,] 0.06904407\n",
"[23,] 0.82884484\n",
"\n",
"$number_iterations\n",
"[1] 6\n",
"```\n",
" \n",
"So for the Challenger dataset, our implementation of the Fisher Scoring algorithm yields a $\\hat{\\beta}_{0} = 15.0429016$ and $\\hat{\\beta}_{1} = -0.2321627$. In order to predict new probabilities of O-Ring Failure based on temperature, our model implies the following formula:\n",
"\n",
"$$\n",
"\\pi = \\frac {e^{15.0429016 -0.2321627 * Temperature}}{1 + e^{15.0429016 -0.2321627 * Temperature}}\n",
"$$\n",
"\n",
"Negative coefficents correspond to variables that are negatively correlated to the probability of a positive outcome, the reverse being true for positive coefficents.\n",
" \n",
"Lets compare the results of our Fisher Scoring algorithm with the output of `glm` using the same dataset, and specifying `family=\"binomial\"` and `link=\"logit\"`:\n",
"\n",
"```R\n",
"df <- read.table(\n",
" file=\"Challenger.csv\", \n",
" header=TRUE, \n",
" sep=\",\", \n",
" stringsAsFactors=FALSE\n",
" )\n",
"\n",
"X <- as.matrix(cbind(1, df['TEMPERATURE'])) # design matrix\n",
"y <- as.matrix(df['O_RING_FAILURE']) # response vector\n",
"\n",
"colnames(X) <- NULL\n",
"colnames(y) <- NULL\n",
"\n",
"logistic.fit <- glm(\n",
" formula=O_RING_FAILURE ~ TEMPERATURE,\n",
" family=binomial(link=logit),\n",
" data=df\n",
" )\n",
"\n",
"```\n",
"\n",
"From `logistic.fit`, we'll extract *coefficients* (to compare estimated coefficients), *fitted.values* (to compare fitted values), *iter* (to compare the number of Fisher Scoring Iterations), and call `vcov(logistic.fit)` to obtain the variance-covariance matrix of the estimated coefficents (recall our estimated coefficents were 15.0429016 (Intercept) and -0.2321627 (TEMPERATURE):\n",
"\n",
"\n",
"```R\n",
"> logistic.fit$coefficients\n",
"\n",
"(Intercept) TEMPERATURE \n",
" 15.0429016 -0.2321627 \n",
" \n",
"> matrix(logistic.fit$fitted.values)\n",
"\n",
" [,1]\n",
" [1,] 0.43049313\n",
" [2,] 0.22996826\n",
" [3,] 0.27362105\n",
" [4,] 0.32209405\n",
" [5,] 0.37472428\n",
" [6,] 0.15804910\n",
" [7,] 0.12954602\n",
" [8,] 0.22996826\n",
" [9,] 0.85931657\n",
"[10,] 0.60268105\n",
"[11,] 0.22996826\n",
"[12,] 0.04454055\n",
"[13,] 0.37472428\n",
"[14,] 0.93924781\n",
"[15,] 0.37472428\n",
"[16,] 0.08554356\n",
"[17,] 0.22996826\n",
"[18,] 0.02270329\n",
"[19,] 0.06904407\n",
"[20,] 0.03564141\n",
"[21,] 0.08554356\n",
"[22,] 0.06904407\n",
"[23,] 0.82884484\n",
"\n",
 logistic.fit$fitted.iter">
"> logistic.fit$iter\n",
"\n",
"[1] 5\n",
"\n",
"> vcov(logistic.fit)\n",
"\n",
" (Intercept) TEMPERATURE\n",
"(Intercept) 54.4441826 -0.79638547\n",
"TEMPERATURE -0.7963855 0.01171512\n",
"```\n",
"\n",
"Our coefficients match exactly with those generated by `glm`, and as would be expected, the fitted values are also identical. \n",
" \n",
"Notice there's some discrepancy in the estimate of the variance-covariance matrix beginning with the 4th decimal (54.4442748 in our algorithm vrs. 54.4441826 for the variance of the Intercept term from `glm`). This may be due to rounding, or the loss of precision in floating point values when inverting matricies. Alternatively, notice our algorithm used one more Fisher Scoring iteration than `glm` (6 vrs. 5). Perhaps increasing the size of our epsilon will reduce the number of Fisher Scoring iterations, which in turn may lead to better agreement between the variance-covariance matricies. \n",
"\n",
"Calling `summary(logistic.fit)` prints, among other things, the Standard Error of the coefficent estimates:\n",
"\n",
"\n",
"```R\n",
"> summary(logistic.fit)\n",
"\n",
"Coefficients:\n",
" Estimate Std. Error z value Pr(>|z|) \n",
"(Intercept) 15.0429 7.3786 2.039 0.0415 *\n",
"TEMPERATURE -0.2322 0.1082 -2.145 0.0320 *\n",
"```\n",
"\n",
"The *Std. Error* values are the square root of the diagonal elements of the variance-covariance matrix, $\\sqrt{54.4441826} = 7.3786$ and $\\sqrt{0.01171512} = 0.1082$. \n",
" \n",
"Also, *z value* is the estimated coefficent divided by the *Std. Error*. In our example, $15.0429/7.3786 = 2.039$ and $-0.2322/0.1082 = -2.145$. *Pr(>|z|)* is the p-value, which tells us whether we should trust the estimated coefficent value. The standard rule of thumb is that coefficents with p-values less than 0.05 are reliable, although some tests require stricter thresholds. \n",
"\n",
"\n",
"A feature of Logistic Regression is that the training data's marginal probabilities are preserved. If you aggregate the fitted values from the training set, that quanity will equal the number of positive outcomes in the response vector:\n",
"\n",
"```R\n",
"#y is the dicotomous response vector =>\n",
"\n",
"> sum(y)\n",
"7\n",
"\n",
"#checking sum for our algorithm =>\n",
"> sum(mySummary$fitted_values)\n",
"7\n",
"\n",
"#checking sum for glm =>\n",
"> sum(logistic.fit$fitted.values)\n",
"7\n",
"```\n",
" \n",
"\n",
"\n",
"## Using The Model to Calculate Probabilities\n",
"\n",
"In R, to apply the model generated by `glm` to a new set of explanatory variables, use the \n",
" [predict](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.lm.html) function. Pass a list or data.frame of explanatory variables to `predict`, and for Logistic Regression models, be sure to set `type=\"response\"` to ensure probabilities are returned. For example:\n",
" \n",
"```R \n",
"# new inputs for Logistic Regression model =>\n",
"> tempsDF <- data.frame(TEMPERATURE=c(24, 41, 46, 47, 61))\n",
"\n",
"> predict(logistic.fit, tempsDF, type=\"response\")\n",
" \n",
" 1 2 3 4 5 \n",
"0.9999230 0.9960269 0.9874253 0.9841912 0.7070241\n",
"```\n",
"\n",
"\n",
"\n",
"## Conclusion\n",
"\n",
"This post was an attempt to shed some light on the calculation routines used in estimating Logistic Regression model coefficients in R. In future posts, we'll explore alternative estimations routines, and dig deeper into the statistics generated by the `glm` function, which can be used in determining the significance and/or the goodness-of-fit of a given model. Until then, happy coding!\n",
"\n",
"\n",
"\n",
"### Footnotes:\n",
"[1] - https://stat.ethz.ch/R-manual/R-devel/library/stats/html/glm.html \n",
"[2] - Agresti, A. (2002). *Categorical Data Analysis* (2nd Ed.)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.1.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}