Estimating Logistic Regression Coefficents From Scratch in Python

In this post, we'll demonstrate how to estimate the coefficents of a Logistic Regression model using the Fisher Scoring algorithm in Python. We will then compare our estimates to those generated by scikit-learn’s linear_model.LogisticRegression class when fed the same dataset.

In a Generalized Linear Model, the response may have any distribution from the exponential family, and rather than assuming the mean is a linear function of the explnatory variables, we assume that a function of the mean, or the link function, is a linear function of the explnatory variables.

Logistic Regression is used for modeling data with a categorical response. Although it’s possible to model multinomial data using Logistic Regression, in this post we’ll limit our analysis to models having a dichotomous response, where the outcome can be classified as 'Yes/No' or '1/0'.

The Logistic Regression model is a Generalized Linear Model whose canonical link is the logit, or log-odds:

$$ \mathrm{Ln} \big(\frac{\pi_{i}}{1 - \pi_{i}} \big) = \beta_{0} + \beta_{1}x_{i1} + \cdots + \beta_{p}x_{ip} $$

for $i = (1, \cdots , n)$.

Solving the logit for $\pi_{i}$, which is a stand-in for the predicted probability associated with observation $x_{i}$, yields

$$ \pi_{i} = \frac {e^{\beta_{0} + \beta_{1}x_{i1} + \cdots + \beta_{p}x_{ip}}}{1 + e^{\beta_{0} + \beta_{1}x_{i1} + \cdots + \beta_{p}x_{ip}}} = \frac {1}{1 + e^{-(\beta_{0} + \beta_{1}x_{i1} + \cdots + \beta_{p}x_{ip})}}, $$

where $-\infty < x_{i} < \infty$ and $0 < \pi_{i }< 1$.

Parameter Estimation

Maximum Likelihood Estimation can be used to determine the parameters of a Logistic Regression model, which entails finding the set of parameters for which the probability of the observed data is greatest. The objective is to estimate the $p + 1$ unknown $\beta_{0}, \cdots ,\beta_{p}$.

Let $Y_{i}$ represent independent, dicotomous response values for each of $n$ observations, where $Y_i=1$ denotes a success and $Y_i=0$ denotes a failure. The density function of a single observation $Y_i$ is given by

$$ p(y_{i}) = \pi_{i}^{y_{i}}(1-\pi_{i})^{1-y_{i}}, $$

and the corresponding likelihood function is

$$ L(\beta) = \prod_{i=1}^{n} \pi_{i}^{y_{i}}(1-\pi_{i})^{1-y_{i}}. $$

Taking the natural log of the maximum likelihood estimate results in the log-likelihood function:

$$ \begin{align*} l(\beta) &= \mathrm{Ln}(L(\beta)) = \mathrm{Ln} \Big(\prod_{i=1}^{n} \pi_{i}^{y_{i}}(1-\pi_{i})^{1-y_{i}} \Big) = \sum_{i=1}^{n} y_{i} \cdot \mathrm{Ln}(\pi_{i}) + (1-y_{i}) \cdot \mathrm{Ln}(1-\pi_{i})\\ &= \sum_{i=1}^{n} y_{i} \cdot \mathrm{Ln} \Big(\frac {e^{\beta_{0} + \beta_{1}x_{i1} + \cdots + \beta_{p}x_{ip}}}{1 + e^{\beta_{0} + \beta_{1}x_{i1} + \cdots + \beta_{p}x_{ip}}} \Big) + (1 - y_{i}) \cdot \mathrm{Ln} \Big(\frac {1}{1 + e^{\beta_{0} + \beta_{1}x_{i1} + \cdots + \beta_{p}x_{ip}}} \Big)\\ &= \sum_{i=1}^{n} y_{i}(\beta_{0} + \beta_{1}x_{i1} + \cdots + \beta_{p}x_{ip}) - \mathrm{Ln}(1 + e^{\beta_{0} + \beta_{1}x_{i1} + \cdots + \beta_{p}x_{ip}})\\ \end{align*} $$

The first-order partial derivatives of the log-likelihood are calculated and set to zero for each $k = 0, 1, \cdots, p$

$$ \frac {\partial l(\beta)}{\partial \beta_{k}} = \sum_{i=1}^{n} y_{i}x_{ik} - \pi_{i}x_{ik} = \sum_{i=1}^{n} x_{ik}(y_{i} - \pi_{i}) = 0, $$

which can be represented in matrix notation as

$$ \frac {\partial l(\beta)}{\partial \beta} = X^{T}(y - \pi), $$

where $X^{T}$ is a (p + 1)-by-n matrix and $(y - \pi)$ an n-by-1 vector.

The vector of first-order partial derivatives of the log-likelihood function is referred to as the score function in statistical literature, and is typically represented as $U$.

These (p+1) equations are solved simultaneously to obtain the parameter estimates $\hat\beta_{0}, \cdots ,\hat\beta_{p}$.

Each solution specifies a critical-point which will be either a maximum or a minimum. The critical point will be a maximum if the matrix of second partial derivatives is negative definite (which means every element on the diagonal of the matrix is less than zero).

The matrix of second partial derivatives is given by

$$ \frac{\partial^{2} l(\beta)}{{\partial \beta_{k}}{\partial \beta_{k}}^{T}} = - \sum_{i=1}^{n} x_{ik}\pi_{i}(1-\pi_{i}){x_{ik}}^{T}, $$

represented in matrix form as

$$ \frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}} = -X^{T}WX, $$

where $W$ is an n-by-n diagonal matrix of weights with each element equal to $\pi_{i}(1 - \pi_{i})$ for Logistic Regression models (in general, the weights matrix $W$ will have entries inversely proportional to the variance of the response).

Since no closed-form solution exists for determining Logistic Regression model coefficents (as exists for Linear Regression models), iterative techniques must be employed.

Fitting the Model

Two distinct but related iterative methods can be utilized in determining model coefficents: the Newton-Raphson method and Fisher Scoring. The Newton-Raphson method relies on the matrix of second partial derivatives, also known as the Hessian. The Newton-Raphson update expression is:

$$ \beta^{(t+1)} = \beta^{(t)} - (H^{(t)})^{-1}U^{(t)} $$

where:

$\beta^{(t+1)}$ = the vector of updated coefficent estimates.
$\beta^{(t)}$ = the vector of coefficent estimates from the previous iteration.
$(H^{(t)})^{-1}$ = the inverse of the Hessian, $\Big(\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\Big)^{-1}$.
$U^{(t)}$ = the vector of first-order partial derivatives of the log-likelihood function, $X^{T}(y - \pi)$.

The Newton-Raphson method starts with an initial guess for the solution, and obtains a second guess by approximating the function to be maximized in a neighborhood of the initial guess by a second-degree polynomial, and then finding the location of that polynomial’s maximum value. This process continues until it converges to the actual solution. The convergence of $\beta^{t}$ to $\hat{\beta}$ is usually fast, with adequate convergence realized after fewer than 10-20 iterations.

An alternative method, Fisher scoring, utilizes the expected information $-E\Big(\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\Big)$. Let $\mathcal{I}$ serve as a stand-in for the expected value of the information:

$$ \mathcal{I} = -E\Big(\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\Big). $$

The Fisher scoring update step replaces $-H^{(t)}$ from Newton-Raphson with $\mathcal{I}^{(t)}$:

$$ \begin{align*} \beta^{(t+1)} &= \beta^{(t)} + (\mathcal{I}^{(t)})^{-1}U^{(t)}\\ &= \beta^{(t)} + (X^{T}WX)^{-1}X^{T}(y - \pi)\\ \end{align*} $$

where:

$\beta^{(t+1)}$ = the vector of updated coefficent estimates.
$\beta^{(t)}$ = the vector of coefficent estimates from the previous iteration.
$(\mathcal{I}^{(t)})^{-1}$ = the inverse of the expected information matrix, $-E \Big(\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\Big)^{-1}$.
$U^{(t)}$ = the vector of first-order partial derivatives of the log-likelihood function, $X^{T}(y - \pi)$.

For GLMs with a canonical link (of which employing the logit for Logistic Regression is an example), the observed and expected information are the same. When the response follows an exponential family distribution, and the canonical link function is employed, observed and expected information coincide so that Fisher scoring and Newton-Raphson are identical.

When the canonical link is used, the second partial derivatives of the log-likelihood do not depend on the observation $y_i$, and therefore

$$ \frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}} = E \Big(\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}} \Big). $$

Fisher scoring has the advantage that it produces the asymptotic covariance matrix as a by-product.

To summarize:

The Hessian is the matrix of second partial derivatives of the log-likelihood with respect to the parameters, $H = \frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}$.
The observed information is $-\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}$.
The expected information is $\mathcal{I} = E\Big(-\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\Big)$.
The asymptotic covariance matrix is $\mathrm{Var}(\hat{\beta}) = E\Big(-\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\Big)^{-1} = (X^{T}WX)^{-1}$.

For models employing a canonical link function:

The observed and expected information are the same, $\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}} = E\Big(\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\Big)$.
$H = -\mathcal{I}$, or $\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}} = E\Big(-\frac{\partial^{2} l(\beta)}{{\partial \beta}{\partial \beta}^{T}}\Big)$.
The Newton-Raphson and Fisher Scoring algorithms yield identical results.

Fisher Scoring Implementation

The data used for our sample calculation can be obtained here. The data represents O-Ring failures in the 23 pre-Challenger space shuttle missions. TEMPERATURE will serve as the single explnatory variable which will be used to predict O_RING_FAILURE, which is 1 if a failure occurred, 0 otherwise.

Once the parameters have been determined, the model estimate of the probability of success for a given observation can be calculated with:

$$ \hat\pi_{i} = \frac {e^{\hat\beta_{0} + \hat\beta_{1}x_{i1} + \cdots + \hat\beta_{p}x_{ip}}}{1 + e^{\hat\beta_{0} + \hat\beta_{1}x_{i1} + \cdots + \hat\beta_{p}x_{ip}}} $$

In the following code, we define a single function, get_params, which returns the estimated model coefficents as a (p+1)-by-1 array. In addition, the function returns the number of scoring iterations, fitted values and the variance-covariance matrix for the estimated coefficients.

import numpy as np
from numpy.linalg import inv


def get_params(X, y, epsilon=.001):
    """
    Determine Logistic Regression coefficents using Fisher Scoring algorithm.
    Iteration ceases once changes between elements in coefficent matrix across
    consecutive iterations is less than epsilon.
   
        - design_matrix      `X` : n-by-(p+1)                                
        - response_vector    `y` : n-by-1                                   
        - probability_vector `p` : n-by-1                                   
        - weights_matrix     `W` : n-by-n                                    
        - epsilon                : threshold above which iteration continues
        - n                      : # of observations                        
        - (p + 1)                : # of parameters (+1 for intercept term) 
   
        - U: First derivative of log-likelihood with respect to                
           each beta_i, i.e. "Score Function" = X^T * (y - p)        
           
        - I: Second derivative of log-likelihood with respect to               
           each beta_i, i.e. "Information Matrix" = (X^T * W * X)      
                                                                           
        - X^T*W*X results in a (p + 1)-by-(p + 1) matrix.                          
        - X^T(y - p) results in a (p+1)-by-1 matrix.                            
        - (X^T*W*X)^-1 * X^T(y - p) results in a (p + 1)-by-1 matrix.             
   
    """
    def sigmoid(v): 
        return (1 / (1 + np.exp(-v)))
    
    betas0 = np.zeros(X.shape[1]).reshape(-1, 1)
    p = sigmoid(X @ betas0)
    W = np.diag((p * (1 - p)).ravel())
    I = X.T @ W @ X
    U = X.T @ (y - p)

    n_iter = 0
    coeffs = []
    epsilon = 1e-5

    while True:
        n_iter+=1
        coeffs.append(np.array(betas0))
        betas = betas0 + inv(I) @ U
        betas = betas.reshape(-1, 1)

        if np.all(np.abs(betas - betas0) < epsilon):
            break
        else:
            p = sigmoid(X @ betas)
            W = np.diag((p * (1 - p)).ravel())
            I = X.T @ W @ X
            U = X.T @ (y - p)
            betas0 = betas

    dresults = {
        "params": coeffs[-1],
        "ypred": sigmoid(X @ betas),
        "V": inv(I),
        "n_iter": n_iter
        }

    return(dresults)

We read in the Challenger dataset and partition it into the design matrix and response vector, which are then passed to get_params:

df = pd.read_csv("https://gist.githubusercontent.com/jtrive84/835514a76f7afd552c999e4d9134baa8/raw/6dac51b80f892ef051174a46766eb53c7b609ebd/Challenger.csv")
X0 = df[["TEMPERATURE"]].values
X = np.concatenate([np.ones(X0.shape[0]).reshape(-1, 1), X0], axis=1)
y = df[["O_RING_FAILURE"]].values

dresults = get_params(X, y)

get_params returns a dictionary consisting of the following keys:

"params": Model estimated coefficents
"ypred": Fitted values
"V": Variance-covariance matrix of the coefficent estimates
"n_iter": Number of Fisher scoring iterations

Inspecting the contents of dresults yields the following:

In [4]: dresults["params"]
Out[4]:
array([[15.04290163],
       [-0.23216274]])
       
In [5]: dresults["V"]
array([[ 5.44442748e+01, -7.96386824e-01],
       [-7.96386824e-01,  1.17151446e-02]])
       
In [6]: dresults["n_iter"]
Out[6]: 6

In [7]: dresults["ypred"]
Out[7]:
array([[0.43049313],
       [0.22996826],
       [0.27362105],
       [0.32209405],
       [0.37472428],
       [0.1580491 ],
       [0.12954602],
       [0.22996826],
       [0.85931657],
       [0.60268105],
       [0.22996826],
       [0.04454055],
       [0.37472428],
       [0.93924781],
       [0.37472428],
       [0.08554356],
       [0.22996826],
       [0.02270329],
       [0.06904407],
       [0.03564141],
       [0.08554356],
       [0.06904407],
       [0.82884484]])

So for the Challenger dataset, our implementation of Fisher scoring results in a model with $\hat \beta_{0} = 15.0429016$ and $\hat \beta_{1} = -0.2321627$. In order to predict new probabilities of O-Ring Failure based on temperature, our model implies the following formula:

$$ \pi = \frac {1}{1 + e^{-(15.0429016 -0.2321627 \times \mathrm{TEMPERATURE})}} $$

Negative coefficents correspond to features that are negatively correlated to the probability of a positive outcome, with the reverse being true for positive coefficents.

Lets compare the results of our implementation against the estimates generated within statsmodels:

import statsmodels.formula.api as smf

mdl = smf.logit("O_RING_FAILURE ~ TEMPERATURE", data=df).fit()

We can access attributes of the fit model to compare with out results:

In [8]: mdl.params
Out[8]: 
Intercept      15.042902
TEMPERATURE    -0.232163
dtype: float64

In [9]: mdl.cov_params()
Out[9]:
             Intercept  TEMPERATURE
Intercept    54.444275    -0.796387
TEMPERATURE  -0.796387     0.011715

In[10]: mdl.predict(df)
Out[10]:
0     0.430493
1     0.229968
2     0.273621
3     0.322094
4     0.374724
5     0.158049
6     0.129546
7     0.229968
8     0.859317
9     0.602681
10    0.229968
11    0.044541
12    0.374724
13    0.939248
14    0.374724
15    0.085544
16    0.229968
17    0.022703
18    0.069044
19    0.035641
20    0.085544
21    0.069044
22    0.828845
dtype: float64

We can see that the values produced using the statsmodels api match our results exactly.

A feature of Logistic regression models (more specifically, any GLM that utilizes a canonical link function) is that the training data's marginal probabilities are preserved. If you aggregate the fitted values from the training set, that quanity will equal the number of positive outcomes in the response vector:

In[11]: dresults["ypred"].sum()
Out[11]: 6.999999999999998

In[12]: mdl.predict(df).sum()
Out[12]: 7.000000000000003

In[13]: df.O_RING_FAILURE.values.nonzero()[0].size
Out[13]: 7

We have 7 positive instances in our dataset, and the total probability aggregates to 7 in both the from scratch and statsmodels implementations.

jtrive84/b_logistic_python.md

Estimating Logistic Regression Coefficents From Scratch in Python

Parameter Estimation

Fitting the Model

Fisher Scoring Implementation