32

I am using Python's scikit-learn to train and test a logistic regression.

scikit-learn returns the regression's coefficients of the independent variables, but it does not provide the coefficients' standard errors. I need these standard errors to compute a Wald statistic for each coefficient and, in turn, compare these coefficients to each other.

I have found one description of how to compute standard errors for the coefficients of a logistic regression (here), but it is somewhat difficult to follow.

If you happen to know of a simple, succinct explanation of how to compute these standard errors and/or can provide me with one, I'd really appreciate it! I don't mean specific code (though please feel free to post any code that might be helpful), but rather an algorithmic explanation of the steps involved.

Gyan Veda

4 Answers

39

The standard errors of the model coefficients are the square roots of the diagonal entries of the covariance matrix. Consider the following:

  • Design matrix:

$\textbf{X = }\begin{bmatrix} 1 & x_{1,1} & \ldots & x_{1,p} \\ 1 & x_{2,1} & \ldots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \ldots & x_{n,p} \end{bmatrix}$ , where $x_{i,j}$ is the value of the $j$th predictor for the $i$th observation.

(NOTE: This assumes a model with an intercept.)

  • $\textbf{V = } \begin{bmatrix} \hat{\pi}_{1}(1 - \hat{\pi}_{1}) & 0 & \ldots & 0 \\ 0 & \hat{\pi}_{2}(1 - \hat{\pi}_{2}) & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \hat{\pi}_{n}(1 - \hat{\pi}_{n}) \end{bmatrix}$ , where $\hat{\pi}_{i}$ represents the predicted probability of class membership for observation $i$.

The covariance matrix can be written as:

$(\textbf{X}^{T}\textbf{V}\textbf{X})^{-1}$

This can be implemented with the following code:

import numpy as np
from sklearn import linear_model

# Initiate logistic regression object
logit = linear_model.LogisticRegression()
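# NOTE: sklearn's LogisticRegression applies an L2 penalty by default (C=1.0);
# the covariance formula below assumes an unpenalized maximum-likelihood fit,
# so consider passing a very large C (e.g. C=1e9) to approximate the plain MLE.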

# Fit model. Let X_train = matrix of predictors, y_train = vector of the response variable.
# NOTE: Do not include a column for the intercept when fitting the model.
resLogit = logit.fit(X_train, y_train)

# Calculate matrix of predicted class probabilities.
# Check resLogit.classes_ to make sure that sklearn ordered your classes as expected
predProbs = resLogit.predict_proba(X_train)

# Design matrix -- add column of 1's at the beginning of your X_train matrix
X_design = np.hstack([np.ones((X_train.shape[0], 1)), X_train])

# Initiate matrix of 0's, fill diagonal with each predicted observation's variance
V = np.diagflat(np.prod(predProbs, axis=1))

# Covariance matrix
# Note that the @-operator does matrix multiplication in Python 3.5+, so if you're running
# Python 3.5+, you can replace the covLogit-line below with the more readable:
# covLogit = np.linalg.inv(X_design.T @ V @ X_design)
covLogit = np.linalg.inv(np.dot(np.dot(X_design.T, V), X_design))
print("Covariance matrix: ", covLogit)

# Standard errors
print("Standard errors: ", np.sqrt(np.diag(covLogit)))

# Wald statistic (coefficient / s.e.) ^ 2
logitParams = np.insert(resLogit.coef_, 0, resLogit.intercept_)
print("Wald statistics: ", (logitParams / np.sqrt(np.diag(covLogit))) ** 2)

All that being said, statsmodels will probably be a better package to use if you want access to a LOT of "out-of-the-box" diagnostics.

AllanLRH
j_sack
  • To avoid memory issues and to account for the singular-matrix case, you could update your code as follows - V = np.product(predProbs, axis=1); covLogit = np.linalg.pinv(np.dot(X_design.T * V, X_design)) – steadyfish May 01 '19 at 19:27
  • Why not use np.sqrt(np.diagonal(np.cov(x,y))) directly? – Alex Moore-Niemi Apr 28 '20 at 01:05
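
Spelled out, the memory-friendlier variant suggested in the comment above could look like the following sketch, reusing X_design and predProbs from the code in this answer and keeping the per-observation variances as a 1-D array instead of building the dense n-by-n matrix V:

import numpy as np

# Per-observation variances pi_i * (1 - pi_i), kept as a 1-D array
v = np.prod(predProbs, axis=1)

# (X^T V X)^{-1} via broadcasting; pinv also tolerates a (near-)singular matrix
covLogit = np.linalg.pinv((X_design.T * v) @ X_design)
print("Standard errors: ", np.sqrt(np.diag(covLogit)))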
14

Does your software give you a parameter covariance (or variance-covariance) matrix? If so, the standard errors are the square root of the diagonal of that matrix. You probably want to consult a textbook (or google for university lecture notes) for how to get the $V_\beta$ matrix for linear and generalized linear models.
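
For the ordinary linear model, for example, a minimal sketch of that recipe could look like the following (here `X` and `y` are placeholder names for a design matrix with an intercept column and the response vector); the logistic-regression answer above simply swaps $\hat{\sigma}^{2}(\textbf{X}^{T}\textbf{X})^{-1}$ for $(\textbf{X}^{T}\textbf{V}\textbf{X})^{-1}$:

import numpy as np

# OLS coefficients, then V_beta = sigma^2 * (X^T X)^{-1}
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (X.shape[0] - X.shape[1])  # residual variance estimate
V_beta = sigma2_hat * np.linalg.inv(X.T @ X)

# Standard errors are the square roots of the diagonal of V_beta
print(np.sqrt(np.diag(V_beta)))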

generic_user
    I haven't been able to find anything online for the generalized linear model case (maybe I don't know the right search terms?). Help? – Kevin H. Lin Dec 12 '14 at 03:56
  • Here is one that I found after a few minutes of googling. My advice is to first understand how the parameter variance is calculated in a basic linear model. Once you get that, the extension to GLMs is easier. All the same, knowing how to calculate it and knowing how to get it in a software package aren't the same thing. www.sagepub.com/upm-data/21121_Chapter_15.pdf – generic_user Dec 12 '14 at 16:14
7

If you're interested in doing inference, then you'll probably want to have a look at statsmodels. Standard errors and common statistical tests are available. Here's a logistic regression example.
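
For instance, a minimal sketch with statsmodels' Logit, assuming the same X_train and y_train arrays as in the question (note that statsmodels does not add an intercept automatically, hence add_constant):

import statsmodels.api as sm

# Add an explicit intercept column and fit by maximum likelihood
X_design = sm.add_constant(X_train)
result = sm.Logit(y_train, X_design).fit()

print(result.params)                      # coefficients (intercept first)
print(result.bse)                         # standard errors
print((result.params / result.bse) ** 2)  # Wald statistics
print(result.summary())                   # full table with z-values and p-values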

jseabold
  • Thanks for the recommendation! I'll look into statsmodels. Too bad that scikit-learn doesn't provide this sort of output. – Gyan Veda Mar 11 '14 at 15:11
  • Yeah. It's just usually not the goal of machine learning-type toolboxes to provide tools for (frequentist) hypothesis tests. If you run into data-size constraints that don't work well in statsmodels but do work in scikit-learn, I'd be interested to hear about them on github. – jseabold Mar 12 '14 at 16:46
  • @jseabold However, if you want to get some ad hoc notion of feature importance in logistic regression, you cannot just read off the effect sizes (the coefficients) without thinking about their standard errors. So even if you're not doing a frequentist test, and you just want some indication of effect sizes and robustness, the sklearn lack of variance output is challenging. – ely Aug 09 '15 at 18:18
1

Building upon the fantastic work of @j_sack, I have two suggestions for his code:

  1. Since predict_proba returns one probability per class, the result is a two-dimensional array with one column per class, which is why you should select the column for the class of interest, e.g.:

predProbs[:,0]

(check class of interest with resLogit.classes_)

  2. The diagonal matrix should not contain the raw predicted probabilities but $w_{ii} = \hat{\pi}_{i}(1 - \hat{\pi}_{i})$, e.g.:

    V = np.diagflat(predProbs[:, 0] * (1 - predProbs[:, 0]))

Artur