
Let's say that I want to run a logistic regression on a dataset with $n$ observations and $p$ variables, and I have a bad model.

I can't understand why running a logistic regression again, but this time with fewer variables (which is what happens with penalization), can improve my model...

I know that correlations and multicollinearity can be an issue. But my question is: why are correlations and multicollinearity an issue?

Strictly speaking, logistic regression only performs a minimization of the cost function, doesn't it? $\min_{\beta} \sum_{i=1}^{n}\ln\left(1+e^{-y_{i}\sum_{j=1}^{p}\beta_{j}x_{i,j}}\right)$

where the output of the $i^{th}$ observation is $y_{i} \in \{-1,1\}$, the value of the $j^{th}$ variable for this observation is $x_{i,j}$, and the coefficients are $\beta_{1},\dots,\beta_{p}$.
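For concreteness, here is a minimal numeric sketch of that cost function; the data and the helper name `logistic_loss` are made up for illustration:

```python
# Minimal sketch of the cost function above, with made-up data
# (n = 4 observations, p = 2 variables; names are illustrative).
import numpy as np

def logistic_loss(beta, X, y):
    """Sum over i of ln(1 + exp(-y_i * sum_j beta_j * x_{i,j})), with y_i in {-1, +1}."""
    margins = y * (X @ beta)
    return np.sum(np.log1p(np.exp(-margins)))

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(logistic_loss(np.array([0.5, 0.5]), X, y))
```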

Now let's say that I have model_1, a logistic regression fitted on all $p$ variables, and model_2, a logistic regression fitted on a subset of variables from which correlations and collinearity have been removed.

How can model_2 be better than model_1, since the minimization problem is the same except that some $\beta_{j}$ are forced to be zero in model_2? Why can't model_1 reach such a solution simply by setting those $\beta_{j}$ to zero?
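To make the comparison concrete, here is a rough simulation sketch of the two fits, not a proof; the synthetic data-generating setup, the variable counts, and the use of scikit-learn with a very large `C` as a stand-in for unpenalized fitting are all my own assumptions. Up to numerical tolerance, the training loss of model_1 comes out at or below model_2's (model_2's solution is feasible for model_1), while the held-out loss can go the other way:

```python
# Rough simulation sketch of model_1 (all variables) vs model_2 (subset).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
n, p_useful, p_noise = 200, 3, 20
X_train = rng.normal(size=(n, p_useful + p_noise))
X_test = rng.normal(size=(n, p_useful + p_noise))
true_beta = np.concatenate([np.ones(p_useful), np.zeros(p_noise)])
y_train = rng.binomial(1, 1 / (1 + np.exp(-X_train @ true_beta)))
y_test = rng.binomial(1, 1 / (1 + np.exp(-X_test @ true_beta)))

# model_1: all variables; model_2: only the first p_useful columns.
# C is set very large so the fit is close to unpenalized maximum likelihood.
m1 = LogisticRegression(C=1e6, max_iter=5000).fit(X_train, y_train)
m2 = LogisticRegression(C=1e6, max_iter=5000).fit(X_train[:, :p_useful], y_train)

print("train loss, model_1:", log_loss(y_train, m1.predict_proba(X_train)))
print("train loss, model_2:", log_loss(y_train, m2.predict_proba(X_train[:, :p_useful])))
print("test loss,  model_1:", log_loss(y_test, m1.predict_proba(X_test)))
print("test loss,  model_2:", log_loss(y_test, m2.predict_proba(X_test[:, :p_useful])))
```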

I am looking for a mathematical proof, but any insights are more than welcome. Thanks!

2 Answers


There are two sides to this coin.

  1. When you want to use logistic regression for explanation, the goal is to interpret the betas. When you have perfect multicollinearity in your data (say for $x_1$ and $x_2$), $b_1$ and $b_2$ (belonging to $x_1$ and $x_2$) are not reliable for interpretation. $b_1$ and $b_2$ could be 10 and 0, or 5 and 5, or 7 and 3, since each of these would yield the same predicted probability $p$ (which is $\frac{1}{1+e^{-\mathrm{logit}}}$) and therefore the same error; a small numeric sketch of this is given after the reference below. That is why you generally do not want to use data that is multicollinear. Therefore you could call a model that does not have collinearity a 'better' model.

  2. When you are focused on the 'p', as is the case in e.g. predictive modeling, it does not matter that the betas are not interpretable, as long as the value of 'p' is as close to the 'true' p as possible. In that case it might be preferable to keep the multicollinear data. See also

"Shmueli, G., 2010. To Explain or to Predict. Statistical Science, 25(3), pp. 289-310"

Ferdi
LMB

I don't have any mathematics for this; others here probably will.

I think it depends on what you mean by "better". It is true that adding a variable can never worsen the in-sample fit, and it would be very odd for it not to improve the fit at all. But good fit is not the only criterion.

You mention collinearity. There are at least two problems associated with collinearity: 1) high variance of the parameter estimates and 2) high sensitivity to small changes in the data (not in the predictions, but in the fitted model). Those are problems.
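As a rough illustration of problem 1), the following simulation sketch compares the spread of $b_1$ across replications with and without a nearly collinear copy of $x_1$; the data-generating choices are my own assumptions, but the variance inflation is the point:

```python
# Rough sketch of problem 1): the variance of a coefficient estimate blows up
# when a nearly collinear copy of its variable is included. Data-generating
# choices below are illustrative assumptions, not from the answer.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 300
b1_with, b1_without = [], []
for _ in range(200):
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)          # nearly a copy of x1
    y = rng.binomial(1, 1 / (1 + np.exp(-x1)))   # true model uses only x1
    X_full = np.column_stack([x1, x2])
    fit_full = LogisticRegression(C=1e6, max_iter=5000).fit(X_full, y)
    fit_reduced = LogisticRegression(C=1e6, max_iter=5000).fit(x1[:, None], y)
    b1_with.append(fit_full.coef_[0, 0])
    b1_without.append(fit_reduced.coef_[0, 0])

print("sd of b1, x2 included:", np.std(b1_with))
print("sd of b1, x2 dropped: ", np.std(b1_without))
```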

Another problem is quasi-complete or complete separation - this can vanish when a variable is removed.
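A tiny made-up example of that: below, $x_2$ alone perfectly separates the classes, so its (nearly) unpenalized coefficient grows without bound, while the fit on $x_1$ alone is well behaved. The data and settings are illustrative assumptions:

```python
# Complete separation sketch: the sign of x2 alone predicts y, so the MLE
# for its coefficient does not exist; dropping x2 removes the problem.
import numpy as np
from sklearn.linear_model import LogisticRegression

x1 = np.array([0.5, 1.2, 0.3, 0.4, 1.1, 0.6])     # overlaps across classes
x2 = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])  # sign alone predicts y
y = np.array([0, 0, 0, 1, 1, 1])

X = np.column_stack([x1, x2])
full = LogisticRegression(C=1e8, max_iter=10000).fit(X, y)
reduced = LogisticRegression(C=1e8, max_iter=10000).fit(x1[:, None], y)
print("coefficients with x2:   ", full.coef_)     # coefficient on x2 is huge
print("coefficient without x2: ", reduced.coef_)  # modest, finite estimate
```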

Related to this is overfitting. If you have a highly overfit model then it will not work well on new data.

Peter Flom