Let's say I want to run a logistic regression on a dataset with $n$ observations and $p$ variables, and the resulting model is bad.
I can't understand why running the logistic regression again, this time with fewer variables (which is effectively what penalization does), can improve my model...
I know that correlation and multicollinearity among the variables can be an issue. But my question is: why exactly are correlation and multicollinearity an issue?
Strictly speaking, logistic regression only minimizes a cost function, doesn't it? $\min_{\beta} \sum_{i=1}^{n}\ln\left(1+e^{-y_{i}\sum_{j=1}^{p}\beta_{j}x_{i,j}}\right)$
where $y_{i} \in \{-1,1\}$ is the output of the $i^{\text{th}}$ observation, $x_{i,j}$ is the value of the $j^{\text{th}}$ variable for that observation, and $\beta_{1},\dots,\beta_{p}$ are the coefficients.
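To make sure I am not misreading the objective, here is a minimal sketch of what I mean (Python with NumPy/SciPy; the synthetic data are just my own toy example, not part of the question):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy data: n observations, p variables, labels in {-1, +1}
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true)), 1, -1)

def cost(beta, X, y):
    # sum_i ln(1 + exp(-y_i * sum_j beta_j * x_ij)), written in a stable form
    margins = y * (X @ beta)
    return np.logaddexp(0.0, -margins).sum()

# "Plain" logistic regression: just minimize the cost over beta
res = minimize(cost, x0=np.zeros(p), args=(X, y), method="BFGS")
print(res.x)  # fitted coefficients beta_1, ..., beta_p
```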
Now let's say I have model_1, a logistic regression fit on all $p$ variables, and model_2, a logistic regression fit on a subset of variables from which the correlated/collinear ones have been removed.
How can model_2 be better than model_1, given that the minimization problem is the same except that some $\beta_{j}$ are forced to be zero in model_2? Why can't model_1 reach the same solution simply by setting those $\beta_{j}$ to zero?
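To illustrate the comparison I have in mind, here is a minimal sketch (Python with scikit-learn; the synthetic data, the choice of which columns are redundant near-copies, and the use of a very large C to approximate an unpenalized fit are all my own assumptions): model_1 is fit on all columns, model_2 only on the non-redundant subset, and both are scored in-sample and out-of-sample:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic data: the first 3 columns drive the outcome, the remaining 7
# are noisy near-copies of them (strong collinearity).
n, p_useful, p_noise = 300, 3, 7
X_useful = rng.normal(size=(n, p_useful))
X_redundant = X_useful[:, rng.integers(0, p_useful, p_noise)] \
              + 0.01 * rng.normal(size=(n, p_noise))
X = np.hstack([X_useful, X_redundant])
beta_true = np.array([1.5, -2.0, 1.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X_useful @ beta_true))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# model_1: all variables, (almost) unpenalized via a very large C
model_1 = LogisticRegression(C=1e6, max_iter=5000).fit(X_tr, y_tr)
# model_2: only the subset without the collinear near-copies
model_2 = LogisticRegression(C=1e6, max_iter=5000).fit(X_tr[:, :p_useful], y_tr)

for name, Xa, Xb, ya in [("train", X_tr, X_tr[:, :p_useful], y_tr),
                         ("test ", X_te, X_te[:, :p_useful], y_te)]:
    print(name,
          log_loss(ya, model_1.predict_proba(Xa)[:, 1]),
          log_loss(ya, model_2.predict_proba(Xb)[:, 1]))
```

On the training set model_1 can never do worse than model_2, since it could always set the extra coefficients to zero, which is exactly why I don't see where the improvement is supposed to come from; presumably any difference can only show up out of sample.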
I am looking for a mathematical proof, but any insights are more than welcome. Thanks!