-1

I am looking into how the random initialisation of a model can affect its final results after tuning. This is a well-known issue for models such as neural networks or GBDTs, owing to random initialisation and stochastic learning procedures.

It seems to me that modern implementations of even simpler models, such as linear/logistic regression, can suffer from this. I suspect this because the sklearn implementation of logistic regression exposes a random seed parameter (random_state).

I know that vanilla linear regression, being a form of projection, has a unique solution. Logistic regression, as a convex optimisation problem, also has a unique solution, but because of its non-linear 'activation' it has no analytical solution. Uniqueness also depends on collinearity of the features and on any regularisation added (if X1 and X2 are perfectly collinear, the error is constant along the lines X1 + X2 = const, allowing multiple solutions; see the sketch below). Part of the problem seems to appear with stochastic learning algorithms when parameters are not set correctly (not enough learning steps), but sklearn seems to issue a warning if the algorithm didn't converge.
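To make the collinearity point concrete, here is a minimal numpy sketch (synthetic data and arbitrary coefficient values, purely for illustration): when X2 is an exact copy of X1, any two coefficient vectors with the same sum produce identical predictions and therefore identical loss, so the minimiser cannot be unique.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1])          # X2 is an exact copy of X1: perfect collinearity
y = (x1 + rng.normal(scale=2.0, size=100) > 0).astype(int)

def log_loss(beta, X, y):
    """Negative log-likelihood of a logistic model without intercept."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Two different coefficient vectors with the same sum beta1 + beta2 = 1
print(log_loss(np.array([1.0, 0.0]), X, y))
print(log_loss(np.array([0.0, 1.0]), X, y))   # same loss: the optimum is a whole line, not a point
```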

I am wondering about concrete examples where logistic regression can converge to different results with different random seeds, why that happens, whether it can happen 'silently', and whether there is a general rule of thumb to avoid it.

Do you have a concrete example of logistic regression converging to different values depending on the initial seed? (Preferably with real-life data, ideally with a serious source, and, if applicable, a problematic case where sklearn wouldn't issue a warning.)
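For reference, here is the kind of experiment I have in mind, as a sketch only: synthetic, nearly collinear data and a deliberately small max_iter so the stochastic saga solver stops early. The parameter values are arbitrary, and penalty=None assumes scikit-learn >= 1.2 (older versions spell it penalty='none'); the question is how much clf.coef_ varies across seeds in situations like this.

```python
import warnings
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-3, size=n)      # nearly collinear with x1
X = np.column_stack([x1, x2])
y = (x1 + rng.normal(size=n) > 0).astype(int)

for seed in range(3):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")       # silence the expected ConvergenceWarning
        clf = LogisticRegression(solver="saga", penalty=None, max_iter=5,
                                 random_state=seed).fit(X, y)
    print(seed, clf.coef_)                    # do the coefficients differ across seeds?
```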

Lucas Morin
  • 1,575
  • 4
    I'm not the downvote, but logistic regression (which seems to be your priority) has a convex cost function; there's a global minimum, unlike a deep network. No concrete examples of the type you describe exist, because there's only one value to converge to (any optimizer instability notwithstanding), unless there's perfect separation...that's another story. – Arya McCarthy Sep 01 '23 at 13:12
  • 1
    @AryaMcCarthy ... with conditions! With only one $X$ having separation of $Y$, the odds ratio is infinite ($\pm \infty$), and there is a range of possible intercept parameters that can maximize the likelihood. In that case if you are lucky enough to pick the right initialization, the fitter converges in 0 steps at one non-unique MLE. – AdamO Sep 01 '23 at 14:48
  • @AryaMcCarthy: thanks for your answer. This is what I was wondering: whether the newer optimizers / regularisation terms change the logic 'convex cost => unique solution'. – Lucas Morin Sep 01 '23 at 15:03
  • @AdamO: thanks for your input. I think we can set degenerate cases aside. I'd like to think that I would know / notice perfect separability before putting my data in a model. – Lucas Morin Sep 01 '23 at 15:03
  • 1
    @LucasMorin I disagree with you on this. People assume simple cross tabs and an error message will reveal the nature of this problem when fitting complex multivariate logistic models. This isn't true. It is a little-known fact that fitters tend to converge on biased ORs in multivariate analyses when data are only partly separated - research small-sample bias in LR. It's amazing that for a model as "simple" as logistic regression, several sophisticated methods are required to even fit and interpret model effects correctly. – AdamO Sep 01 '23 at 15:07
  • @LucasMorin Furthermore, I think to be successful in this area, you must state your assumptions rather than retrofit them to the scenarios you describe. Not only does it get better answers on this forum, it's practically required to communicate the type of work we do. – AdamO Sep 01 '23 at 15:09
  • @AryaMcCarthy: I changed my post to take your comment into account. It seems I misremembered the assumption. The sigmoid doesn't remove the uniqueness but prevents an analytical form of the solution. – Lucas Morin Sep 01 '23 at 15:18
  • @AdamO: that is exactly the goal of my question: to know whether any of those sophisticated methods could 'break' the uniqueness of the solution (or whether changing the seed could produce different meaningful solutions, outside of edge cases). Edit: I also removed a suggested edit that made me 'claim' wrong things. – Lucas Morin Sep 01 '23 at 15:53
  • @LucasMorin try to understand the algorithms rather than simply labelling them 'modern' or 'sophisticated'. The saga/sag algorithms are basically a form of stochastic gradient descent, aimed at large-sample problems where you are computation bound - you might not even reach the minimum. – seanv507 Sep 02 '23 at 12:25
  • @seanv507: it is not me who labelled them 'sophisticated' in the first place... this over-reading is a bit exhausting. > you might not even reach the minimum < Well, this is exactly what I am asking: to what extent can the stochastic learning process give a different answer when the seed changes? – Lucas Morin Sep 02 '23 at 15:41

2 Answers

7

Both linear and logistic regression are convex optimisation problems and behave the same way in this respect.

If the second derivative (Hessian) of the objective at the minimum is positive definite, then the minimum is unique; otherwise you have multiple solutions. L2 regularisation ensures uniqueness.

The only difference is that logistic regression may have its solution at infinity - the so-called perfect separation case.
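As a rough illustration of the separation case (a sketch with a tiny made-up dataset; penalty=None assumes scikit-learn >= 1.2): the unpenalised optimum lies at infinity, so the coefficient the solver reports depends on when it stops rather than on a true minimiser.

```python
import warnings
import numpy as np
from sklearn.linear_model import LogisticRegression

# Perfectly separated toy data: y = 1 exactly when x > 0
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

for n_iter in (10, 100, 1000):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")       # silence the expected ConvergenceWarning
        clf = LogisticRegression(penalty=None, max_iter=n_iter).fit(X, y)
    print(n_iter, clf.coef_.ravel())          # the reported coefficient grows with the iteration budget
```

With an L2 penalty on the coefficients (scikit-learn's default) the optimum is finite, which sidesteps this issue.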

seanv507
  • 6,743
  • Ok, so the random_state shouldn't impact the result at all? And discrepancies only appear when there is a convergence problem? – Lucas Morin Sep 01 '23 at 14:52
  • random_state is specific to the algorithm used to find the minimum (only liblinear, sag and saga use it in scikit-learn), and you would have to check the details of each algorithm, but yes, it would be a convergence issue. – seanv507 Sep 02 '23 at 10:15
  • Ok, now my original question: can you provide an example (dataset, paper, etc.) where this convergence didn't happen and how much it mattered? – Lucas Morin Sep 02 '23 at 16:11
5
  • The solution to OLS is not unique unless the design matrix is of full rank. If the matrix is rank deficient, there are actually several possible "solutions", obtained via a generalized inverse of $(X^{T}X)$. A generalized inverse $(X^{T}X)^{-}$ satisfies $(X^{T}X)(X^{T}X)^{-}(X^{T}X) = (X^{T}X)$; it exists but is not unique for rank-deficient matrices, and each one gives a solution to the OLS likelihood (normal) equations (a small numerical sketch follows this list).
  • In general an activation function is some smooth monotonic function that relates two nodes in a neural network. Under this condition, your assumption is true. However, a link function in a GLM may or may not satisfy this condition. Your intuition helps here: realize you are trading an infinite space of functions for a handful of parameters in the real space - this is more efficient.
  • For probability models in the exponential family of distributions, when the activation function comes from the natural parameterization (the canonical link), so that the derivative of the inverse link (the conditional mean as a function of the natural parameter) equals the variance function, Newton-Raphson or Fisher scoring will converge to the MLE from any starting value. The assumption is that the parameter is not on the boundary of the parameter space.
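A minimal numpy sketch of the first bullet (synthetic rank-deficient design, not from the original answer): the duplicated column makes $X^{T}X$ singular, and two different coefficient vectors satisfy the normal equations equally well.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1])                  # rank-deficient design: the second column duplicates the first
y = 2.0 * x1 + rng.normal(scale=0.1, size=50)

# Minimum-norm least-squares solution via the Moore-Penrose pseudoinverse of X'X
beta_pinv = np.linalg.pinv(X.T @ X) @ X.T @ y

# Another valid solution: shift weight between the two identical columns
beta_other = beta_pinv + np.array([1.0, -1.0])

for beta in (beta_pinv, beta_other):
    print(beta, np.sum((y - X @ beta) ** 2))   # identical residual sum of squares
```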
AdamO
  • 62,637
  • Thanks for this theoretical answer, it clarifies some aspects of my question. However, you are not really addressing the role of the random state in the learning process, and how it can affect (or not) the final result in meaningful cases. – Lucas Morin Sep 02 '23 at 09:51