
By reviewing the existing relevant questions I could not find the answer to this specific question.

I have created blocking variables with the one-hot method (n − 1 binary variables for a categorical variable with n levels, corresponding to the different sources of data). I then performed LASSO regression with cross-validation after z-score normalization of the predictor variables and centering of the response variable. I used Matlab for that, which automatically provides the lambda values at which the minimum mean squared error (MinMSE) is observed and at which the sparsest model within one standard error of MinMSE (1SE) occurs. In both cases the blocking variables were included in the model.
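The workflow above can be sketched in Python with scikit-learn as an analogue of the Matlab `lasso(..., 'CV', k)` call; the data, variable names, and the number of levels here are purely illustrative. Matlab reports the 1SE lambda directly, while in scikit-learn it has to be recovered from the cross-validated MSE path, as shown below.

```python
# Sketch of the described workflow: z-score the predictors, centre the
# response, and fit LASSO over a lambda path with cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))                 # continuous predictors
block = rng.integers(0, 3, size=n)          # categorical source with 3 levels
D = np.eye(3)[block][:, :2]                 # n-1 = 2 dummy (blocking) columns
X_full = np.hstack([X, D])
y = X[:, 0] - 2 * X[:, 1] + 0.5 * block + rng.normal(size=n)

Xz = (X_full - X_full.mean(0)) / X_full.std(0)   # z-score normalisation
yc = y - y.mean()                                # centred response

fit = LassoCV(cv=5).fit(Xz, yc)
# fit.alpha_ plays the role of Matlab's MinMSE lambda (up to scaling
# conventions); the 1SE lambda can be recovered from fit.mse_path_:
mse_mean = fit.mse_path_.mean(axis=1)
mse_se = fit.mse_path_.std(axis=1) / np.sqrt(fit.mse_path_.shape[1])
i_min = mse_mean.argmin()
# sparsest model (largest lambda) within one SE of the minimum MSE:
alpha_1se = max(a for a, m in zip(fit.alphas_, mse_mean)
                if m <= mse_mean[i_min] + mse_se[i_min])
```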

So the question is: how can I choose variables that are more important than the blocking variables? One method I thought of is to look for the lambda value at which the blocking coefficients become zero and to choose the variables that still remain at that point (with non-zero coefficients). This actually gave some reasonable results. My problem is that I could not find such a treatment anywhere in the literature, and I am afraid it might be dead wrong.
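The idea above can be illustrated with scikit-learn's `lasso_path` (a hypothetical stand-in for the Matlab lambda path): trace the full coefficient path, find the smallest lambda at which all blocking coefficients are still exactly zero, and report which other variables already have non-zero coefficients there. All names and data are made up for the sketch.

```python
# Sketch: select variables that enter the LASSO path before the
# blocking (dummy) variables do.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 5))
block = rng.integers(0, 3, size=n)
D = np.eye(3)[block][:, :2]              # blocking columns at indices 5, 6
X_full = np.hstack([X, D])
y = 3 * X[:, 0] - X[:, 2] + 0.3 * block + rng.normal(size=n)

Xz = (X_full - X_full.mean(0)) / X_full.std(0)
yc = y - y.mean()

alphas, coefs, _ = lasso_path(Xz, yc)    # coefs: (n_features, n_alphas)
block_idx = [5, 6]
# alphas are returned in decreasing order; take the last (smallest-alpha)
# point at which every blocking coefficient is still exactly zero
mask = np.all(coefs[block_idx, :] == 0, axis=0)
j = np.flatnonzero(mask).max()
survivors = np.flatnonzero(coefs[:, j])  # variables that precede the blocks
```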

  • If you are using LASSO and regularisation then it would be better to use $n$ dummy or indicator variables rather than $n-1$ ("blocking" is a curious machine learning neologism for something that already has two names), as it will make a difference to your results which one you leave out. You need $n-1$ in ordinary least squares regression to avoid multicollinearity, and there the choice of which to leave out essentially does not affect the results, but multicollinearity should not be an issue with regularisation. See https://stats.stackexchange.com/a/329281/2958 – Henry Aug 03 '23 at 23:39
  • Thank you, I will look into that. I doubt one more indicator variable would make a difference to the end result, however. – Oculatus Aug 04 '23 at 00:23
  • @Henry blocking is not a "curious machine learning neologism", lmao. It dates back to the very inception of statistics. Blocking is a concept in experiment design, distinct from "dummy/indicator variables", which is an encoding strategy. (But I second your point about using all $n$ variables; that makes sense to me too). – John Madden Aug 04 '23 at 19:36
  • Dear OP: intuitively, this seems reasonable. But there may be some technical issues related to the difference between significance and effect size: Lasso may well threshold significant coefficients with very small effect sizes to zero, so I'm not sure that we could argue that a variable thresholded before another one is "less significant" (if this is interpreted in the sense of obtained significance levels). – John Madden Aug 04 '23 at 19:41
  • @JohnMadden thank you for the reply. I understand your point, but if we cannot use the criterion of variable exclusion for the lambda value I chose (where all indicator variable coefficients are zero) then the same would apply to all lambda values including MinMSE and 1SE. Then what is the point of LASSO regression? – Oculatus Aug 04 '23 at 20:42
  • @Oculatus you're right: I'm referring to issues with Lasso that go beyond your particular problem instance. Just because Lasso has issues doesn't mean there's no "point" to using it. We just need to keep those in mind. – John Madden Aug 04 '23 at 20:47
  • LASSO is not for testing the statistical significance of variables: it's to help build an effective prediction model. – whuber Aug 04 '23 at 21:40
  • @JohnMadden Google nGrams seems to suggest the use of "blocking variables" became noticeable around 1970, later and still much less common than "dummy variables" or "indicator variables" – Henry Aug 04 '23 at 22:07
  • @whuber perhaps using the word "significant" in the question was wrong since my goal is to build an effective prediction model. How can I characterize the variables chosen in this context? Is "important" a more valid adjective? – Oculatus Aug 04 '23 at 23:30
  • "Useful" might be the most accurate characterization. – whuber Aug 05 '23 at 15:36

0 Answers