
I am using lasso regression to predict age (a continuous outcome) from a set of 2112 numeric features (independent variables).

The training dataset contains around 2773 participants. The mean of the training outcome variable is 62.6, and the mean of the predicted age is similar, around 62.4. I used GridSearchCV for hyperparameter tuning.
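
Roughly, my setup looks like the following sketch (simplified; `X_train`, `y_train`, and the penalty grid here are placeholders, not my actual values):

```python
# Simplified sketch of the training setup (scikit-learn assumed);
# X_train, y_train, and the alpha grid are placeholders.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10000))
param_grid = {"lasso__alpha": np.logspace(-3, 1, 30)}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)           # X_train: 2773 x 2112, y_train: age
print(search.best_params_, search.best_score_)
```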

I am applying this trained model to a number of test datasets. The means of the test datasets' outcome variable range from 61.6 to around 68.87.

However, for all of these test datasets, the mean of the predicted values converges to around 62.6 (which nearly matches the mean of the training dataset's outcome).

Is my model overfitting to the training dataset, and if so, how do I prevent this from happening?

Echo
  • Maybe, or maybe that is the best that your model can do based on your training data. Are your various test sets comparable to your train set? As in, are the participants similar in the various sets? – user2974951 May 20 '22 at 10:56
  • I am trying to measure accelerated brain aging, so the datasets are comparable in terms of age, sex, etc., but not on the presence of the disease as such. – Echo May 20 '22 at 11:19
  • I wonder if the means you mention can tell much about overfitting vs. underfitting. Variances could be more relevant. – Richard Hardy May 20 '22 at 12:02
  • The variance explained for the validation set is around 70 percent but falls to almost 5-10% for the test sets whose mean actual age is around 68, with the corresponding predicted-age mean around 62.6. – Echo May 20 '22 at 12:53
  • How many of the original 2112 features were retained with non-zero coefficients in the final LASSO model? Have you considered using ridge regression instead? – EdM May 20 '22 at 12:58
  • Out of the 2112, 388 were non-zero. I tried using SVR as well and got the same behaviour. – Echo May 20 '22 at 16:53

1 Answer


As an initial guess, overfitting to the training data set probably isn't your problem.

For linear models, Statistical Learning with Sparsity (SLS) notes on page 18:

Somewhat miraculously, one can show that for the lasso, with a fixed penalty parameter $\lambda$, the number of nonzero coefficients $k_{\lambda}$ is an unbiased estimate of the degrees of freedom

Your comment indicates that you had 388 nonzero coefficients for 2773 observations. That's about 7 observations per degree of freedom (df). Usual rules of thumb for linear regressions and continuous outcomes suggest that you can avoid overfitting if you have 10-20 cases per df that you use up. So there might be some overfitting, but it doesn't seem enough to explain the results you describe on test data.

To test overfitting of your LASSO fits on training data, you can use bootstrapping. SLS describes how to use that properly for LASSO in Section 6.2. Overfitting of the training set can be evaluated with the optimism bootstrap, in which you repeat the modeling process on multiple bootstrap samples and evaluate the difference in performance of each model between its bootstrap sample and the full training set.
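
For example, here is a rough sketch of the optimism bootstrap for a LASSO fit, assuming scikit-learn; the number of bootstrap samples and the use of $R^2$ as the performance measure are illustrative choices:

```python
# Rough sketch of the optimism bootstrap for a LASSO model (scikit-learn assumed).
# X, y are the full training data as NumPy arrays; B and the scoring are illustrative.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_lasso(X, y):
    # Repeat the *entire* modeling process, including penalty selection, each time.
    return make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000)).fit(X, y)

rng = np.random.default_rng(0)
B = 200
optimism = []
for _ in range(B):
    idx = rng.integers(0, len(y), size=len(y))              # bootstrap resample
    model = fit_lasso(X[idx], y[idx])
    perf_boot = r2_score(y[idx], model.predict(X[idx]))     # apparent performance
    perf_full = r2_score(y, model.predict(X))               # performance on full data
    optimism.append(perf_boot - perf_full)

apparent = r2_score(y, fit_lasso(X, y).predict(X))
print("optimism-corrected R^2:", apparent - np.mean(optimism))
```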

Ridge regression, which keeps all of the predictors but penalizes their coefficients, might work much better. LASSO can work well when only a small subset of predictors are strongly associated with outcome and there aren't other predictors correlated with them. If this is brain imaging or similar data, however, I suspect that there are massive correlations among your 2112 features and that each individually only has a small association with outcome. Try ridge regression, and evaluate its internal performance on the training set as suggested above for LASSO.
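
A minimal ridge sketch along the same lines, again assuming scikit-learn (the alpha grid is just an illustration):

```python
# Minimal ridge-regression sketch (scikit-learn assumed); the alpha grid is illustrative.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ridge = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-2, 4, 50)),  # keeps all 2112 features, shrinks their coefficients
)
ridge.fit(X_train, y_train)
print("training R^2:", ridge.score(X_train, y_train))
```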

I suspect, however, that your problem has more to do with omitted-variable bias; from one of your comments:

the datasets are comparable in terms of age, sex, etc., but not on the presence of the disease as such.

In linear regression, omitting a predictor that is correlated both with the outcome and with the included predictors biases the estimated coefficients of the included predictors. It sounds like "presence of the disease as such" has those characteristics and isn't included in your model. In that case, your results on the test sets might not be so surprising.
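
To make that concrete: if the true relationship is $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ but you regress $y$ on $x_1$ alone, the expected fitted slope is $\beta_1 + \beta_2\,\delta$, where $\delta$ is the slope from regressing the omitted $x_2$ on $x_1$. The bias disappears only when the omitted variable is unrelated to the outcome ($\beta_2 = 0$) or uncorrelated with the included predictor ($\delta = 0$); disease status seems unlikely to satisfy either condition here.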

EdM
  • Thanks. What I mean by the "presence of disease as such" is that the training dataset does not have participants with diabetes, whereas the test sets consist of participants with diabetes. I have tried bootstrapping and got 34 features as significant. – Echo May 21 '22 at 18:44
  • @Echo diabetes would seem to be an important predictor to include in your model, as it probably is associated with your features and it certainly is with age. You might benefit from modeling all the cases together and allowing for diabetes status (maybe even pre-test duration of diabetes) as a predictor. Depending on the nature of your test sets, you might even do better by modeling all of the sets together and using the optimism bootstrap unless you have 20,000 or more cases; see this post. And don't ignore ridge, which seems better suited to me. – EdM May 21 '22 at 18:53
  • I tried ridge regression as well and found the same issue: the mean predicted age in the test set is around 62.5 even though the actual mean age of the test set is 68. – Echo May 25 '22 at 13:04
  • @Echo the optimism bootstrap can identify whether the problem is overfitting. That said, I suspect that there is some critically important variable missing from the model, leading to omitted-variable bias. It's also possible, as a comment on the question suggested, that this is just the best that modeling your data can do. – EdM May 25 '22 at 13:23