
The goal of my project is to reduce a metabolomics dataset of around 1000 features to sets of varying size, in order to identify reduced feature sets that can serve as biomarkers for predicting changes in plasma markers in response to an intervention.

The fewer the biomarkers, the better, though there will obviously be some trade-off to be made here, hence the desire to test feature sets of various sizes.

For this, one approach we will use is Lasso.

A standard Lasso procedure for feature selection might involve:

  • Finding the optimal λ penalization value (e.g., based on prediction error)
  • Applying this to the training data
  • Evaluating the non-zero features, leaving you with a reduced feature set
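The standard procedure above can be sketched in scikit-learn roughly as follows. The data here is a synthetic stand-in generated with `make_regression`; the real metabolomics matrix would take its place:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a metabolomics matrix (samples x features)
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)
X = StandardScaler().fit_transform(X)  # Lasso is scale-sensitive

# Find the optimal lambda (called alpha in sklearn) by cross-validation
model = LassoCV(cv=5, random_state=0).fit(X, y)

# The non-zero coefficients define the reduced feature set
selected = np.flatnonzero(model.coef_)
print(f"alpha = {model.alpha_:.4f}, {len(selected)} features selected")
```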

In my case, I'd like to use Lasso for feature selection, but instead of optimizing λ and assessing non-zero features, I'd like to limit Lasso to reduce the feature set to a specified number (e.g., be left with only 5 features after λ optimization). This would allow us to test the predictive ability of feature sets of varying sizes.
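One way to get a fixed-size feature set out of the Lasso, without hand-tuning λ, is to compute the full regularization path with LARS: the active set changes by one feature at a time along the path, so you can stop at the first point where exactly k coefficients are non-zero. A minimal sketch on synthetic data (again, a placeholder for the real matrix):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

k = 5  # desired number of features

X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

# Compute the full Lasso path with the LARS algorithm
alphas, active, coefs = lars_path(X, y, method="lasso")

# coefs has shape (n_features, n_path_points); find the first path point
# where exactly k coefficients are non-zero
n_active = (coefs != 0).sum(axis=0)
idx = np.flatnonzero(n_active == k)[0]
selected = np.flatnonzero(coefs[:, idx])
print(f"lambda = {alphas[idx]:.4f}, features: {selected}")
```

Note this picks λ purely by feature count, not by prediction error, so you would still want to evaluate the resulting set on held-out data.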

My questions:

  1. Is there a way to implement this in Python? Ideally in sklearn, but I'm also open to other options.

  2. Is such an approach even necessary, and could I not instead just do the standard procedure I list above, select only the top X best features, and then use these in subsequent regressions? Although I am aware that doing so will inflate beta-coefficients and shrink p-values (see Harrell's answer), I don't think this is a problem in my situation since I don't really care about interpretability of the regression output; my main concern is that my reduced feature set can actually predict changes in the plasma marker of interest and is a reasonable size to have practical value.
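The approach in question 2 can be sketched with `SelectFromModel`, which, given `max_features` and `threshold=-np.inf`, keeps the top X features ranked by absolute Lasso coefficient; a plain regression is then refit on just those features. Data and variable names are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import train_test_split

top_x = 5

X, y = make_regression(n_samples=120, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# threshold=-np.inf ranks purely by |coef|, so exactly top_x features survive
selector = SelectFromModel(LassoCV(cv=5, random_state=0),
                           max_features=top_x, threshold=-np.inf)
X_train_red = selector.fit_transform(X_train, y_train)

# Refit OLS on the reduced set and check predictive performance on held-out data
ols = LinearRegression().fit(X_train_red, y_train)
r2 = ols.score(selector.transform(X_test), y_test)
print(f"kept {X_train_red.shape[1]} features, test R^2 = {r2:.3f}")
```

Since the stated goal is out-of-sample prediction rather than interpretable coefficients, evaluating on a held-out set (as above) is the relevant check.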

JED HK
  • You find the model with the best performance among those with an acceptable number of nonzero features. Is there something more to your question than this? // Once you do this, you might be interested in bootstrapping your data and doing this many times. This is likely to reveal instability in the selected features. // Note that selecting features via LASSO and then running OLS on those features will result in different estimates than the LASSO will give those features. Is there a reason you find this desirable? – Dave Nov 13 '22 at 20:00
  • Sorry @Dave, I will try to clarify further: "You find the model with the best performance among those with an acceptable number of nonzero features" - Agreed, usually this is how Lasso for feature selection would go. What I'd like to know is 1) can I instead make it so that Lasso can only select X number of features, and 2) WRT: "Note that selecting features via LASSO and then running OLS on those features will result in different estimates than the LASSO will give those features", I think this might not be an issue in the context of my objectives. Am I correct in thinking this? Thanks Dave :) – JED HK Nov 14 '22 at 07:47
  • The math would be to consider an additional constraint in the optimization problem that only a certain number of parameters would be nonzero. I do not know how to implement this in any software, however. 2) What are your objectives? – Dave Nov 14 '22 at 15:13
  • Me neither @Dave - I think I will stick with option 2 that I list. My objective is to find a reduced set of metabolites that are predictive of changes in plasma markers in response to an intervention. Thus, even if my coefficients and p-values are distorted, as long as predictive accuracy is high I don't think this matters too much. – JED HK Nov 16 '22 at 08:19
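Dave's bootstrap suggestion in the comments can be sketched as follows: refit the Lasso on resampled data many times and count how often each feature is selected. Low, scattered selection frequencies would reveal instability in the selected set. The fixed `alpha` and synthetic data are placeholders:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=1.0, random_state=0)
rng = np.random.default_rng(0)
n_boot = 100
counts = np.zeros(X.shape[1])

for _ in range(n_boot):
    idx = rng.integers(0, len(y), size=len(y))   # bootstrap resample
    coef = Lasso(alpha=1.0).fit(X[idx], y[idx]).coef_
    counts += coef != 0                          # tally which features survived

freq = counts / n_boot
print("selection frequency per feature:", np.round(freq, 2))
```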