The goal of my project is to reduce a metabolomics dataset of around 1000 features down to feature sets of varying size, in order to identify reduced feature sets that can serve as biomarkers for predicting changes in plasma markers in response to an intervention.
The fewer the biomarkers, the better, though there will obviously be some trade-off to be made here, hence the desire to test feature sets of various sizes.
For this, one approach we will use is Lasso.
A standard Lasso procedure for feature selection might involve:
- Finding the optimal λ penalization value (e.g., based on prediction error)
- Applying this to the training data
- Evaluating the non-zero features, leaving you with a reduced feature set
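In `sklearn`, that standard procedure might look something like the following minimal sketch (here `make_regression` is just a stand-in for my actual metabolomics matrix `X` and plasma-marker outcome `y`):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the metabolomics matrix (X)
# and the plasma-marker outcome (y)
X, y = make_regression(n_samples=100, n_features=1000, n_informative=10,
                       noise=10.0, random_state=0)

# Lasso is sensitive to feature scale, so standardize first
X = StandardScaler().fit_transform(X)

# Find the optimal lambda (called alpha in sklearn) by cross-validated
# prediction error, fitting on the training data
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

# The non-zero coefficients define the reduced feature set
selected = np.flatnonzero(lasso.coef_)
print(f"alpha = {lasso.alpha_:.4f}, {selected.size} features selected")
```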
In my case, I'd like to use Lasso for feature selection, but instead of optimizing λ and assessing the non-zero features, I'd like to constrain Lasso to reduce the feature set to a specified size (e.g., be left with only 5 features). This would allow us to test the predictive ability of feature sets of varying sizes.
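As far as I know, plain Lasso in `sklearn` has no built-in "select exactly k features" option, but one way to approximate this is to compute the full regularization path with `lasso_path` and pick the largest α whose solution keeps at most the desired number of non-zero coefficients. A sketch of that idea (reusing the placeholder data from above; `n_target` is just a hypothetical name for the desired feature-set size):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

n_target = 5  # desired size of the reduced feature set

# Placeholder data, as in the previous sketch
X, y = make_regression(n_samples=100, n_features=1000, n_informative=10,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Coefficients along a decreasing grid of alphas;
# coefs has shape (n_features, n_alphas)
alphas, coefs, _ = lasso_path(X, y, n_alphas=200)

# Number of non-zero coefficients at each alpha
n_nonzero = np.count_nonzero(coefs, axis=0)

# Among solutions with at most n_target features, take the one that
# uses the most of that budget; ties resolve to the largest alpha
candidates = np.where(n_nonzero <= n_target)[0]
best = candidates[np.argmax(n_nonzero[candidates])]

selected = np.flatnonzero(coefs[:, best])
print(f"alpha = {alphas[best]:.4f} keeps features {selected}")
```

Because the α grid is discrete, this gives "at most k" rather than "exactly k" features. A LARS-based alternative might also work, since LARS adds features one at a time: `sklearn`'s `Lars` accepts an `n_nonzero_coefs` argument, and `lars_path(..., method='lasso')` traces the Lasso path feature by feature.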
My questions:
1. Is there a way to implement this in Python? Ideally in `sklearn`, but I'm also open to other options.
2. Is such an approach even necessary? Could I not instead just do the standard procedure I list above, select only the top X features, and then use these in subsequent regressions? Although I am aware that doing so will inflate beta coefficients and shrink p-values (see Harrell's answer), I don't think this is a problem in my situation, since I don't really care about the interpretability of the regression output; my main concern is that my reduced feature set can actually predict changes in the plasma marker of interest and is of a reasonable size to have practical value.
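For concreteness, here is a sketch of how I imagine the approach in question 2 could look, wrapping the selection in a `Pipeline` so it is refit inside each cross-validation fold and never sees the held-out data (I believe setting `threshold=-np.inf` on `SelectFromModel` makes `max_features` alone limit the selection, keeping the top X coefficients by magnitude):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data, as in the earlier sketches
X, y = make_regression(n_samples=100, n_features=1000, n_informative=10,
                       noise=10.0, random_state=0)

for n_features in (5, 10, 20, 50):
    pipe = Pipeline([
        ("scale", StandardScaler()),
        # Keep the top n_features Lasso coefficients by magnitude;
        # threshold=-np.inf disables the default threshold so only
        # max_features limits the selection
        ("select", SelectFromModel(LassoCV(cv=5, random_state=0),
                                   threshold=-np.inf,
                                   max_features=n_features)),
        ("reg", LinearRegression()),
    ])
    # Selection is refit within each fold, so the scores estimate how
    # well a feature set of this size predicts on unseen samples
    scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
    print(f"{n_features:>3} features: R^2 = {scores.mean():.3f} "
          f"+/- {scores.std():.3f}")
```

My understanding is that this would not give honest coefficients or p-values, but it would give an honest estimate of out-of-sample predictive ability for each feature-set size, which is what I actually care about.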