
I want to perform feature selection on my data. I have too many features, about 50-60, relative to the number of samples.

Until today I was using the importance function of the xgboost package, but recently I was introduced to SHAP. Although it is primarily an interpretation tool, I was told it is a powerful and robust tool for feature selection, since it is based on ideas from game theory.

I wanted to make sure I'm on the right track. My code works fine; in case it helps, here it is:

    compute_and_filter_features = function(k_top, filter_features = TRUE,
                                           model = initial_model,
                                           train.x = train_x, test.x = test_x) {
      # Compute SHAP values for the training matrix
      shap_values = SHAPforxgboost::shap.values(xgb_model = model,
                                                X_train = data.matrix(train.x))
      shap_values$shap_score = as.data.frame(shap_values$shap_score)

      # Importance of each feature = sum of absolute SHAP values across samples
      feature_importance = colSums(abs(shap_values$shap_score))

      # Rank features by importance and keep at most the top k_top
      sorted_features = sort(feature_importance, decreasing = TRUE, index.return = TRUE)
      top_features = names(feature_importance)[sorted_features$ix[1:min(length(sorted_features$ix), k_top)]]

      if (filter_features) {
        train_x_filtered = train.x[, top_features]
        test_x_filtered = test.x[, top_features]
      } else {
        train_x_filtered = train.x
        test_x_filtered = test.x
      }

      # Inspect the filtered data interactively (RStudio viewer)
      View(train_x_filtered)
      View(test_x_filtered)

      return(list(train_x_filtered = train_x_filtered,
                  test_x_filtered = test_x_filtered,
                  top_features = top_features,
                  feature_importance = feature_importance))
    }
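Typical usage, assuming `initial_model`, `train_x`, and `test_x` already exist from the earlier training code (they are not defined in this post):

    # Hypothetical call; relies on the defaults initial_model, train_x, test_x
    result = compute_and_filter_features(k_top = 20)
    head(result$top_features)
    dim(result$train_x_filtered)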

  • What is your statistics question? – Dave Jul 10 '23 at 13:13
  • This is for a classifier model (xgboost). My goal is to keep only the features that are informative for prediction, that's all. @Dave – Programming Noob Jul 10 '23 at 13:31
  • So what question do you have? Cross Validated is not about code verification or debugging. – Dave Jul 10 '23 at 13:32
  • Code verification or debugging is not my question. My question: is SHAP a robust way to perform feature selection? More specifically, the SHAPforxgboost package in R. I can't find a straight answer for this anywhere. – Programming Noob Jul 10 '23 at 13:41

1 Answer


SHAP probably is not as useful as you would like.

In a keynote address at "Why R?", Frank Harrell discusses how feature selection is a mirage. While his simulations do not address SHAP in particular, they demonstrate considerable instability in selecting features. His later comments in the presentation, in response to a question about LIME and SHAP, lead to one of my favorite quotes:

Whatever method you're using, if you're afraid to calculate confidence intervals for it, you shouldn't use it.

In the presentation, he mentions bootstrap confidence intervals as a possible approach, particularly bootstrapping the ranks of feature importance.
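A rough sketch of what that could look like with the asker's setup: refit the model on bootstrap resamples, recompute the SHAP-based importance, and look at percentile intervals of each feature's rank. The objects `train_x`, `train_y`, `params`, and `nrounds` below are placeholders assumed to exist; none of them appear in the original post.

    # Sketch only: bootstrap the ranks of SHAP-based feature importance.
    library(xgboost)
    library(SHAPforxgboost)

    set.seed(1)
    B = 200
    rank_mat = matrix(NA_real_, nrow = B, ncol = ncol(train_x),
                      dimnames = list(NULL, colnames(train_x)))

    for (b in 1:B) {
      idx    = sample(nrow(train_x), replace = TRUE)
      dtrain = xgb.DMatrix(data.matrix(train_x[idx, ]), label = train_y[idx])
      fit    = xgb.train(params = params, data = dtrain, nrounds = nrounds, verbose = 0)

      sv  = shap.values(xgb_model = fit, X_train = data.matrix(train_x[idx, ]))
      imp = colSums(abs(as.data.frame(sv$shap_score)))

      # Rank 1 = most important feature in this resample
      rank_mat[b, names(imp)] = rank(-imp)
    }

    # 95% bootstrap percentile interval of each feature's importance rank
    rank_ci = t(apply(rank_mat, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE))
    head(rank_ci[order(rank_ci[, 1]), ])

If those intervals are wide (a feature ranked 3rd in one resample and 40th in another), the ranking, and hence any top-k selection built on it, is unstable.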

If you do this for your models and find that you reliably identify the important features and screen out the unimportant ones, that seems like a positive sign (maybe not definitive, but at least a positive sign). In a situation like yours, with a fairly small number of observations relative to the number of features, you probably do not have enough data to make such reliable claims. That is, I suspect the feature selection based on SHAP importance will be unstable. With the selected features bouncing all over the place as you make changes to the data (which will happen when you go to predict on new data), there is justifiable doubt that the variables selected on the training data will be the right variables for making predictions on new data (almost verbatim the phrasing I used in the context of stepwise selection in March).

Finally, I have heard of bizarre attempts at feature selection where a hugely nonlinear model like yours is used to select the features for a (generalized) linear model (such as a logistic regression). Given that a model like XGBoost will model nonlinear and interaction terms that are not part of a (generalized) linear model unless you explicitly code them to be, even stable feature selection for an XGBoost model might lead to silly features being selected for a (generalized) linear model. XGBoost might find that a feature is important because of a quadratic relationship with the outcome, but if your (generalized) linear model only uses the linear relationship and not the squared term, such a feature might be quite worthless (think about fitting a line to a parabola). You have not said that this is part of your plan, but I include it as a warning to other readers as well as to you, in case your goal is to run a (generalized) linear model on the selected features due to a lack of data.
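To make the "line fitted to a parabola" point concrete, here is a small simulated illustration (not from the answer itself): a purely quadratic predictor carries strong signal that a flexible model can exploit, while a plain linear term for it explains essentially nothing.

    # Toy illustration of the "fit a line to a parabola" warning (simulated data)
    set.seed(42)
    n = 500
    x = runif(n, -2, 2)
    y = x^2 + rnorm(n, sd = 0.3)   # purely quadratic relationship

    # A linear term for x explains almost nothing...
    summary(lm(y ~ x))$r.squared          # close to 0

    # ...while including the squared term recovers the signal
    summary(lm(y ~ x + I(x^2)))$r.squared # close to 1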

Dave
  • I see, so using SHAP could yield unstable and potentially misleading results. Knowing that feature selection can often be unstable, is there any approach you recommend that might be better than SHAP and xgb.importance (a non-linear approach, because obviously I'm using non-linear models)? – Programming Noob Jul 13 '23 at 10:47
  • My suggestion is to quantify the uncertainty in whatever feature selection technique you use. If you're afraid to calculate a confidence interval because it might expose considerable uncertainty, then you should not be using that technique. On the other hand, if such a confidence interval shows the technique to be stable (which I doubt for the amount of data to which you've alluded, but it is largely an empirical question), you can have more faith in such an approach. – Dave Jul 13 '23 at 11:00
  • Thanks! One final question please: does LightGBM even need feature selection? It should be doing its own feature selection with pruning, right? – Programming Noob Jul 13 '23 at 12:41
  • Your follow-up questions warrant their own posted questions. Remember that Cross Validated is strictly Q&A, not a discussion forum. – Dave Jul 13 '23 at 12:43
  • Cool, thank you! – Programming Noob Jul 13 '23 at 13:50