Setup
- Task: binary classification
- Models: logistic regression, SVM, ELM, neural networks - anything that can do classification
- Dataset: 10 basic features + 6 features of my own
Question
How do I see whether training some model with my 6 new features results in better performance (say, higher accuracy) than training the same model without these features?
EDIT: whenever I say "compare scores" or something like this, I mean "compare scores using out-of-sample/holdout/unseen data". In all 3 points described below, I mean that all comparisons are performed on a holdout set or using cross-validation.
Ways of doing this
- The simplest way of doing it is to train a "basic" model using just the 10 basic features, and another model using all 16 features. Then compute and compare their scores on the respective holdout sets or use cross-validation (see the first sketch after this list).
- However, the model fitted using 16 features will have more parameters than the basic model. Thus, it could be more "powerful" than the basic model simply because it has more parameters. This seems especially true for neural networks, where the basic model's first-layer weight matrix will have shape, say, `(45, 10)`, while the second model's matrix will have shape `(45, 16)`, and the elements of these matrices interact with all features at once.
- Thus, I will be comparing a bigger and potentially more powerful model to a smaller one, so I won't be able to tell why the second model outperforms the basic one: is it because the model is more complex, or is it because my features help?
- Or is this concern safe to ignore?
- I can do several things to ensure that both models have the same number of parameters:
- For example, `DatasetBasic = 10 basic + 6 zeros` and `DatasetMine = 10 basic + 6 mine`, so essentially I'll be comparing the performance of a model fitted to some data plus zeros vs. a model fitted to the same data plus my new features. The number of features in both datasets is the same, so the number of parameters in both models will be the same too.
- Or, similarly, use noise instead of zeros: `DatasetBasic = 10 basic + 6 random noise` and `DatasetMine = 10 basic + 6 mine`.
- In general, compare 6 nonsensical features to the 6 features I created and argue that my features result in better scores than the nonsensical ones (both controls appear in the first sketch after this list).
- Does it make sense to compare my features against noise or zeros? Won't the presence of noise "confuse" the model and result in poor performance, so I'll end up comparing to an a priori bad ("confused") model?
- Also, the accuracy of my SVM model sometimes increases when I add a randomly generated feature, even though the feature is clearly nonsensical. Thus, according to the model, the random feature is not nonsensical, even though I specifically created it to be useless.
- I can also use feature importances (e.g. permutation importance; see the second sketch after this list) and argue that since the importance of my features is positive, they indeed increase the score of the model.
- However, when computing feature importances, I train the model using `DatasetMine = 10 basic + 6 mine`, so I'm using my features anyway. But I'd like to see whether adding my features to the training set (somehow; I don't really know how to do this properly) improves performance, not whether permuting these features in the validation set changes anything.
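To make the first two approaches above concrete, here is a minimal sketch of what I have in mind, using scikit-learn with random placeholder data standing in for my real features (the zero- and noise-padded datasets are the controls described above):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: 10 "basic" features, 6 engineered ones, binary target.
rng = np.random.default_rng(0)
n = 1000
X_basic = rng.normal(size=(n, 10))
X_mine = rng.normal(size=(n, 6))
y = (X_basic[:, 0] + X_mine[:, 0] > 0).astype(int)

candidates = {
    "10 basic": X_basic,
    "10 basic + 6 mine": np.hstack([X_basic, X_mine]),
    # Controls with the same number of columns as the 16-feature model:
    "10 basic + 6 zeros": np.hstack([X_basic, np.zeros((n, 6))]),
    "10 basic + 6 noise": np.hstack([X_basic, rng.normal(size=(n, 6))]),
}

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # same folds for every dataset

scores = {name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
          for name, X in candidates.items()}
for name, s in scores.items():
    print(f"{name:20s} {s.mean():.4f} +/- {s.std():.4f}")

# Because the folds are identical, the per-fold scores are paired; a paired
# t-test on them is a rough but common sanity check that the gain is larger
# than fold-to-fold noise.
print(stats.ttest_rel(scores["10 basic + 6 mine"], scores["10 basic"]))
```

Repeating the noise control with several random seeds would also show how much the score fluctuates purely by chance, which seems relevant to the observation that a single random feature sometimes "improves" my SVM.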
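And a sketch of the permutation-importance idea, again on placeholder data. As noted above, this measures the effect of shuffling a column in the held-out data, which is a somewhat different question from whether adding the features to the training set helps:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data again; columns 10..15 stand in for my 6 features.
rng = np.random.default_rng(0)
n = 1000
X_basic = rng.normal(size=(n, 10))
X_mine = rng.normal(size=(n, 6))
X_all = np.hstack([X_basic, X_mine])
y = (X_basic[:, 0] + X_mine[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_all, y, test_size=0.3, random_state=0)

model = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, scoring="accuracy",
                                n_repeats=30, random_state=0)

# Drop in held-out accuracy when each of my 6 columns is shuffled.
for j in range(10, 16):
    print(f"feature {j}: {result.importances_mean[j]:+.4f} "
          f"+/- {result.importances_std[j]:.4f}")
```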
What are some standard, well-known, widely used, go-to methods of testing whether my features increase the performance of a particular model?
It's absolutely fine to compare the performance of two models of very different power/complexity, provided they are compared on a test set which was not seen during training.
Things might look a little different if you think that in production your data distribution might drift from your training distribution. In this case, you might still want to penalise more complex models. Even so, this would be a fudge factor; it would be better to do time-sensitive cross-validation to quantify this effect.
– gazza89 Jan 11 '24 at 11:26
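A minimal sketch of the time-sensitive cross-validation suggested in the comment above, assuming the rows are sorted by time (the data and model here are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data, assumed to be ordered by time.
rng = np.random.default_rng(0)
n = 1000
X_basic = rng.normal(size=(n, 10))
X_all = np.hstack([X_basic, rng.normal(size=(n, 6))])
y = (X_basic[:, 0] > 0).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = TimeSeriesSplit(n_splits=5)  # each fold trains on the past, validates on the future

for name, X in [("10 basic", X_basic), ("all 16", X_all)]:
    print(name, cross_val_score(model, X, y, cv=cv, scoring="accuracy").round(4))
```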