Setup
- Task: binary classification
- Models: logistic regression, SVM, ELM, neural networks - anything that can do classification
- Dataset: 10 basic features + 6 features of my own
Question
How do I see whether training some model with my 6 new features results in better performance (say, higher accuracy) than training the same model without these features?
EDIT: whenever I say "compare scores" or something like this, I mean "compare scores using out-of-sample/holdout/unseen data". In all 3 points described below, I mean that all comparisons are performed on a holdout set or using cross-validation.
Ways of doing this
- The simplest way of doing it is to train a "basic" model using just the 10 basic features, and another model using all 16 features. Then compute and compare their scores on the respective holdout sets or use cross-validation (see the first sketch after this list).
- However, the model fitted using 16 features will have more parameters than the basic model. Thus, it could be more "powerful" than the basic model simply because it has more parameters. This seems especially true for neural networks, where the basic model's first-layer weight matrix will have shape, say, `(45, 10)`, while the second model's matrix will have shape `(45, 16)`, and the elements of these matrices interact with all features at once.
- Thus, I will be comparing a bigger and potentially more powerful model to a smaller one, so I won't be able to tell why the second model outperforms the basic one: is it because the model is more complex, or is it because my features help?
- Or is this concern safe to ignore?
- I can do several things to ensure that both models have the same number of parameters:
- For example, `DatasetBasic = 10 basic + 6 zeros` and `DatasetMine = 10 basic + 6 mine`, so essentially I'll be comparing the performance of a model fitted to some data plus zeros vs. a model fitted to the same data plus my new features. The number of features in both datasets is the same, so the number of parameters in both models will be the same too.
- Or, similarly, use noise instead of zeros: `DatasetBasic = 10 basic + 6 random noise` and `DatasetMine = 10 basic + 6 mine`.
- In general, compare 6 nonsensical features to the 6 features I created and argue that my features result in better scores than the nonsensical ones (both controls appear in the first sketch after this list).
- Does it make sense to compare my features against noise or zeros? Won't the presence of noise "confuse" the model and result in poor performance, so I'll end up comparing to an a priori bad ("confused") model?
- Also, the accuracy of my SVM model sometimes increases when I add a randomly generated feature, even though the feature is clearly nonsensical. Thus, according to the model, the random feature is not nonsensical, even though I specifically created it to be useless.
- I can also use feature importances (e.g. permutation importance; see the second sketch after this list) and argue that since the importance of my features is positive, they indeed increase the score of the model.
- However, when computing feature importances, I train the model using `DatasetMine = 10 basic + 6 mine`, so I'm using my features anyway. But I'd like to see whether adding my features to the training set (somehow; I don't really know how to do this properly) improves performance, not whether permuting these features in the validation set changes anything.
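To make the first two approaches above concrete, here is a minimal sketch of what I have in mind, using scikit-learn with random placeholder data standing in for my real features (the zero- and noise-padded datasets are the controls described above):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: 10 "basic" features, 6 engineered ones, binary target.
rng = np.random.default_rng(0)
n = 1000
X_basic = rng.normal(size=(n, 10))
X_mine = rng.normal(size=(n, 6))
y = (X_basic[:, 0] + X_mine[:, 0] > 0).astype(int)

candidates = {
    "10 basic": X_basic,
    "10 basic + 6 mine": np.hstack([X_basic, X_mine]),
    # Controls with the same number of columns as the 16-feature model:
    "10 basic + 6 zeros": np.hstack([X_basic, np.zeros((n, 6))]),
    "10 basic + 6 noise": np.hstack([X_basic, rng.normal(size=(n, 6))]),
}

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # same folds for every dataset

scores = {name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
          for name, X in candidates.items()}
for name, s in scores.items():
    print(f"{name:20s} {s.mean():.4f} +/- {s.std():.4f}")

# Because the folds are identical, the per-fold scores are paired; a paired
# t-test on them is a rough but common sanity check that the gain is larger
# than fold-to-fold noise.
print(stats.ttest_rel(scores["10 basic + 6 mine"], scores["10 basic"]))
```

Repeating the noise control with several random seeds would also show how much the score fluctuates purely by chance, which seems relevant to the observation that a single random feature sometimes "improves" my SVM.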
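And a sketch of the permutation-importance idea, again on placeholder data. As noted above, this measures the effect of shuffling a column in the held-out data, which is a somewhat different question from whether adding the features to the training set helps:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data again; columns 10..15 stand in for my 6 features.
rng = np.random.default_rng(0)
n = 1000
X_basic = rng.normal(size=(n, 10))
X_mine = rng.normal(size=(n, 6))
X_all = np.hstack([X_basic, X_mine])
y = (X_basic[:, 0] + X_mine[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_all, y, test_size=0.3, random_state=0)

model = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, scoring="accuracy",
                                n_repeats=30, random_state=0)

# Drop in held-out accuracy when each of my 6 columns is shuffled.
for j in range(10, 16):
    print(f"feature {j}: {result.importances_mean[j]:+.4f} "
          f"+/- {result.importances_std[j]:.4f}")
```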
What are some standard, well-known, widely used, go-to methods of testing whether my features increase the performance of a particular model?
It's absolutely fine to compare the performance of two models of very different power/complexity, provided they are compared on a test set which was not seen during training.
Things might look a little different if you think that in production your data distribution might drift from your training distribution. In this case, you might still want to penalise more complex models. Even so, this would be a fudge factor; it would be better to do time-sensitive cross-validation to quantify this effect.
– gazza89 Jan 11 '24 at 11:26
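A minimal sketch of the time-sensitive cross-validation suggested in the comment above, assuming the rows are sorted by time (the data and model here are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data, assumed to be ordered by time.
rng = np.random.default_rng(0)
n = 1000
X_basic = rng.normal(size=(n, 10))
X_all = np.hstack([X_basic, rng.normal(size=(n, 6))])
y = (X_basic[:, 0] > 0).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = TimeSeriesSplit(n_splits=5)  # each fold trains on the past, validates on the future

for name, X in [("10 basic", X_basic), ("all 16", X_all)]:
    print(name, cross_val_score(model, X, y, cv=cv, scoring="accuracy").round(4))
```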