
In this answer, the author uses anova to compare two models and applies a likelihood ratio test to decide whether to keep the rank variable.

The likelihood ratio test is highly significant and we would conclude that the variable rank should remain in the model.

My question is: has ANOVA been used for feature selection at large scale? If not, why not?

Haitao Du

2 Answers


This has issues but could work if you are careful.

On the one hand, this is a form of stepwise regression, which has major drawbacks. In particular, all standard downstream inference is tainted by doing this. If you fit a model, remove insignificant features, and then fit another model on just the significant features, the p-values and confidence intervals lose their standard meaning, as they are calculated without accounting for the earlier feature-selection step. Even an in-sample measure of model performance like adjusted $R^2$ winds up biased high, since the model degrees of freedom do not account for the variable selection.
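A small simulation makes this concrete (a minimal sketch with synthetic data; all the sizes, thresholds, and the seed are arbitrary choices of mine, not from the thread). The response is pure noise, so every feature is truly irrelevant, yet the p-values obtained by refitting on the "significant" subset fall below 0.05 far more than the nominal 5% of the time:

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    # OLS with intercept; two-sided t-test p-values for each slope coefficient
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    dof = n - Xd.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(Xd.T @ Xd)
    t = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t), dof)[1:]  # drop the intercept

rng = np.random.default_rng(0)
n, p, reps = 100, 20, 200
refit_pvals = []
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)              # pure noise: no feature matters
    keep = np.where(ols_pvalues(X, y) < 0.05)[0]   # "select" significant features
    if keep.size:
        refit_pvals.append(ols_pvalues(X[:, keep], y))  # refit on the subset
refit_pvals = np.concatenate(refit_pvals)

# With valid inference, ~5% of these would fall below 0.05; after selection,
# the fraction is dramatically larger.
print(round(np.mean(refit_pvals < 0.05), 2))
```

The refit p-values are conditioned on having survived the screen, so their null distribution is no longer uniform, which is exactly why they can no longer be read at face value.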

For a reference, Frank Harrell discusses a number of issues with stepwise variable selection here. The content deals with the mathematics, not the software implementation, so the reference applies whether you use Stata or not.

Further, variable selection is notoriously unstable. If you do cross-validation or bootstrap your data set, you are likely to see selected features come and go.
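You can see this instability with a quick bootstrap sketch (again synthetic data and arbitrary settings of my own, here a simple marginal p-value screen standing in for the selection procedure): rerunning the same screen on bootstrap resamples of one data set yields many different selected subsets.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 80, 10
X = rng.normal(size=(n, p))
# Only features 0 and 1 carry (weak) signal; the rest are noise
y = 0.3 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n)

selections = []
for _ in range(30):
    idx = rng.integers(0, n, size=n)           # bootstrap resample
    Xb, yb = X[idx], y[idx]
    keep = frozenset(j for j in range(p)
                     if stats.linregress(Xb[:, j], yb).pvalue < 0.05)
    selections.append(keep)

# Many distinct subsets appear across resamples of the SAME data set
print(len(set(selections)))
```

If the procedure were stable, nearly every resample would select the same subset; instead, features come and go exactly as described.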

Finally, what do you do when you fit a model, remove insignificant features, fit a new model on just the significant features, and find that some of those features are now insignificant? Do you keep removing insignificant variables? Do you even trust the significance, in light of the above discussion about p-values and confidence intervals lacking their usual meaning?

However, if you do an out-of-sample validation, rather than relying on in-sample measures like the high-biased adjusted $R^2$, stepwise selection can be competitive with other predictive modeling strategies.
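The honest way to judge such a procedure is on held-out data. A minimal sketch (synthetic data, a simple marginal screen, and an arbitrary train/test split, all my own choices for illustration): score the screen-then-refit model by test-set MSE rather than any in-sample statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 200, 30
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = 1.0             # only 3 real signals
y = X @ beta + rng.normal(size=n)

tr, te = slice(0, 150), slice(150, None)       # simple train/test split
# Screen on the TRAINING data only
keep = [j for j in range(p)
        if stats.linregress(X[tr, j], y[tr]).pvalue < 0.05]

def fit_predict(cols):
    # OLS fit on the training rows, predictions for the test rows
    A = np.column_stack([np.ones(150), X[tr][:, cols]])
    b, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
    At = np.column_stack([np.ones(50), X[te][:, cols]])
    return At @ b

mse_full = np.mean((y[te] - fit_predict(list(range(p)))) ** 2)
mse_sel = np.mean((y[te] - fit_predict(keep)) ** 2)
print(round(mse_sel, 2), round(mse_full, 2))
```

Because the test rows play no part in either the screening or the fitting, the held-out MSE is an unbiased basis for comparing the selected model against alternatives.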

Dave

As you mentioned, anova is just conducting a likelihood ratio test for you under the hood. It is, of course, a "valid" method for variable selection, but it fails to be a "good" one in the large-scale setting.
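For concreteness, here is what that comparison computes, sketched for a Gaussian linear model on synthetic data (my own illustrative setup; R's anova for a glm with a likelihood-ratio test is the analogous calculation): twice the log-likelihood gap between the nested models, referred to a chi-squared distribution with degrees of freedom equal to the difference in parameter counts.

```python
import numpy as np
from scipy import stats

def gaussian_loglik(X, y):
    # Profile log-likelihood of a Gaussian linear model at the OLS fit
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    return -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

ll_small = gaussian_loglik(x1[:, None], y)                # without x2
ll_big = gaussian_loglik(np.column_stack([x1, x2]), y)    # with x2
lr = 2 * (ll_big - ll_small)                              # LR statistic
p_value = stats.chi2.sf(lr, df=1)                         # one extra parameter
print(lr > 0, p_value < 0.05)
```

Note that each such test compares exactly two nested models, which is precisely what makes the approach awkward when there are thousands of candidate variables.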

Variable selection methods based on testing mostly fall into the category of "subset selection", which is usually not favored for high-dimensional data. First, they are usually conducted in a "stepwise" manner, examining the significance of only one variable at a time. When there are a large number of features, this is clearly inefficient. Second, these methods usually involve fitting the model multiple times, once for each newly selected subset. When a single fit on the data set already takes a long time, the whole variable selection procedure can become incredibly time-consuming.

For high-dimensional data, shrinkage methods (also called "regularization" or "sparse" methods) are preferred. In the regression setting, the classic examples are ridge regression, the LASSO, and the elastic net. These methods are far more efficient and can produce models with strong generalization ability. Whether you are modeling for prediction or inference, they are definitely worth trying.
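To illustrate the contrast, here is a compact coordinate-descent LASSO sketch (purely illustrative, on synthetic data with settings I chose; not a production implementation): the L1 penalty zeroes out the irrelevant coefficients in a single fit, instead of testing and refitting one variable at a time.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]     # partial residual excluding j
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
    return b

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[[0, 3]] = [3.0, -2.0]   # sparse ground truth
y = X @ beta + rng.normal(size=n)

b_hat = lasso_cd(X, y, lam=0.3)
print(np.flatnonzero(np.abs(b_hat) > 1e-6))      # indices of selected features
```

One fit over all features simultaneously performs the selection, and sweeping the penalty `lam` over a grid (with cross-validation to pick it) replaces the whole sequence of hypothesis tests.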