  1. I remember reading that for certain kinds of splines, all basis functions have to be included when using feature-selection methods such as stepwise regression or shrinkage.
    The argument was that without all of the basis functions the spline could not be constructed, hence selecting only some of the basis functions would be wrong.

  2. A similar argument might hold for selecting among the one-hot-encoded levels of a nominal feature, although I cannot think of a good reason why one should not be able to simply "throw" certain levels "away".

Now I'm looking for a reference to read up on this again.

Ben373

2 Answers


Just as it is not appropriate to let the $x$ and $x^2$ terms of a quadratic regression be dropped separately, it is important to treat all basis functions as a block. If the sample size doesn't support estimating all the parameters involved, then use penalization/shrinkage/regularization to effectively give the block of basis functions fewer degrees of freedom than are apparent.
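As a minimal sketch of that idea in Python (the thread itself contains no code; the transformer, knot count, and penalty grid below are illustrative assumptions, not the answerer's prescription), the whole spline basis enters the model and a ridge penalty shrinks it as a block rather than dropping individual columns:

```python
# Illustrative sketch: keep every spline basis function and shrink the whole
# block with a ridge penalty instead of selecting individual basis columns.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)

model = make_pipeline(
    SplineTransformer(n_knots=7, degree=3),   # full basis, no columns dropped
    RidgeCV(alphas=np.logspace(-3, 3, 13)),   # shrinkage lowers the effective df
)
model.fit(x, y)
```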

Also, I'd say that dropping levels of a categorical predictor is problematic when using supervised learning (i.e., when utilizing $Y$): it results in biased (too low) standard errors and poor frequentist operating characteristics. As with nonlinear basis functions, it is better to penalize all the indicator variables involved in a multi-level categorical predictor than to drop some of the indicators.
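A hedged sketch of the same point for a categorical predictor (again not from the answer; the data, encoder settings, and penalty are made up for illustration): keep an indicator for every level and let the penalty shrink the whole block.

```python
# Illustrative sketch: one-hot encode all levels of a categorical predictor
# and penalize the full block of indicators rather than discarding levels.
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Ridge

df = pd.DataFrame({"color": ["red", "green", "blue", "red", "blue", "green"],
                   "x":     [1.0, 2.0, 3.0, 1.5, 2.5, 3.5]})
y = np.array([1.2, 0.7, 0.3, 1.1, 0.4, 0.8])

pre = make_column_transformer(
    (OneHotEncoder(drop=None, handle_unknown="ignore"), ["color"]),  # keep every level
    remainder="passthrough",
)
model = make_pipeline(pre, Ridge(alpha=1.0))  # shrinkage instead of level selection
model.fit(df, y)
```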

Frank Harrell

For #2, I can think of two arguments. Unfortunately, they contradict each other.

  1. The feature is the entire factor variable, and encoding it the way we do is just a technicality for doing the math.

  2. You can throw out levels. If your factor is dog/cat/alligator, maybe we can regard dogs and cats as the same and recode the animals as mammal/reptile; merging similar levels can help model performance by lowering the variance (fewer parameters in the model) at only a small cost in bias, since dogs and cats are quite similar (sketched below).
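Purely to illustrate that second argument (the data and mapping below are made up), collapsing similar levels trades a few parameters for a little bias:

```python
# Illustrative sketch: collapse similar factor levels before encoding,
# accepting a small bias (dogs and cats treated alike) for fewer parameters.
import pandas as pd

animals = pd.Series(["dog", "cat", "alligator", "cat", "dog"])
collapsed = animals.map({"dog": "mammal", "cat": "mammal", "alligator": "reptile"})

full_dummies = pd.get_dummies(animals)        # 3 indicator columns
collapsed_dummies = pd.get_dummies(collapsed) # 2 indicator columns
```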

I don’t have an articulate rationale right now, but I can see roughly analogous arguments for the splines.

I tend to side with the first argument and even posted something along those lines last year.

However, feature selection is notoriously unstable, so all of this is a bit dubious.

Your bounty mentions wanting a reputable source. Try Frank Harrell’s “Why R?” talk or his Regression Modeling Strategies book for discussions about feature selection stability.

Dave
  • Changing the groupings is a form of HARKing and so, at a minimum, needs to be done transparently. Even when one is interested only in prediction, this creates yet one more mechanism for overfitting. As a practical matter, I find it more useful to explore a series of splines of graduated complexity (such as increasing the number of knots until nothing more is gained), perhaps using a Lasso to suggest the spline complexity. Of course all terms in the selected spline are retained. – whuber Mar 14 '22 at 18:28