
I understand that in LASSO/Ridge it is best practice to scale covariates so that no single covariate dominates the penalized norm. However, when entering interaction terms, it is unclear whether only the constituent terms should be scaled, or the interaction should be as well. More formally,

Let $x$ be a vector of observations on a covariate and let $s(\cdot)$ be some scaling function, e.g., $s(x) = (x - \bar x)/\sigma_x$. Then:


Method 1: Scale constituent variables only

$\hat E(y) = \hat\beta_1s(x_1) + \hat \beta_2s(x_2) + \hat\beta_3[s(x_1) * s(x_2)]$


Method 2: Scale constituent variables, and interaction

$\hat E(y) = \hat\beta_1s(x_1) + \hat \beta_2s(x_2) + \hat\beta_3s(x_1 * x_2)$


As I see it, Method 1 has the issue that the interaction term is no longer guaranteed to have mean zero and unit variance, and therefore its coefficient ($\hat\beta_3$) may be over- or under-penalized in the norm relative to the others. Method 2, however, sacrifices interpretability entirely: e.g., $\partial \hat E(y)/\partial s(x_1) \neq \hat\beta_1 + \hat\beta_3 s(x_2)$, and it is not immediately clear what the correct derivative/interpretation is.
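The Method 1 concern is easy to verify numerically: when $x_1$ and $x_2$ are correlated, the product of their standardized versions has mean roughly equal to their correlation and standard deviation above one ($\sqrt{1+\rho^2}$ in the bivariate normal case). A minimal sketch, with the simulated correlation chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)  # corr(x1, x2) = 0.8 by construction

def s(x):
    """Standardize: mean zero, unit variance."""
    return (x - x.mean()) / x.std()

interaction = s(x1) * s(x2)  # Method 1's interaction column
print(interaction.mean())    # ~0.8, not 0 (equals corr(x1, x2))
print(interaction.std())     # ~1.28, not 1 (sqrt(1 + 0.8^2) here)
```

So under Method 1 the interaction column enters the penalty on a different footing than the standardized main effects, which is exactly the imbalance described above.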

So I suppose I'm wondering what people tend to do in this scenario, and why?

John

1 Answer


You are correct that Method 2 isn't readily interpretable. Something akin to Method 1 is called for.

Don't be trapped by the idea that all predictor coefficients need to be penalized by the same amount. Once you include binary or categorical predictors in the model, it's not even clear how to do that. You can make a choice about how much to penalize an interaction coefficient versus the individual coefficients for the predictors involved in the interaction. Software can allow for differential penalization of predictors, for example via the penalty.factor argument to the R glmnet() function. Apply your knowledge of the subject matter and your understanding of the distribution of the interaction products to make a reasonable choice.
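To make the differential-penalization idea concrete, here is a toy numpy analogue of what glmnet's penalty.factor does: a coordinate-descent LASSO where each coefficient $j$ carries its own weight $w_j$ in the penalty, minimizing $\frac{1}{2n}\lVert y - X\beta\rVert^2 + \lambda \sum_j w_j |\beta_j|$. This is a sketch, not glmnet itself; the data, $\lambda$, and weights are illustrative:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, penalty_factor=None, n_iter=500):
    # Coordinate descent for (1/(2n))||y - X beta||^2 + lam * sum_j w_j |beta_j|,
    # where w_j is the per-coefficient penalty factor (cf. glmnet's penalty.factor).
    n, p = X.shape
    w = np.ones(p) if penalty_factor is None else np.asarray(penalty_factor, float)
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]  # partial residual excluding j
            rho = X[:, j] @ r
            beta[j] = soft_threshold(rho, n * lam * w[j]) / col_ss[j]
    return beta

# Heavily penalize only the second predictor: it gets shrunk to zero
# while the first, with a unit penalty factor, survives essentially intact.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 2.0 * x2 + rng.normal(size=n)

beta = lasso_cd(X, y, lam=0.05, penalty_factor=[1.0, 50.0])
```

In practice you would set a smaller penalty factor on the main effects and a larger one on the interaction column (or vice versa), guided by subject-matter knowledge, rather than using the extreme weights shown here.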

If you are using variable selection via LASSO or elastic net, do make sure that any individual coefficients for predictors in maintained interaction terms are also kept in the model. See Example 4.3 in Statistical Learning with Sparsity.

EdM