
Suppose I have this model:

$$y = \beta_1x + \beta_2x^2 + \epsilon$$

I would like to fit it using OLS. In my data the correlation between $x$ and $x^2$ is $0.91$. After I rescale $x$ to zero mean and unit variance, the correlation is still $0.88$.

I'm worried that the model will be unstable, in the sense that a small change in the data will cause large changes in the coefficients. Is there anything I can do?

badmax
  • 1
    Yes: Use orthogonal polynomials which are uncorrelated by design. You'll find a number of posts on this site on them. – COOLSerdash Dec 13 '21 at 21:10
  • 1
    You have no problem here: correlations of $0.88$ and $0.91$ are not going to create issues. The reason comes down to a natural hierarchy of the variables $x$ and $x^2:$ you will either consider both of them together or just $x$ alone, but not $x^2$ alone. See https://stats.stackexchange.com/questions/304831 and https://stats.stackexchange.com/questions/28730 for extended explanations and discussion. – whuber Dec 13 '21 at 22:34

1 Answer


You might wish to consider doing a sensitivity analysis. In your case, simulate adding noise to your data, refit the model on the perturbed data, and see how much the parameters differ from the original fit. Repeating this process many times lets you build a histogram of parameter estimates under your chosen noise model (e.g. adding IID standard normal variables to your data).
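A minimal sketch of that procedure in Python, assuming NumPy is available. The data are synthetic, the names `fit_quadratic` and `noise_sensitivity` are made up for illustration, and the noise is added to $y$ only (perturbing $x$ as well would be an equally reasonable choice of noise model):

```python
import numpy as np

def fit_quadratic(x, y):
    """OLS fit of y = b1*x + b2*x^2 + eps (no intercept, as in the question)."""
    X = np.column_stack([x, x**2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def noise_sensitivity(x, y, n_reps=1000, noise_sd=1.0, seed=0):
    """Refit the model n_reps times on y plus IID normal noise.

    Returns an (n_reps, 2) array of coefficient estimates.
    """
    rng = np.random.default_rng(seed)
    betas = np.empty((n_reps, 2))
    for i in range(n_reps):
        y_noised = y + rng.normal(0.0, noise_sd, size=y.shape)
        betas[i] = fit_quadratic(x, y_noised)
    return betas

# Synthetic stand-in for the real data (assumption): a skewed x, so that
# x and x^2 remain strongly correlated even after standardizing.
rng = np.random.default_rng(42)
x = rng.exponential(1.0, size=200)
x = (x - x.mean()) / x.std()
y = 1.5 * x - 0.5 * x**2 + rng.normal(0.0, 1.0, size=200)

betas = noise_sensitivity(x, y)
print("original fit:      ", fit_quadratic(x, y))
print("mean over refits:  ", betas.mean(axis=0))
print("spread over refits:", betas.std(axis=0))
```

The columns of `betas` can be passed to any histogram routine; a spread that is small relative to the size of the coefficients suggests the fit is stable under this particular noise model.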

What if the parameters don't change much? Purely from a stability standpoint, this is great news.

For inferences about the parameters, you'll need to look a little further. You might find that the parameters are not statistically significant even though you achieve good predictive error. What might be going on in that case?

When your predictors are highly correlated, they may also be multicollinear. Multicollinearity inflates the variance, and hence the standard errors, of the affected coefficient estimates. That lowers the power of hypothesis tests about those parameters, so real effects can fail to reach statistical significance (a problem of false negatives rather than false positives). See this list of potential "remedies" to multicollinearity if you are interested in inferences about the parameters.
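To get a rough sense of scale, here is a sketch of the variance inflation factor (VIF) for the two correlations quoted in the question. It assumes the standard two-predictor result $\text{VIF} = 1/(1 - r^2)$, which holds for a model with an intercept; the no-intercept model in the question would give somewhat different numbers. The function name `vif_two_predictors` is made up for illustration:

```python
import math

def vif_two_predictors(r):
    """VIF for either predictor when there are exactly two predictors
    with sample correlation r (assumes the model includes an intercept)."""
    return 1.0 / (1.0 - r**2)

for r in (0.91, 0.88):
    vif = vif_two_predictors(r)
    # The standard error is inflated by sqrt(VIF) relative to uncorrelated predictors.
    print(f"r = {r}: VIF = {vif:.2f}, SE inflation = {math.sqrt(vif):.2f}")
# r = 0.91: VIF = 5.82, SE inflation = 2.41
# r = 0.88: VIF = 4.43, SE inflation = 2.11
```

Under these assumptions the standard errors are a little more than twice what they would be with uncorrelated predictors.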

Galen