
I want to predict risk perceptions with conspiracy beliefs and political orientation. Theoretically, I do not assume that political orientation is quadratically related to risk perceptions. My data confirms this expectation. Thus, this is my current regression model (simplified):

risk perceptions ~ conspiracy beliefs + political orientation

However, previous evidence suggested that political orientation is quadratically related to conspiracy beliefs, with greater beliefs at both ends of the political spectrum. One should note that this relationship does not hold true for my data (I have checked the bivariate plot and residuals vs. fitted values plot). Would you nonetheless recommend modifying my regression model to look like this:

risk perceptions ~ conspiracy beliefs + political orientation + political orientation^2

Thanks for your help!

  • Can you say more about how you collected data and how you are going to use the fitted model (for prediction)? To build a model which allows for nonlinearity if it's supported by the data, without making assumptions about the form of the nonlinearity, you can use splines. In fact, why not spline both conspiracy_beliefs and political_orientation? – dipetkov Jul 18 '22 at 10:21
  • @dipetkov: Peter helped me solve the issue quite well already, and I guess your suggestion (i.e., using splines) goes in the same direction. Thanks! – lina31399 Jul 18 '22 at 10:27

1 Answer


I suppose you use an ordinary least squares (OLS) model to find the average effect of some $x$ on some outcome $y$? To see whether an additional variable (or transformation) like $x^2$ benefits the model, you could a) look at the p-value of the variable(s) in question and b) inspect AIC and BIC (and possibly adjusted $R^2$) to see if the additional (quadratic) term improves model fit. Note that a non-significant p-value for the quadratic term alone does not imply that the variable is not useful!

If you find no evidence, based on descriptive statistics, p-values, AIC, or BIC, that including $x^2$ is beneficial, you have good reason to claim that this effect is negligible in your case (and that excluding the term does not cause underspecification).
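As a minimal sketch of this comparison (in Python with numpy only, with invented variable names and simulated data matching your setting, i.e., no true quadratic effect):

```python
# Hypothetical illustration: compare a linear and a quadratic OLS fit of
# risk perceptions on conspiracy beliefs + political orientation via AIC/BIC.
# All names and the data-generating process are made up for this sketch.
import numpy as np

rng = np.random.default_rng(42)
n = 200
orientation = rng.uniform(-3, 3, n)      # political orientation scale
conspiracy = rng.normal(0, 1, n)         # conspiracy beliefs
# Simulate a purely linear relationship (as in the OP's data)
risk = 0.5 * conspiracy + 0.3 * orientation + rng.normal(0, 1, n)

def ols_ic(X, y):
    """Fit OLS and return (AIC, BIC) from the Gaussian log-likelihood."""
    n_obs, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n_obs       # ML estimate of the error variance
    loglik = -0.5 * n_obs * (np.log(2 * np.pi * sigma2) + 1)
    p = k + 1                            # coefficients + sigma^2
    return -2 * loglik + 2 * p, -2 * loglik + np.log(n_obs) * p

X_lin = np.column_stack([np.ones(n), conspiracy, orientation])
X_quad = np.column_stack([X_lin, orientation**2])

aic_lin, bic_lin = ols_ic(X_lin, risk)
aic_quad, bic_quad = ols_ic(X_quad, risk)
print(f"linear:    AIC={aic_lin:.1f}  BIC={bic_lin:.1f}")
print(f"quadratic: AIC={aic_quad:.1f}  BIC={bic_quad:.1f}")
# With no true quadratic effect, the extra term should not improve the
# criteria (BIC in particular should tend to favour the simpler model).
```

In R you would fit both models with `lm()` and compare them with `AIC()`/`BIC()` (or `anova()`); the logic is the same.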

However, since quadratic terms are often a crude approximation of non-linear effects, you might inspect the data-generating process a little further (in a multivariate setting). You could use generalised additive models (GAMs) to test for non-linear effects without making assumptions about the parameterization. See ISL, Chapter 7.

Find a minimal example with simulated data here. In the example (see figure below), the GAM (black, dashed line) approximates a non-linear function (red line) quite well, where a linear fit (blue line) or a quadratic parameterization (not shown) would more or less fail to fit the data.
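The same idea can be sketched in Python with numpy only (this is not the linked R example; the non-linear function and bandwidth below are arbitrary choices). A linear fit misses the wiggles, while a simple Gaussian-kernel local regression, one of the smoothers GAMs build on, tracks the true function much more closely:

```python
# Simulated illustration: linear fit vs. a flexible local-regression smoother
# on a non-linear function. Data-generating function is invented.
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3, 3, 300))
f = np.sin(2 * x) + 0.5 * x              # "true" non-linear effect
y = f + rng.normal(0, 0.3, x.size)

# Linear fit
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat_lin = X @ beta

def loess_point(x0, x, y, bandwidth=0.3):
    """Kernel-weighted local linear fit evaluated at x0."""
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
    sw = np.sqrt(w)
    Xw = np.column_stack([np.ones_like(x), x - x0]) * sw[:, None]
    b, *_ = np.linalg.lstsq(Xw, y * sw, rcond=None)
    return b[0]                          # local intercept = fit at x0

yhat_loc = np.array([loess_point(x0, x, y) for x0 in x])

mse_lin = np.mean((yhat_lin - f) ** 2)
mse_loc = np.mean((yhat_loc - f) ** 2)
print(f"MSE vs true function: linear={mse_lin:.3f}, local regression={mse_loc:.3f}")
```

The smoother's error against the true function comes out far below the linear fit's, mirroring the figure below.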

GAMs (with local regression or splines) are not as easily interpretable as OLS coefficients would be (see also EdM's comment below). However, you could inspect your DGP in detail using a GAM and look for a good approximation to the data (and possibly find good reasons to include, or not include, a quadratic term in your OLS model).
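To illustrate the interpretability point: a regression spline with fixed knots is just OLS on extra basis columns, so it yields an explicit (if clunky) equation. A minimal numpy sketch using a cubic truncated-power basis (knot positions and toy data are arbitrary; `rcs()` in the R rms package uses a restricted variant of this idea):

```python
# Sketch: a fixed-knot cubic regression spline fit by plain OLS,
# printed as an explicit equation. Knots and data are invented.
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(-3, 3, 200))
y = np.sin(2 * x) + rng.normal(0, 0.3, x.size)

knots = np.array([-1.5, 0.0, 1.5])

def spline_basis(x, knots):
    """Cubic truncated-power basis: 1, x, x^2, x^3, (x - k)_+^3 per knot."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

X = spline_basis(x, knots)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

terms = ["1", "x", "x^2", "x^3"] + [f"(x - ({k}))_+^3" for k in knots]
eq = " + ".join(f"{b:.3f}*{t}" for b, t in zip(beta, terms))
print("fitted equation: y ≈", eq)
```

The fitted curve is smooth and flexible, yet every term has an explicit coefficient you can write down.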

[Figure: simulated data with the true non-linear function (red), a linear fit (blue), and the GAM fit (black, dashed)]

Peter
  • Thanks a lot for your help and advice regarding the GAM, Peter! – lina31399 Jul 18 '22 at 10:23
  • You have lots of good advice but why mention causality and then qualify it with quotes? The OP hasn't explained how they collected the data and their data doesn't agree with the theory based on previous research, so we can guess the data is observational and possibly biased. It might be better to not mention causal or "causal" and stick to discussing modeling associations observed in the data. – dipetkov Jul 18 '22 at 10:41
  • 1
    @dipetkov There is a distinction between "predictive" and "causal" models (at least in some disciplines such as econometrics). Causality often simply is a claim made on theoretical grounds. Sometimes I'm a little sceptical that these claims hold, so I used quotes. However, maybe it is confusing, so I'll remove this reference – Peter Jul 18 '22 at 10:49
  • 1
    A restricted cubic regression spline can provide an interpretable smooth fit (i.e., with an explicit equation, albeit not very simple) for a continuous predictor. See the rcs() function in the rms package for an implementation that allows for tests of nonlinearity and an explicit equation. You can ask for cubic regression splines as a GAM in the R mgcv package, but I don't know if that can provide an associated equation. – EdM Jul 18 '22 at 13:54
  • @EdM: Interesting, I will have a look at this. – Peter Jul 18 '22 at 14:29
  • They are explained before GAM in Chapter 7 of ISL. Note the distinction between regression splines with fixed knot positions (Section 7.4) and smoothing splines (Section 7.5). Smoothing splines involve penalization of multiple coefficients. Regression splines work directly with the analyst's choice of the number and position of knots, allowing for unpenalized coefficient estimates. The regression-spline basis functions used by rcs() seem more amenable than other implementations to providing a (somewhat) interpretable equation. – EdM Jul 18 '22 at 15:17