7

Consider the following (simplified) example of a project I am working on:

I assume that $X$ has a linear effect on $Y$. However, after plotting the relationship on a scatter plot, it looks like the relationship between $X$ and $Y$ is curvilinear. In particular, lower and higher values of $X$ are associated with lower values of $Y$, while middle values of $X$ are associated with higher values of $Y$.

One way to interpret this may be that the effect of $X$ on $Y$ is non-linear. However, I have plausible reasons to believe that confounding may be driving this inverse-U-shaped relationship: there are factors ($Z$) that both cause changes in $Y$ and push $X$ toward its intermediate values.

If I suspect that there is a linear effect but not a non-linear effect, how should I specify a model to test this? Keeping it simple, assume that I am just going to test this using regression adjustment. If I believe the non-linear correlation is spurious, is it as simple as running the regression with adjustment for confounders but without a polynomial term? Or should I include the polynomial term anyway, even though I don't suspect the causal effect is non-linear, despite what the scatter plot between $X$ and $Y$ suggests?

  • 9
    If you are correct in what you think is happening, then the curvilinear effect will disappear when you add the confounders. – Peter Flom Oct 23 '23 at 13:42

3 Answers

12

Some options:

  1. If you have strong reasons to expect linearity, you can skip the squared term. Be aware that you are making an assumption here though. Examining the residuals of the model can sometimes tell you if there is nonlinearity that you are not accounting for.

  2. Evaluate the evidence for nonlinearity by including the squared term while also including the confounders. As Peter Flom notes, if your causal story is right, including the confounders should substantially weaken any statistical evidence for the squared term. (A code sketch of options 2 and 3 follows below.)

  3. Use a GAM and allow it to uncover nonlinearity if it is there. Include the confounders in the model, of course. You'll have to think carefully about interactions - whether there is reason to expect any and how best to account for them (can be done but needs more data).

I'd generally be inclined to go with the GAM but it really depends on the specifics of your problem.

The above advice is general and the causal inference aspect makes this more complicated. I would start by drawing the DAG and making explicit your understanding of the causal pathways here.
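
To make options 2 and 3 concrete, here is a minimal sketch in R. The data frame df and the confounders z1 and z2 are hypothetical stand-ins for whatever your $Z$ actually contains:

#### Sketch of options 2 and 3 (hypothetical variable names) ####
library(mgcv)

## Option 2: quadratic term for x, adjusting for the assumed confounders
fit_quad <- lm(y ~ x + I(x^2) + z1 + z2, data = df)
summary(fit_quad)  # does the I(x^2) term survive adjustment?

## Option 3: let a GAM estimate the shape of the x effect, still adjusting
fit_gam <- gam(y ~ s(x) + z1 + z2, data = df, method = "REML")
summary(fit_gam)   # edf of s(x) near 1 suggests an approximately linear effect

If the smooth's effective degrees of freedom collapse toward 1 once the confounders are in the model, that is consistent with the nonlinearity being spurious.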

mkt
  • 2
    "Evaluate the evidence for nonlinearity by including the squared term" This is poor advice as there are infinitely many non-linear curves with a single local maximum which are not well-represented by a quadratic function. Better tools for modeling nonlinear relationships include a more flexible parametric function, perhaps a piece-wise function, or a smoothing regression such as LOWESS or a GAM as in your third point. GAMs may be specified as semi-parametric models, so the explanatory variables theorized to be causal can be included linearly, while $X$ gets nicely smoothed. – Alexis Oct 24 '23 at 17:29
  • 2
    @Alexis I would agree in most circumstances, but I think a quadratic is reasonable here because OP refers to an "inverse-U-shaped relationship". – mkt Oct 24 '23 at 20:07
  • 3
    That's fair. An examination of the shape of the letter 'U' in different typefaces will reveal that there are many possible geometries for 'inverse-U-shaped'… even without getting into asymmetries! I've encountered many empirical non-linear relationships which could be similarly described, but only once have I found one which could be reasonably estimated using a quadratic function (and that one also had theoretical reasons for appearing quadratic). – Alexis Oct 24 '23 at 20:12
  • @Alexis I agree with that and would be happy to upvote an answer expanding on that point, should you feel like writing one. I will also add a small edit about this later. – mkt Oct 25 '23 at 06:57
9

My answer is a bit of a combination of Peter's and mkt's advice. In short: if you are worried about confounding, you should model it directly, and if you suspect the relationship is not linear, you should model that too, with the understanding that there may be structure in the data. For this, mgcv may be your friend.

Take this scatterplot, for example. Part of it is relatively flat and linear, while another part shows an almost exponential relationship:

[Figure: scatterplot of the simulated data]

Fitting a GAM to this data captures some of the nonlinearity, but much of it is obscured by something in the data generating process. We can see this in how bizarrely the fitted line is drawn through the empty space here:

[Figure: GAM fit with standard error bands drawn through the middle of the point cloud]

There is some omitted variable bias going on here. In actuality, this data was generated with a two-level factor. The simulated data and initial fit are shown below:

#### Load packages #### 
library(mgcv)
library(tibble)

#### N and Scale ####
set.seed(123)
n <- 100
scale <- 2

#### Create X and F ####
x <- runif(n, 0, 1)
f1 <- function(x, b, c = 1) exp(b * x) * c
f2 <- function(x, b, c = 1) exp(b * x) * c
f1_x <- f1(x, b = 4)
f2_x <- f2(x, b = 2)
e <- rnorm(n, 0, scale)

#### Create Factors ####
fac <- as.factor(sample(1:2, n, replace = TRUE))
fac1 <- as.numeric(fac == 1)
fac2 <- as.numeric(fac == 2)

#### Estimate Y ####
y <- x + f1_x * fac1 + f2_x * fac2 + e

#### Merge Data ####
df <- tibble(y = y, x = x, fac = fac)
df

#### Plot ####
plot(x, y, pch = 23, main = "Simulated Data", xlab = "X", ylab = "Y", cex = 1.5)

#### Fit Model ####
fit <- gam(y ~ s(x), data = df, method = "REML")

#### Add Lines to Plot ####
newdata <- data.frame(x = seq(min(x), max(x), length.out = 100))
pred <- predict(fit, newdata = newdata, se = TRUE)
lines(newdata$x, pred$fit)
lines(newdata$x, pred$fit + pred$se.fit, lty = "dashed")
lines(newdata$x, pred$fit - pred$se.fit, lty = "dashed")

Now, you may notice there is actually a factor variable here, which may be useful to visualize:

#### Plot ####
plot(x,y,
     bg=c("gray","black")[fac],
     pch=23,
     main="Simulated Data",
     xlab="X",
     ylab="Y",
     cex=1.5)

[Figure: scatterplot of the simulated data with points shaded by factor level]

Now we can see where the split is coming from. There are two distinct groups, and only one of them is linear while the other is nonlinear. We can model this directly in mgcv and plot the fitted lines like so:

#### Second Fit ####
fit2 <- gam(
  y 
  ~ fac
  + s(x, by = fac),
  data = df,
  method = "REML"
)

#### New Pred Lines ####
newdata2 <- data.frame(
  x = seq(min(x), max(x), length.out = 100),
  fac = factor(1, levels = levels(df$fac))  # supply fac as a factor with the fitted levels
)
pred2 <- predict(fit2, newdata = newdata2, se = TRUE)

newdata3 <- data.frame(
  x = seq(min(x), max(x), length.out = 100),
  fac = factor(2, levels = levels(df$fac))
)
pred3 <- predict(fit2, newdata = newdata3, se = TRUE)

#### Plot ####
plot(x, y, bg = c("gray", "black")[fac], pch = 23,
     main = "Simulated Data", xlab = "X", ylab = "Y", cex = 1.5)

#### Fac 1 Lines ####
lines(newdata2$x, pred2$fit)
lines(newdata2$x, pred2$fit + pred2$se.fit, lty = "dashed")
lines(newdata2$x, pred2$fit - pred2$se.fit, lty = "dashed")

#### Fac 2 Lines ####
lines(newdata3$x, pred3$fit)
lines(newdata3$x, pred3$fit + pred3$se.fit, lty = "dashed")
lines(newdata3$x, pred3$fit - pred3$se.fit, lty = "dashed")

Now we have clearly demonstrated that this relationship is driven entirely by the factor variable, which confounded the initial fit:

[Figure: scatterplot with separate fitted lines and standard error bands for each factor level]

0

As you probably know, a linear, or first-order, dependency between $X$ and $Y$ can be detected by looking at the Pearson correlation between the two variables,

$$\rho_{X,Y}=\operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X\sigma_Y}.$$

If this is close to zero, there is no substantial linear dependency between $X$ and $Y$. As you may know, the Pearson correlation takes values in the range $[-1, 1]$, and values close to $-1$ or $1$ indicate a very strong linear dependency (in the extreme case, all data points lie on a line).

To detect higher-order dependencies between $X$ and $Y$, you could look at the correlation between higher-order polynomials of $X$ and $Y$. Inspired by how standardized moments of a distribution are calculated, you could define

$$\rho^n_{X,Y}=\operatorname{corr}((X-E[X])^n, Y),\qquad n\in\mathbb{N}$$

(where $\rho_{X,Y} = \rho^1_{X,Y}$). What you could do is look at this value for the first few integers $n \geq 2$; unless all of those are also close to zero, you have a non-linear dependency between $X$ and $Y$. A sketch of this check appears below.
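
Here is a minimal sketch of this check in R, using an assumed inverse-U relationship like the one in the question (the data-generating process is purely illustrative):

#### Hypothetical sketch: sample estimates of rho^n ####
set.seed(1)
x <- runif(500, -1, 1)
y <- -x^2 + rnorm(500, 0, 0.1)  # assumed inverse-U relationship

rho_n <- function(n, x, y) cor((x - mean(x))^n, y)
sapply(1:4, rho_n, x = x, y = y)
# rho^1 is near zero while rho^2 is strongly negative,
# flagging a non-linear dependency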

Note however that while completely independent variables $X$ and $Y$ will have $\rho^n_{X,Y}=0$ for all $n\in\mathbb{N}$, the converse is not true: if $\rho^n_{X,Y}=0$ for all $n\in\mathbb{N}$, $X$ and $Y$ can still be statistically dependent. This is, for example, the case when the point $(X,Y)$ is sampled uniformly from the unit circle or from the unit disc. A natural generalization of $\rho^n_{X,Y}$ that can detect the dependency between $X$ and $Y$ in more cases is

$$\rho^{m,n}_{X,Y}=\operatorname{corr}((X-E[X])^m, (Y-E[Y])^n),\qquad m,n\in\mathbb{N}$$

where for example $\rho^{2,2}_{X,Y} = -1$ for $(X,Y)$ sampled from the unit circle, while $\rho^{m,n}_{X,Y}$ is still zero for all $m,n\in\mathbb{N}$ when $X$ and $Y$ are statistically independent.
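
A quick numerical illustration of the unit-circle case in R (a sketch, assuming the angle is sampled uniformly):

#### Hypothetical sketch: dependency on the unit circle ####
set.seed(1)
theta <- runif(10000, 0, 2 * pi)
x <- cos(theta)
y <- sin(theta)

cor(x, y)                              # ~0: no linear dependency
cor((x - mean(x))^2, y)                # ~0: rho^2 misses it too
cor((x - mean(x))^2, (y - mean(y))^2)  # ~ -1, since x^2 + y^2 = 1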