
I am testing whether a variable predicts an outcome - firstly on its own, and then after controlling for various covariates. Since the outcome is continuous, I want to report R² as an effect size metric.

For that, I was planning to report

  • R² for the DV ~ IV model, and
  • ΔR² comparing DV ~ IV + CV minus DV ~ CV.
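In R, that computation is straightforward; here is a minimal sketch, assuming a data frame `dat` with columns `DV`, `IV`, and `CV` (placeholder names):

r2_iv    <- summary(lm(DV ~ IV, data = dat))$r.squared       # R^2 for DV ~ IV
r2_cv    <- summary(lm(DV ~ CV, data = dat))$r.squared       # R^2 for DV ~ CV
r2_full  <- summary(lm(DV ~ IV + CV, data = dat))$r.squared  # R^2 for DV ~ IV + CV
delta_r2 <- r2_full - r2_cv                                  # incremental R^2 of IV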

In some instances, the resulting ΔR² is larger than the corresponding R² - which does not seem possible given my understanding of what $R^2$ indicates. How can a variable explain a greater incremental than total share of variance? I'd be very grateful for any pointers.

  • Is $\Delta R^2$ the difference in $R^2$ between the larger and the smaller model? – Stephan Kolassa Nov 29 '22 at 16:41
  • Yes - I clarified that in the question. – Lukas Wallrich Nov 29 '22 at 16:43
  • Do you include intercept terms in these models or not? Regardless, consider the case where the DV is a linear combination of CV and IV but CV is orthogonal to DV: your $\Delta R^2$ would be 100%. Here's an example with four observations: IV = (-1,0,1,0); CV = (-0.9,-0.1,1.1,-0.1); and DV = (0.9,-1.01,0.98,-1.08). The $R^2$ for DV~IV is 0.00; the difference in $R^2$ for DV~CV and DV~IV+CV is 0.97. (All models include intercepts.) – whuber Nov 29 '22 at 16:54
  • Yes, I include intercepts. Thanks for that example, very helpful. So far, I've always been thinking about shared variance that is then parcelled out - and my intuition for this case is still not clear; some more thinking to do. – Lukas Wallrich Nov 29 '22 at 17:35
  • You cannot decompose the total variance into variance components uniquely unless the explanatory variables are uncorrelated. The (strong) correlation between IV and CV in the examples is what makes them work. – whuber Dec 01 '22 at 15:36
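For reference, whuber's four-observation example can be reproduced directly in R (`lm` fits an intercept by default):

IV <- c(-1, 0, 1, 0)
CV <- c(-0.9, -0.1, 1.1, -0.1)
DV <- c(0.9, -1.01, 0.98, -1.08)
summary(lm(DV ~ IV))$r.squared                                        # ~ 0.00
summary(lm(DV ~ IV + CV))$r.squared - summary(lm(DV ~ CV))$r.squared  # ~ 0.97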

2 Answers


First, to get this nitpick out of the way: you are not predicting, you are fitting and explaining variance. These two are very much not the same. One model might explain much more variance (and even have a higher adjusted $R^2$) but yield worse predictions than a different model; in-sample fit is a notoriously bad guide to out-of-sample predictive power. Analyzing model fit and calling it "prediction" confuses two very different things. Every time I see a psych paper discussing "prediction", I am sorely tempted to ask the authors to actually predict something and compare those predictions to a much simpler model on a true holdout sample. They would usually be shocked at how little their statistically significant findings actually improve predictive power.
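Here is a minimal sketch of such a holdout comparison with simulated data (all numbers made up, not the OP's setup): a weak but "significant" effect fitted on a training half, then compared out of sample against a mean-only benchmark.

set.seed(1)
n <- 200
x <- rnorm(n)
y <- 0.2 * x + rnorm(n)            # weak but typically "significant" effect
train <- 1:100                     # holdout split
test  <- 101:200
fit <- lm(y ~ x, subset = train)
pred <- predict(fit, newdata = data.frame(x = x[test]))
rmse_model <- sqrt(mean((y[test] - pred)^2))           # model's out-of-sample RMSE
rmse_mean  <- sqrt(mean((y[test] - mean(y[train]))^2)) # mean-only benchmark
c(rmse_model, rmse_mean)           # often surprisingly close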


Now, this is just a situation where your intuition (that a variable should not explain more variance in the presence of a covariate than without that covariate) is off. There is simply no reason for this to be true.

You can simulate simple cases like this multiple times, playing around with your simulation parameters, and check whether the property your intuition calls for holds. Here is a simple case with $n=20$ data points, a numerical IV and a two-level factor CV where it doesn't:

[scatter plot of the simulated data: DV against IV, with color and plotting symbol indicating the CV level]

Regressing DV on IV alone gives $R^2=0.01628121$. A model with IV and CV gives $R^2=0.4843284$, and one with only CV gives $R^2=0.4656828$, so

$$\Delta R^2 = 0.4843284-0.4656828 = 0.0186456, $$

larger than the single-predictor $R^2$.

Note that we are analyzing a correctly specified model containing the two main effects; you could probably construct more counterexamples with misspecification. In this case, the issue is that CV already explains a lot of variance, so when we regress only on the IV, we incorrectly believe we can explain some variance that is actually driven by CV - in the comparison of the two nested models, that variance is correctly attributed to CV.

R code:

nn <- 20
set.seed(1)
IV <- runif(nn)                              # numerical predictor
CV <- as.factor(rep(c("A","B"),each=nn/2))   # two-level factor covariate
DV <- 1*IV+2*as.numeric(CV)-0.0*IV*as.numeric(CV)+rnorm(nn,0,1)  # interaction coefficient set to 0

plot(IV,DV,col=CV,pch=19+as.numeric(CV),las=1)

summary(lm(DV~IV))$r.squared      # R^2 for DV ~ IV
summary(lm(DV~IV+CV))$r.squared   # R^2 for DV ~ IV + CV
summary(lm(DV~CV))$r.squared      # R^2 for DV ~ CV
summary(lm(DV~IV+CV))$r.squared-summary(lm(DV~CV))$r.squared  # delta R^2

Stephan Kolassa
  • Thank you - I am indeed a social psychologist and trying to get over the 'prediction' language, so that is a helpful reminder. However, I still don't get why this numerical result can be obtained - I don't doubt that it can be, since it arises from my data. Your description "when we regress on the IV, we incorrectly believe that we can explain some variance that is actually driven by the CV" would seem to imply that $R^2$ is too large and thus larger than $\Delta R^2$? – Lukas Wallrich Nov 29 '22 at 17:25
  • I would not call it "too large". It's simply that there is no reason for $R^2$ to behave as your intuition would like it to do. Also, I don't see a quick-and-easy way of classifying situations where intuition misleads us here (although there may well be such a classification). whuber's comment is also very nice. – Stephan Kolassa Nov 29 '22 at 23:14

Stephan Kolassa has an important point and provided a good example showing that this is a common phenomenon, but I think I can actually explain it. Namely, this is a kind of weak Simpson's paradox (https://en.wikipedia.org/wiki/Simpson's_paradox), one that (a) is not limited to subgroups and (b) only suppresses the "effect" rather than wholly inverting it.

Consider the classic example of grad school admission (DV) at Berkeley in the '70s, where being a woman (IV) was correlated with applying to a more competitive department (CV). The correlation between DV and IV was negative, but once adjusted for the departments it actually turned positive. However, if the correlation between IV and CV, or the adjusted correlation between CV and DV, had been smaller, we might have seen no unadjusted (marginal) correlation between IV and DV at all. That would translate to $R^2 = 0$, and then obviously $\Delta R^2 > R^2$.
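Here is a minimal simulated sketch of that boundary case (coefficients made up so that the indirect path exactly cancels the direct one in expectation):

set.seed(1)
n  <- 10000
CV <- rnorm(n)
IV <- -CV + rnorm(n)          # IV negatively correlated with CV
DV <- IV + 2 * CV + rnorm(n)  # Cov(IV, DV) = 1*Var(IV) + 2*Cov(IV, CV) = 2 - 2 = 0
summary(lm(DV ~ IV))$r.squared                                        # ~ 0
summary(lm(DV ~ IV + CV))$r.squared - summary(lm(DV ~ CV))$r.squared  # ~ 1/3 > 0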

In the example from Stephan Kolassa, the adjusted effects are both positive and the correlation between IV and CV is (purely by chance) negative, which means that the marginal effect of IV on DV is suppressed by there being fewer CV = B observations at high IV values. If we simulate the setup 1000 times, this becomes apparent:

nn <- 20
res <- replicate(1000, {
  IV <- runif(nn)
  CV <- as.factor(rep(c("A","B"),each=nn/2))
  DV <- 1*IV+2*as.numeric(CV)-0.0*IV*as.numeric(CV)+rnorm(nn,0,1)
  c(cor(as.numeric(CV), IV), (summary(lm(DV~IV+CV))$r.squared-summary(lm(DV~CV))$r.squared) - summary(lm(DV~IV))$r.squared)
})

plot(res[1, ], res[2, ], xlab = "Cor(IV, CV)", ylab = "deltaR2 - R2")
abline(h = 0, v = 0)

[scatter plot showing that $\Delta R^2 - R^2$ is negatively correlated with Cor(IV, CV)]

Now this is messy, because the $\hat{\beta}$s have high variance, and if the sign of one of the betas changes, then so does the effect. We can, however, see that if IV and CV are uncorrelated, then $\Delta R^2 - R^2 = 0$, which, if you have a strong intuition about the geometry of a linear model, follows from the Pythagorean theorem.
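That orthogonal case is easy to verify numerically: residualizing IV on CV makes the sample correlation exactly zero, and then $\Delta R^2$ equals $R^2$ up to floating-point error (a sketch with made-up data):

set.seed(1)
n  <- 100
CV <- rnorm(n)
IV <- resid(lm(rnorm(n) ~ CV))   # IV exactly uncorrelated with CV in-sample
DV <- IV + CV + rnorm(n)
summary(lm(DV ~ IV))$r.squared   # equals the delta R^2 below
summary(lm(DV ~ IV + CV))$r.squared - summary(lm(DV ~ CV))$r.squared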

If we increase the signal-to-noise ratio by using a smaller $\sigma$, the picture is much clearer:

nn <- 20
res <- replicate(1000, {
  IV <- runif(nn)
  CV <- as.factor(rep(c("A","B"),each=nn/2))
  DV <- 4*IV+2*as.numeric(CV)-0.0*IV*as.numeric(CV)+rnorm(nn,0, 0.2)
  c(cor(as.numeric(CV), IV), (summary(lm(DV~IV+CV))$r.squared-summary(lm(DV~CV))$r.squared) - summary(lm(DV~IV))$r.squared)
})

plot(res[1, ], res[2, ], xlab = "Cor(IV, CV)", ylab = "deltaR2 - R2")
abline(h = 0, v = 0)

[scatter plot showing that $\Delta R^2 - R^2$ is strongly negatively correlated with Cor(IV, CV); the points fall almost on a straight line]

In conclusion: looking at DV ~ IV only tells you about the sum of the direct effect from IV to DV and the indirect effect from IV to CV and then to DV. This indirect effect can point in the same direction ($\Delta R^2 < R^2$), in the opposite direction but with moderate size ($\Delta R^2 > R^2$), or outright overpower the direct effect, as at Berkeley, in which case $R^2$ is not a very sensible metric. The indirect effect is basically the product of the IV → CV effect and the direct effect from CV to DV, so it's positive if both of those have the same sign and negative if they don't. Please consult causal theory to figure out whether marginal effects or adjusted effects are right for you. After all, cigarettes are perfectly healthy if you adjust for the state of people's lungs.
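This decomposition is the usual omitted-variable-bias identity, and a quick simulation (coefficients made up) confirms it: the marginal slope of IV comes out as the direct effect plus (IV → CV slope) × (CV → DV effect).

set.seed(1)
n  <- 10000
IV <- rnorm(n)
CV <- 0.5 * IV + rnorm(n)          # IV -> CV path with slope 0.5
DV <- 1 * IV + 2 * CV + rnorm(n)   # direct effects 1 and 2
coef(lm(DV ~ IV))["IV"]            # ~ 1 + 0.5 * 2 = 2: direct + indirect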

P.S. I've used the term "direct effect", which should be read as "adjusted effect"; the effects here are of course just correlations, not causations.

Lukas Lohse