Why is the $R^2$ of combined data lower than that of the individual data sets?

Question

I am running a linear regression and have run into the following situation: if I divide the full time series into segments, each segment gives an $R^2$ of 60% to 80%. But when I stack these segments into a single data set and run the regression, the $R^2$ drops to just 10%. For example, with 10 years of data, running one regression per year (that is, 10 regressions) gives $R^2$ values ranging from 60% to 80%, but a single regression on the stacked data gives a model with only 10% $R^2$. I don't understand why this is happening. Can anyone help point out what the reason might be?

This is like a moderate version of Simpson's paradox. The pattern of data presumably changes over time from segment to segment, so when combined the relationship of the variables you are looking at appears more diffuse. It is possible to get the opposite effect too. — Henry, Oct 13 '22 at 16:01
Consider data that follow a sine wave with little or no error: the overall regression over one period will have an $R^2$ of $0,$ even though regressions over most subintervals will have positive $R^2,$ approaching $1$ for the smallest subintervals. For the inverse problem--high overall $R^2$ despite low $R^2$ values on subintervals--see https://stats.stackexchange.com/a/13317/919. — whuber, Oct 13 '22 at 16:34
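
The sine-wave comment above can be checked numerically. The sketch below is an illustration added here, not code from the thread. It assumes a noise-free wave sampled at 400 evenly spaced points over one period, and it uses a cosine (a sine wave whose period starts at a peak) so that the overall least-squares slope is zero by symmetry. With a different starting phase, the overall $R^2$ over a single period need not be exactly zero. The names `n_seg`, `r2_full`, and `r2_seg` are made up for this example.

```r
# Assumption: noise-free wave, one full period, starting at a peak (cosine phase)
n     <- 400
n_seg <- 8                                # number of equal-length subintervals
t     <- seq(0, 2 * pi, length.out = n)
y     <- cos(t)

# Straight-line fit over the whole period
r2_full <- summary(lm(y ~ t))$r.squared

# Straight-line fit within each subinterval of consecutive points
seg    <- rep(seq_len(n_seg), each = n / n_seg)
r2_seg <- sapply(
  split(data.frame(t = t, y = y), seg),
  function(d) summary(lm(y ~ t, data = d))$r.squared
)

r2_full          # numerically indistinguishable from 0
round(r2_seg, 3) # one value per subinterval, all far above r2_full
```

Each subinterval here is monotone, so a line tracks it closely, while over the whole period the rises and falls cancel and the best overall line is flat.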

score 0 · Answer 1 · answered Mar 10 '23 at 05:22

I think an example can tell the story quite well.

set.seed(2023)
N <- 1000
x <- runif(N, -3, 3)
y <- x^2 + rnorm(N)               # parabola plus standard normal noise
L0 <- lm(y[x < 0] ~ x[x < 0])     # fit on the left chunk only
L1 <- lm(y[x > 0] ~ x[x > 0])     # fit on the right chunk only
L <- lm(y ~ x)                    # fit on the combined data
summary(L0)$r.squared # I get 0.8176547
summary(L1)$r.squared # I get 0.7922851
summary(L)$r.squared  # I get 0.001809828

When you break the data into chunks (positive and negative values of x, in this case), each chunk has a pretty high $R^2$ score. However, combining the chunks results in a low $R^2$ score. A graph shows what's happening.


library(ggplot2)
d0 <- data.frame(
  x = x[x < 0],
  y = y[x < 0],
  Chunk = "x < 0"
)
d1 <- data.frame(
  x = x[x > 0],
  y = y[x > 0],
  Chunk = "x > 0"
)
d <- rbind(d0, d1)
ggplot(d, aes(x = x, y = y, col = Chunk)) +
  geom_point()

Of course a line is a terrible fit to a parabola, even if a line is a decent approximation to each side of the parabola.
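
To see that the combined data are not inherently harder to predict, one can let the pooled model bend. The following is a minimal sketch that regenerates the simulated data from this answer and compares three fits on the full data set: a single line, a separate line per chunk (via an interaction), and a quadratic. The `chunk` factor and the model names are illustrative choices, not part of the original code.

```r
set.seed(2023)
N <- 1000
x <- runif(N, -3, 3)
y <- x^2 + rnorm(N)

# Illustrative grouping variable marking the two sides of the parabola
chunk <- factor(ifelse(x > 0, "x > 0", "x < 0"))

models <- list(
  pooled    = lm(y ~ x),           # one line for all the data
  by_chunk  = lm(y ~ x * chunk),   # own intercept and slope for each chunk
  quadratic = lm(y ~ x + I(x^2))   # models the curvature directly
)

# R^2 of each model, all computed on the same combined data
r2 <- sapply(models, function(m) summary(m)$r.squared)
round(r2, 3)
```

The single-line model stays near zero, while the other two recover an $R^2$ comparable to the per-chunk fits. For yearly segments like those in the question, the analogous check would be to interact the predictor with a year factor, assuming the relationship really does shift from year to year.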
