I created an array of test data as a linear combination of a known random value and an unknown random value.
Y <- (Known*P) + (Unknown*(1-P))
I then made a linear regression model of Y against the Known values and extracted the $R^2$ value.
model <- lm(Y ~ Known)
measured.Rsquared <- summary(model)$r.squared
A plot of $R^2$ against P is S-shaped, not a straight line. I have read many times that $R^2$ is the fraction of explained variation, so I expected $R^2$ to track P, but the S-shaped curve means that when P is large (and so the proportion of Y explained by the Known values is high), $R^2$ is larger than P itself. For example, when P is 0.8, $R^2$ is about 0.9.
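For what it's worth, I tried a back-of-the-envelope check of that number. Assuming Known and Unknown are independent with the same variance $\sigma^2$ (which should hold for two separate sample.int(10000, 1000) draws), the population $R^2$ of the regression of Y on Known works out to

$$R^2 = \frac{\operatorname{Var}(P \cdot \mathrm{Known})}{\operatorname{Var}(Y)} = \frac{P^2 \sigma^2}{P^2 \sigma^2 + (1-P)^2 \sigma^2} = \frac{P^2}{P^2 + (1-P)^2},$$

which for $P = 0.8$ gives $0.64/0.68 \approx 0.94$, close to the roughly 0.9 I measure. So the S-shape doesn't seem to be a bug in my code; I just don't see how to reconcile it with "$R^2$ is the fraction of explained variation".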
The graph looks like this: [scatter plot of measured.Rsquared against proportion.Known, rising in an S-shaped curve]
I do not understand. Maybe I just need to read the right textbook? Any help would be welcomed!
The code is as follows; apologies for my poor R coding skills.
# Let Y be a linear combination of a known variable and an unknown
# The relationship between the fraction of Y determined by Known
# and R squared is not a straight line; the graph is S-shaped.
Repeats = 10 # number of repeats for each value of fraction
# not just one as there is some scatter
number.of.fraction.values = 99
measured.Rsquared = rep(NA, Repeats * number.of.fraction.values)
proportion.Known = rep(NA, Repeats * number.of.fraction.values)
pos=1
for (i in 1:number.of.fraction.values) {
P = i/100
for (j in 1:Repeats) {
# We generate 1000 values in the range 1-10000 for the Known cause:
Known = sample.int(10000, 1000)
# Similarly generate random values for Unknown cause:
Unknown = sample.int(10000, 1000)
# or
#Unknown <- rnorm(1000, mean=2000, sd=2500)
#We can now generate the 1000 Y values for this value of the fraction P
Y <- (Known*P) + (Unknown*(1-P))
model = lm(Y ~ Known)
measured.Rsquared[pos] = summary(model)$r.squared
proportion.Known[pos] = P
pos = pos+1
}
}
plot(proportion.Known, measured.Rsquared,
main="R squared as Known proportion increases")
It doesn't matter whether the values for Known and Unknown are generated by sample.int() or by rnorm(); the curve is S-shaped either way.
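In case anyone wants to compare against the back-of-the-envelope formula above, the theoretical curve can be drawn over the scatter plot with base R's curve(). This is just a sketch under the same assumption of independent, equal-variance causes, not part of my original script:

# Overlay the theoretical curve P^2 / (P^2 + (1 - P)^2); assumes Known and
# Unknown are independent with equal variance.
curve(x^2 / (x^2 + (1 - x)^2), from = 0, to = 1, add = TRUE, col = "red")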
