I created an array of test data as a linear combination of a known random value and an unknown random value.
Y <- (Known*P) + (Unknown*(1-P))
I then made a linear regression model of Y against the Known values and extracted the $R^2$ value.
model <- lm(Y ~ Known)
measured.Rsquared <- summary(model)$r.squared
A plot of $R^2$ against P is S-shaped, not a straight line. I have read many times that $R^2$ is the fraction of explained variation, so I expected $R^2$ to track P, but the S-shaped curve means that when P is large (and so the proportion of Y explained by the Known values is high), $R^2$ is larger than P itself. For example, when P is 0.8, $R^2$ is about 0.9.
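For what it's worth, I tried a back-of-the-envelope check of that number. Assuming Known and Unknown are independent with the same variance $\sigma^2$ (which should hold for two separate sample.int(10000, 1000) draws), the population $R^2$ of the regression of Y on Known works out to

$$R^2 = \frac{\operatorname{Var}(P \cdot \mathrm{Known})}{\operatorname{Var}(Y)} = \frac{P^2 \sigma^2}{P^2 \sigma^2 + (1-P)^2 \sigma^2} = \frac{P^2}{P^2 + (1-P)^2},$$

which for $P = 0.8$ gives $0.64/0.68 \approx 0.94$, close to the roughly 0.9 I measure. So the S-shape doesn't seem to be a bug in my code; I just don't see how to reconcile it with "$R^2$ is the fraction of explained variation".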
The graph looks like this: [scatter plot of measured.Rsquared against proportion.Known, rising in an S-shaped curve]
I do not understand. Maybe I just need to read the right textbook? Any help would be welcomed!
The code is as follows; apologies for my poor R coding skills.
# Let Y be a linear combination of a known variable and an unknown
# The relationship between the fraction of Y determined by Known
# and R squared is not a straight line; the graph is S-shaped.
Repeats = 10 # number of repeats for each value of fraction
# not just one as there is some scatter
number.of.fraction.values = 99
measured.Rsquared = rep(NA, Repeats * number.of.fraction.values)
proportion.Known = rep(NA, Repeats * number.of.fraction.values)
pos=1
for (i in 1:number.of.fraction.values) {
P = i/100
for (j in 1:Repeats) {
# We generate 1000 values in the range 1-10000 for the Known cause:
Known = sample.int(10000, 1000)
# Similarly generate random values for Unknown cause:
Unknown = sample.int(10000, 1000)
# or
#Unknown <- rnorm(1000, mean=2000, sd=2500)
#We can now generate the 1000 Y values for this value of the fraction P
Y <- (Known*P) + (Unknown*(1-P))
model = lm(Y ~ Known)
measured.Rsquared[pos] = summary(model)$r.squared
proportion.Known[pos] = P
pos = pos+1
}
}
plot(proportion.Known, measured.Rsquared,
main="R squared as Known proportion increases")
It doesn't matter whether the values for Known and Unknown are generated by sample.int() or by rnorm(); the curve is S-shaped either way.
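In case anyone wants to compare against the back-of-the-envelope formula above, the theoretical curve can be drawn over the scatter plot with base R's curve(). This is just a sketch under the same assumption of independent, equal-variance causes, not part of my original script:

# Overlay the theoretical curve P^2 / (P^2 + (1 - P)^2); assumes Known and
# Unknown are independent with equal variance.
curve(x^2 / (x^2 + (1 - x)^2), from = 0, to = 1, add = TRUE, col = "red")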
