
Why does a pair of variables with no significant correlation, and no significant intercept or slope in ordinary regression, have a highly significant regression with a high adjusted $R^2$ when the regression is forced to pass through the origin?


2 Answers


Here is an illustration that simulates $y$ and $x$ independently of each other, so that the true slope is zero. The mean of $y$ is nonzero, so that the true intercept is nonzero as well.

The LS line without intercept must pass through $(0,0)$ and will try to "catch up" with the data points as quickly as possible if $y$ has nonzero mean, which induces a clear slope (purple line). The blue line with intercept, by contrast, can start at the right level for $y$ right away, so that it "needs" no slope.

Note, however, that this example will typically exhibit a significant intercept in the model with intercept.

[Figure: scatter plot of $x$ and $y$ with the fitted line with intercept (blue), the fitted line without intercept (purple), and the true mean of $y$ (black)]

set.seed(1)                 # for reproducibility
n  <- 100
mu <- 10
y  <- rnorm(n, mean=mu)     # y has mean mu, independent of x
x  <- runif(n)              # x uniform on (0,1), so E(x) = 1/2

plot(x, y, ylim=c(0, mu+3))
abline(v=0, lty=2)          # dashed axes through the origin
abline(h=0, lty=2)
abline(lm(y~x), col="lightblue", lwd=2)  # regression with intercept
abline(lm(y~x-1), col="purple", lwd=2)   # regression through the origin
abline(h=mu, lwd=2)                      # the true mean of y
legend("bottom", legend=c("with intercept","without intercept","truth"), 
       col=c("lightblue","purple","black"), lty=1, lwd=2)
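A quick check of the two fitted models (a minimal sketch using the simulated data above) shows exactly the effect the question asks about: the slope in the no-intercept model comes out highly significant with a large $R^2$, even though $x$ and $y$ are independent.

summary(lm(y ~ x))      # slope insignificant, R^2 near zero
summary(lm(y ~ x - 1))  # slope highly significant, R^2 large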

We can also analyze the issue theoretically. Suppose the true model is $$ y_i=\alpha+\epsilon_i, $$ i.e., $$ y_i=\alpha+\beta x_i+\epsilon_i\qquad\text{with}\qquad\beta=0 $$ or $E(y_i|x_i)=E(y_i)=\alpha$.

Under this model and assuming $E(x_i\epsilon_i)=0$ for simplicity (i.e., no further misspecification than the missing intercept), the plim of the OLS estimator $\hat\beta=\sum_ix_iy_i/\sum_ix_i^2$ in a regression of $y_i$ on $x_i$ without a constant is given by \begin{align*} \text{plim}\frac{\sum_ix_iy_i}{\sum_ix_i^2}&=\text{plim}\frac{\sum_ix_i(\alpha+\epsilon_i)}{\sum_ix_i^2}\\ &=\text{plim}\frac{\frac{1}{n}\sum_ix_i(\alpha+\epsilon_i)}{\frac{1}{n}\sum_ix_i^2}\\ &=\text{plim}\frac{\alpha\frac{1}{n}\sum_ix_i+\frac{1}{n}\sum_ix_i\epsilon_i}{\frac{1}{n}\sum_ix_i^2}\\ &=\frac{\alpha E(x_i)}{E(x_i^2)} \end{align*} For example, in the numerical illustration we have $\alpha=10$, $E(x_i)=1/2$ and $E(x_i^2)=1/3$, so that $\text{plim}\,\hat\beta=10\cdot\frac{1/2}{1/3}=15$.
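This limit is easy to verify numerically (a small sketch, assuming the same distributions as in the simulation above):

set.seed(2)
n <- 1e6
y <- rnorm(n, mean=10)   # alpha = 10
x <- runif(n)            # E(x) = 1/2, E(x^2) = 1/3
sum(x*y) / sum(x^2)      # close to 10 * (1/2)/(1/3) = 15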

Hence, unless we are in one of the special cases $E(y_i)=0$ (i.e., $\alpha=0$) or $E(x_i)=0$, OLS without a constant is inconsistent for $\beta=0$: $\text{plim}\,\hat\beta\neq0$.

In the first case we do not need a sloping $\hat\beta$ anyway. In the second, a flat line is "best" for OLS: the smaller squared mistakes from positive fitted values at positive $x_i$ (in the case of a positive estimated slope) would be more than offset by the much larger squared mistakes from negative fitted values at negative $x_i$.
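The second special case can also be checked by simulation (a minimal sketch, assuming $x$ uniform on $(-1,1)$ so that $E(x_i)=0$):

set.seed(3)
x0 <- runif(1e5, min=-1, max=1)  # E(x0) = 0
y0 <- rnorm(1e5, mean=10)
sum(x0*y0) / sum(x0^2)           # close to 0: the no-intercept fit stays flat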


Basically, forcing a regression through zero behaves as if a data point at $(0,0)$ had been given infinite weight; in practice, the software simply drops the intercept term. The usual $R^2$ formula, which compares the residual sum of squares to $\sum_i(y_i-\bar y)^2$, no longer applies, and a different formula is used that compares it to $\sum_i y_i^2$ instead. This uncentered $R^2$ is typically very high. You can go to this link to get more specifics: https://www.riinu.me/2014/08/why-does-linear-model-without-an-intercept-forced-through-the-origin-have-a-higher-r-squared-value-calculated-by-r/
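A self-contained sketch of the two formulas in R (assuming simulated data like that in the other answer):

set.seed(4)
y <- rnorm(100, mean=10)
x <- runif(100)
fit0 <- lm(y ~ x - 1)            # regression through the origin
rss  <- sum(residuals(fit0)^2)
1 - rss / sum((y - mean(y))^2)   # centered R^2: negative here
1 - rss / sum(y^2)               # uncentered R^2: large
summary(fit0)$r.squared          # R reports the uncentered version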