
Don't know what else to say: I am running a first-difference panel IV model and getting a negative R squared. I imagine it has something to do with the instrumental variables but can't figure out what. I am using the ivreg function in R and have 700+ observations. Any ideas?


Having an intercept is relevant because, if my model has an intercept, any fit with a negative R squared could be improved upon (i.e., given a lower sum of squared errors) simply by setting all slope coefficients to 0 and setting the intercept equal to the mean of the dependent variable.
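For example, on made-up data (not my actual panel), OLS with an intercept can never do worse than the intercept-only fit:

set.seed(1)
y <- rnorm(100)
x <- rnorm(100)
fit0 <- lm(y ~ 1)        # intercept-only model: predicts mean(y), so R^2 = 0
fit1 <- lm(y ~ x)        # OLS with an intercept can only fit at least as well
summary(fit1)$r.squared  # guaranteed to be >= 0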


Here are the last few lines of summary(model):

Residual standard error: 5.281 on 646 degrees of freedom
Multiple R-Squared: -11.07, Adjusted R-squared: -12.28 
Wald test: 0.6517 on 65 and 646 DF,  p-value: 0.984 

1 Answer


By using an instrumental variable, you are not estimating the coefficients using vanilla ordinary least squares linear regression. Consequently, you can wind up with a higher sum of squared residuals than you would get by predicting the mean value of $y$ every time. This makes the numerator of the fraction below exceed the denominator, so the entire expression is less than zero.

$$ R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i - \hat y_i \right)^2}{\sum_{i=1}^{N} \left( y_i - \bar y \right)^2} $$

This is a standard way to calculate $R^2$, equal to the squared correlation between predicted and true values in the OLS linear regression case, and equal to the squared correlation between $x$ and $y$ in the simple linear regression case (again, assuming OLS estimation). This is the equation that allows for the "proportion of variance explained" interpretation of $R^2$, too. All of this is to say that such a formula for $R^2$ is totally reasonable and sure seems to be how your software is doing the calculation. (After all, squaring a real correlation between the predictions and true values will not result in a number less than zero, so your software is not squaring a Pearson correlation and must be doing some other calculation.)
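If you want to confirm that this is the computation, you can recompute $R^2$ by hand from the residuals and compare it with the summary output. A minimal sketch on made-up data, assuming you are using ivreg from the AER package (whose summary prints the same "Multiple R-Squared" and "Wald test" lines as yours):

library(AER)
set.seed(1)
n <- 200
z <- rnorm(n)              # instrument
u <- rnorm(n)              # unobserved confounder
x <- z + 2 * u + rnorm(n)  # endogenous regressor
y <- x + 2 * u + rnorm(n)  # outcome
iv <- AER::ivreg(y ~ x | z)
1 - sum(residuals(iv)^2) / sum((y - mean(y))^2) # R^2 computed by hand
summary(iv) # the "Multiple R-Squared" line should match the value above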

When you do estimation using a method other than OLS, it can happen that the numerator exceeds the denominator. I will demonstrate below using an estimation based on minimizing the sum of absolute residuals, but the idea is the same if you use any other non-OLS estimation technique (such as instrumental variables).

library(quantreg)
set.seed(2023)
N <- 10
x <- runif(N)
y <- rnorm(N)
L <- quantreg::rq(y ~ x, tau = 0.5) # median regression: minimizes the sum of absolute residuals
preds <- predict(L)
sum((y - preds)^2)   # sum of squared residuals; I get 7.260747
sum((y - mean(y))^2) # total sum of squares; I get 4.731334
1 - sum((y - preds)^2)/sum((y - mean(y))^2) # consequently, the R^2 is subzero at -0.5346087
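The same thing can happen with instrumental variables directly. In the sketch below (again assuming AER::ivreg and made-up data), the instrument is nearly irrelevant, so the IV slope estimate is unstable and the fit can easily be worse than just predicting the mean:

library(AER)
set.seed(2023)
n <- 50
z <- rnorm(n)             # a nearly irrelevant (weak) instrument
x <- 0.05 * z + rnorm(n)  # the instrument barely moves x
y <- x + rnorm(n)
iv <- AER::ivreg(y ~ x | z)
1 - sum(residuals(iv)^2) / sum((y - mean(y))^2) # frequently negative with an instrument this weak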