2

I'm confused by something I found when adding a variable that has no relationship with the DV, using multiple regression with four predictors and one DV ($Y$). If I regress $Y$ onto $X_1$, $X_2$, and $X_3$, the multiple $R$ is less than if I add a 4th predictor that has no relationship with $Y$. I didn't think this was possible. I've done this via a simulation and also more manually, with both shown below. What's even more confusing is that if I specify the 4th variable to have a correlation of $.2$ with the DV, the $R^2$ is less than if the 4th variable has a correlation of $0$ with the DV. How is this possible?

### via simulation ###

library(MASS)
library(psych)

rx12 = .2
rx13 = .25
rx14 = .3
rx23 = .35
rx24 = .3
rx34 = .4

rx1y = .15
rx2y = .25
rx3y = .2
rx4y = 0

corr_matrix <- matrix(c(1,    rx12, rx13, rx14, rx1y,
                        rx12, 1,    rx23, rx24, rx2y,
                        rx13, rx23, 1,    rx34, rx3y,
                        rx14, rx24, rx34, 1,    rx4y,
                        rx1y, rx2y, rx3y, rx4y, 1), nrow=5)
corr_matrix #this shows the correlation is zero#

set.seed(33)
data = as.data.frame(mvrnorm(n=1000, mu=c(.0, .0, .0, .0, 0), Sigma=corr_matrix, empirical=TRUE))
psych::corr.test(data)$r #this shows the correlation is zero#

summary(lm(V5 ~ V1 + V2 + V3, data=data))      #R^2 = .0833
summary(lm(V5 ~ V1 + V2 + V3 + V4, data=data)) #R^2 = .1044

### matrix multiplication with all 4 variables ###

corr_matrix_x <- matrix(c(1,    rx12, rx13, rx14,
                          rx12, 1,    rx23, rx24,
                          rx13, rx23, 1,    rx34,
                          rx14, rx24, rx34, 1), nrow=4)
corr_matrix_y <- matrix(c(rx1y, rx2y, rx3y, rx4y), nrow=4)
corr_matrix_y #this shows the correlation is zero#

x_inverse <- solve(corr_matrix_x)
betas <- as.matrix(x_inverse %*% corr_matrix_y)
t(betas) %*% corr_matrix_y #R^2 = .1044

### 3 variables ###

corr_matrix_x <- matrix(c(1,    rx12, rx13,
                          rx12, 1,    rx23,
                          rx13, rx23, 1), nrow=3)
corr_matrix_y <- matrix(c(rx1y, rx2y, rx3y), nrow=3)

x_inverse <- solve(corr_matrix_x)
betas <- as.matrix(x_inverse %*% corr_matrix_y)
t(betas) %*% corr_matrix_y #R^2 = .0833
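
### 4 variables, with rx4y = .2 instead of 0 ###

For completeness, a small sketch of the comparison described above: the same matrix algebra, only with the 4th predictor's correlation with $Y$ set to $.2$ rather than $0$. As noted in the comments, the resulting $R^2$ comes out around $.09$, lower than the $.1044$ obtained with a correlation of $0$.

corr_matrix_x <- matrix(c(1,    rx12, rx13, rx14,
                          rx12, 1,    rx23, rx24,
                          rx13, rx23, 1,    rx34,
                          rx14, rx24, rx34, 1), nrow=4)
corr_matrix_y <- matrix(c(rx1y, rx2y, rx3y, .2), nrow=4)  #rx4y = .2 instead of 0
x_inverse <- solve(corr_matrix_x)
betas <- as.matrix(x_inverse %*% corr_matrix_y)
t(betas) %*% corr_matrix_y #R^2 is about .09, lower than the .1044 above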

Follow-up: A friend examined this visually: the model $R^2$ is larger when the X4–Y correlation is zero than when it is small (roughly below $.2$), and only starts increasing again at correlations above about $.2$. Image below:

[image: plot of the model $R^2$ against the specified X4–Y correlation]

Drew
  • Could you explain what you mean by "having no relationship"? I cannot see anything in the code that corresponds to guaranteeing any of your variables will have a zero regression coefficient. Please note that neither the covariance matrix nor the correlation matrix ordinarily provide any of the regression coefficients. https://stats.stackexchange.com/a/108862/919 shows how to find the coefficients from the covariance matrix. – whuber Dec 28 '22 at 16:27
  • call "corr_matrix" and then run the matrix algebra to derive beta weights and multiple R. Or, call psych::corr.test(data)$r on the generated data, where it is essentially zero (-3.111379e-16) – Drew Dec 28 '22 at 16:36
  • You are examining only the univariate correlation, not the multiple regression. Look at with(as.data.frame(residuals(lm(cbind(V4, V5) ~ ., data))), {plot(V4, V5); abline(lm(V5 ~ V4))}). This exhibits a clear negative linear relationship between V4 and V5 after controlling for the effects of the other three variables. – whuber Dec 28 '22 at 16:38
  • Still having a hard time wrapping my head around it.... controlling for other variables "unleashes" an unknown relationship between X4 and Y? I still am having a hard time comprehending this: in the syntax, if you put "rx4y = .2" instead of "rx4y = 0", the R^2 is lower. – Drew Dec 28 '22 at 16:47
  • https://stats.stackexchange.com/a/46508/919 is an extended explanation of this. But you have succinctly described the key idea of multiple regression: the other variables have a profound effect on what the model means and on the apparent relationships between a given explanatory variable and the response. That's why multiple regression isn't just a bunch of univariate regressions. – whuber Dec 28 '22 at 18:21

4 Answers

3

This will always be the case until you add enough variables so that the number of independent variables equals the number of observations and the $\mathrm{R}^2 = 1$. It happens because there's a difference between the theoretical correlation between two random variables and the correlation between samples drawn from their distributions; the samples will, by the nature of randomness, (almost) certainly have nonzero correlations, and will therefore reduce the residual variance slightly, improving $\mathrm{R}^2$.

An example:

y <- rnorm(20)
x <- matrix(rnorm(400),20,20)

summary(lm(y ~ x - 1))

Call:
lm(formula = y ~ x - 1)

Residuals:
ALL 20 residuals are 0: no residual degrees of freedom!

Coefficients:
    Estimate Std. Error t value Pr(>|t|)
x1  -10.7069        NaN     NaN      NaN
x2   11.3354        NaN     NaN      NaN
... and so on ...
x20 -11.5178        NaN     NaN      NaN

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:  1,   Adjusted R-squared:  NaN
F-statistic:  NaN on 20 and 0 DF,  p-value: NA

Note that the Multiple R-squared at the bottom equals 1!

A bivariate example may help too:

x <- rnorm(100)
y <- rnorm(100)
cor(x,y)
[1] -0.09046082
jbowman
  • But in the example I shared, the rxy for variable 4 is specified as zero and the data reflect that, and there are 1000 cases. If that variable has no relationship with the DV in the observed data, how does it improve R^2? More importantly perhaps, how is the R^2 higher when variable 4 has rxy=0 as opposed to when it has rxy=.20? If you switch the syntax I shared, you'll find an R2 of .10 when rxy=0 versus .09 when rxy=.2. – Drew Dec 28 '22 at 16:10
  • The key is that it will have a correlation with the residual of the regression of the dependent variable on all the other variables, and will therefore reduce the residual variance if added to the regression. – jbowman Dec 28 '22 at 16:19
  • To see this, try adding these two lines to your code: y <- residuals(lm(V5~V1+V2+V3, data=data)); cor(y, data$V4) ... and you will see there is indeed a nonzero correlation. – jbowman Dec 28 '22 at 16:26
  • Thanks, whuber also mentioned this. That makes some sense, though I just assumed that relationship was spurious after controlling for the other Xs. Because in real life no one finds a variable correlated exactly zero with a DV, I had never thought of or seen the scenario I describe, and I assumed that a variable with a correlation of zero could add nothing new to a regression equation. Once again, I'm still very confused how if that same variable was rxy=.2 instead of .0, the R^2 is lowered – Drew Dec 28 '22 at 16:52
  • a) Random chance, and b) the important thing is the correlation with the residuals from a regression with all the other variables, not the correlation with the dependent variable. Math and randomness together do generate non-intuitive results sometimes! – jbowman Dec 28 '22 at 17:58
  • In your first sentence I think you have dependent and independent the wrong way round. (This adds fuel to my campaign to remove both terms from our discussions.) – Nick Cox Dec 28 '22 at 18:05
  • @NickCox - right, thanks for the catch. I prefer "target", but try to match my choices with those of the OP... maybe I should stop doing that. – jbowman Dec 28 '22 at 18:07
  • I will use just about any alternative to independent variable, sometimes predictor or covariate. (I accept that many want to use predictor for the entire beast $Xb$ or an equivalent.) – Nick Cox Dec 28 '22 at 18:24
  • Thanks again. Regarding random chance, I embedded all this in a simulation where the X intercorrelations and rxy correlations varied across 1000 runs, with consistently the same results. Learned something new today that a variable with no relation to Y can improve R2 when added to the model. Still can't wrap my head around how when that same variable is specified as rxy=.2 instead of 0, the results are worse. That is not intuitive. Somehow that rxy=0 accounts for more unique variance. – Drew Dec 28 '22 at 18:38
1
  • A variable that has zero correlation with $Y$ can improve the model.
  • The correlation with $Y$ does not indicate by how much the model will improve.
  • A weaker correlation can be better.

The following example illustrates what can be going on.

Let the true relationship be

$Y = a + b X_1 + \epsilon$

Suppose that, instead of $X_1$, the variable we use in the regression is a slightly changed version $Z_1 = X_1 + Z_2$, where $Z_2$ has no relationship with $Y$ but does correlate with $Z_1$.

A fit with $Z_1$ will then give a lower $R^2$ than a fit with $X_1$. Adding the variable $Z_2$ to the regression can correct for this by cancelling out the part of $Z_1$ that is not actually modelling $Y$. The extra variable $Z_2$ does not need to have a direct relationship with $Y$; it can also work by having a relationship with $Z_1$ and $Y$ combined.
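
A minimal simulation sketch of this mechanism (the variable names and coefficient values below are only illustrative):

set.seed(1)
n  <- 10000
x1 <- rnorm(n)
z2 <- rnorm(n)                       # unrelated to Y by construction
y  <- 2 + 1.5 * x1 + rnorm(n)
z1 <- x1 + z2                        # the "contaminated" version of X1 that we observe

cor(y, z2)                           # essentially zero
summary(lm(y ~ z1))$r.squared        # about 0.35: attenuated by the Z2 noise in Z1
summary(lm(y ~ z1 + z2))$r.squared   # about 0.69: close to the fit with x1 itself

The second model recovers the original relationship because the coefficient on $Z_2$ comes out as roughly minus the coefficient on $Z_1$, cancelling the part of $Z_1$ that is unrelated to $Y$.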

> What's even more confusing is that if I specify the 4th variable to have a correlation of $.2$ with the DV, the $R^2$ is less than if the 4th variable has a correlation of $0$ with the DV.

We can visualize this by considering two explanatory variables $x_1$ and $x_2$, so that the vectors can be pictured in 3D.

[image: the vector $y$, its projection $\hat{y}$, and the plane spanned by $x_1$ and $x_2$]

The fitted vector $\hat{y}$ will be a vector inside the plane spanned by $x_1$ and $x_2$. The square of the correlation between $\hat{y}$ and $y$ is the $R^2$ value. This value is highest ($R^2 = 1$) when $y$ is inside the plane.

Now fix the correlation between $y$ and $x_1$; the vector $y$ then lies on a circle around the vector $x_1$. For different points on that circle the vector $y$ has different correlations with $x_2$. In the example the vector $y$ has zero correlation with $x_2$ exactly when it lies inside the plane. This is the extreme case in which a correlation of $0$ results in the highest $R^2$. Changing the correlation between $y$ and $x_2$ away from $0$ (by choosing another position on the circle) will decrease the $R^2$ value.
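
To connect this picture with the follow-up plot in the question, here is a small sketch (reusing the correlation values from the question; the grid of $X_4$–$Y$ correlations is an illustrative choice) that computes the $R^2$ implied by the correlation matrix for the four-predictor model as that correlation is varied. The curve first dips and then rises again, which is the circle argument above in numbers.

rx12 <- .2; rx13 <- .25; rx14 <- .3; rx23 <- .35; rx24 <- .3; rx34 <- .4
rx1y <- .15; rx2y <- .25; rx3y <- .2
Rxx <- matrix(c(1,    rx12, rx13, rx14,
                rx12, 1,    rx23, rx24,
                rx13, rx23, 1,    rx34,
                rx14, rx24, rx34, 1), nrow = 4)
r2_for_rx4y <- function(rx4y) {
  rxy <- c(rx1y, rx2y, rx3y, rx4y)
  drop(t(rxy) %*% solve(Rxx) %*% rxy)   # R^2 implied by the correlation matrix
}
rx4y_grid <- seq(0, .6, by = .05)
r2_grid <- sapply(rx4y_grid, r2_for_rx4y)
round(r2_grid, 4)                       # starts at .1044, dips, then rises again
plot(rx4y_grid, r2_grid, type = "b",
     xlab = "specified correlation of X4 with Y",
     ylab = "R^2 of the 4-predictor model")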

  • The OP sounds more confused about why adding a predictor variable which is correlated with Y would get a lower overall R2 than your Z2 case, which is uncorrelated with Y but accounts for more variance of Y overall. Adding this piece could improve the theoretical intuition about such a common phenomenon without any need for empirical simulation. – cinch Dec 28 '22 at 21:33
  • @mohottnad maybe I am reading it completely wrong. You say that the OP wonders about "why adding a predictor variable which is correlated with Y would get a lower overall R2" but I read the question as "why adding a predictor variable which is not correlated with Y would give a higher R2" – Sextus Empiricus Dec 28 '22 at 21:37
  • See last comment/complaint from the accepted answer above. "Still can't wrap my head around how when that same variable is specified as rxy=.2 instead of 0, the results are worse. That is not intuitive. Somehow that rxy=0 accounts for more unique variance." – cinch Dec 28 '22 at 22:16
  • Yes, I am indeed wondering exactly what you describe here. None of the answers on the page really address this, and it would be really helpful if there was some answer to this conundrum - @mohottnad – Drew Dec 28 '22 at 23:21
  • @Drew could you change the title of your question in that case. – Sextus Empiricus Dec 29 '22 at 08:58
  • It's worth noting here that, for your cosine-similarity explanation, all three random variables should be in their respective centered versions, if not their standardized versions. – cinch Dec 29 '22 at 23:49
  • @mohottnad you mean that $y$ does not just describe that circle, which relates to a fixed length of $y$, but also an entire cone? – Sextus Empiricus Dec 30 '22 at 07:20
  • I mean you seem to invoke the cosine-similarity formula to geometrically explain/represent the correlation-coefficient formula between 2 random variables, which only makes sense when the means of r.v. X and Y are both 0. Btw it seems you have a bent for interpreting r.v.'s and statistical sample data in a non-empirical, geometric, linear-algebra, top-down way, which funnily enough sounds contrary to your displayed name!? – cinch Dec 30 '22 at 22:37
  • @mohottnad I considered the actual realisations of the variables. When we fit a model to some vector $y$ with the least squares method, then this is equivalent to finding a projection of the vector $y$ onto the plane spanned by the variables $x_i$. That's what the image represents. And, if I am not mistaken, the correlations in the question seem to be the actual correlations between the sampled variables, and not the population correlation matrix. The sampling for the example in the question (not my idea) is done in a way that the correlations are guaranteed to be equal to corr_matrix. – Sextus Empiricus Dec 31 '22 at 11:16
0

You should consider the so-called adjusted R-squared, which imposes a penalty for adding independent variables to a model. The unadjusted R-squared can never fall when a new independent variable is added to a regression equation, because the sum of squared residuals (SSR) never goes up (and usually falls) as more independent variables are added (assuming we use the same set of observations). The adjusted R-squared can be calculated as follows:
$$ \overline{R}^2=1-\frac{(1-R^2)(n-1)}{n-k-1} $$
where $R^2$ is the unadjusted R-squared, $k$ is the number of independent variables and $n$ is the sample size.
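
For example, plugging in the two fits from the question ($R^2 = .0833$ with $k = 3$, and $R^2 = .1044$ with $k = 4$, both with $n = 1000$) gives a quick sketch of the adjustment:

adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)
adj_r2(.0833, n = 1000, k = 3)   # about 0.081
adj_r2(.1044, n = 1000, k = 4)   # about 0.101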

Barbab
  • Thanks, but that is irrelevant to the question. I'm trying to understand how the scenario above is even possible. – Drew Dec 28 '22 at 16:00
  • It is not irrelevant: the more related the independent variables are, the less explanatory power they have on the dependent variable, and you should not consider the unadjusted r-squared since it increases as the number of independent variables increases. – Barbab Dec 28 '22 at 16:07
  • The answer doesn't address the actual question. Sure, adjusted R^2 can be useful, but it doesn't provide an answer in respect to the phenomena being described (R^2 in sample). I want to understand why the above is occurring, not options to use in practice. – Drew Dec 28 '22 at 16:13
0

It would be so nice if the regression machinery could just pick up the lack of a relationship. Unfortunately, all it does is see potential explainers of the variability, and the algorithm figures out the coefficients that result in the smallest sum of squared errors. That can result in fitting features to the noise (error term). This is related to, if not a light version of, the “overfitting” discussed in machine learning circles.
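
As a small illustration of that point (with purely made-up noise predictors), in-sample $R^2$ keeps creeping upward as unrelated predictors are added, because least squares happily uses them to chase the error term:

set.seed(123)
n <- 100
y <- rnorm(n)
noise <- matrix(rnorm(n * 10), n, 10)   # 10 predictors with no relationship to y
r2 <- sapply(1:10, function(k) summary(lm(y ~ noise[, 1:k, drop = FALSE]))$r.squared)
round(r2, 3)                            # non-decreasing as more noise columns are added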

Dave
  • Well, it is different from overfitting because I care just about the R^2 in the calibration sample in this case (not a holdout) - I had no idea that a variable with rxy=0 could improve R^2 at all. Once again, how the heck does that same variable, when rxy=.2, get a worse R^2? Never seen anything written on this. – Drew Dec 28 '22 at 16:39