If I have three variables:
- $x$
- $y$
- $z$
...how would I go about calculating an "adjusted $z$" measure that has the variation in $z$ that is explained by $x$ and $y$ removed?
It depends on what exactly you mean by the variance of $Z$ after you account for the variance explained by $X$ and $Y$. However, a pretty reasonable interpretation of this relates to the $R^2$ of a linear regression. Use your $X$ and $Y$ to predict $Z$, as in the following model.
$$ \mathbb E[Z_i] = \beta_0 + \beta_1X_i + \beta_2 Y_i $$
Then use ordinary least squares to estimate the $\beta$ coefficients.
Next, calculate the $R^2$ of this model. Under these conditions, with a linear model estimated by ordinary least squares, $R^2$ is interpreted as the proportion of the variance of $Z$ that is explained by the regression, hence by your predictor variables $X$ and $Y$. This is covered in a standard regression book like Agresti (2015). Consequently, $\left(1-R^2\right)$ is the proportion of variance in $Z$ that is not explained by the regression. You have $Z$, so you can calculate $\text{var}(Z)$. Now multiply to get $\left(1-R^2\right)\text{var}(Z)$, the variance of $Z$ that remains unexplained.
This can be performed in software. Using the R code below, I get that the unexplained variance of $Z$ is $0.9792228$.
set.seed(2023)
N <- 100 # Sample size
x <- rnorm(N) # Variable X
y <- rnorm(N) # Variable Y
z <- x - y + rnorm(N) # Variable Z
L <- lm(z ~ x + y) # Fit the linear model, estimated via
# ordinary least squares
r2 <- summary(L)$r.squared # Calculate R-squared of the regression
var(z) * (1 - r2) # Variance of Z that is not explained by X and Y
However, this is equal to the variance of the residuals, calculated via var(resid(L)). This is because $R^2$ involves the variance of the residual term (the estimated variance of $Z$, conditional on $X$ and $Y$) divided by the unconditional (marginal or pooled) variance of $Z$.
$$ R^2 = 1-\dfrac{ \text{var}(Z\vert X, Y) }{ \text{var}(Z) } $$
Since the fraction has the same units in the numerator and denominator, the units cancel and give a unitless $R^2$. The math of what I described and simulated in my code is then:
$$
\begin{aligned}
(1-R^2)\,\text{var}(Z) &= \left[ 1 - \left( 1 - \dfrac{\text{var}(Z\mid X, Y)}{\text{var}(Z)} \right) \right]\text{var}(Z) \\
&= \dfrac{\text{var}(Z\mid X, Y)}{\text{var}(Z)}\,\text{var}(Z) \\
&= \text{var}(Z\mid X, Y)
\end{aligned}
$$
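As a quick check of this identity, continuing the R code above:

# The two quantities agree (all.equal only allows for floating-point rounding).
# They match because OLS residuals from a model with an intercept have mean zero.
all.equal(var(z) * (1 - r2), var(resid(L)))  # TRUE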
REFERENCE
Agresti, Alan. Foundations of linear and generalized linear models. John Wiley & Sons, 2015.
EDIT
Some issues to consider:
What if the true relationship between $Z$ and $X$ or $Y$ (or both) is not linear, say $Z = X^2 + Y$, but you only fit the regression on $X$ and $Y$? You are getting at the variance of $Z$ that is or is not explained by $X$ and $Y$ as they enter the regression, yes, but it could be argued (I think strongly) that you aren't really accounting for the entire way in which $X$ explains the variance of $Z$. Dealing with this sort of situation is what leads to regression strategies like spline basis functions; a minimal sketch follows below.
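Here is that sketch (the data-generating step and the choice of df = 4 are illustrative assumptions, not part of the example above). Fitting $x$ through a natural cubic spline basis captures the curvature that a purely linear term misses, so less of the variance of $Z$ is left "unexplained."

library(splines)

set.seed(2023)
N <- 100
x <- rnorm(N)
y <- rnorm(N)
z <- x^2 + y + rnorm(N)                   # true relationship is nonlinear in x

fit_linear <- lm(z ~ x + y)               # linear term misses the curvature
fit_spline <- lm(z ~ ns(x, df = 4) + y)   # natural cubic spline basis for x

var(z) * (1 - summary(fit_linear)$r.squared)  # unexplained variance, linear fit
var(z) * (1 - summary(fit_spline)$r.squared)  # unexplained variance, spline fit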
If you want to draw inferences to something greater than your observed $Z$, such as a population, there's more to the story, even if fiddling with $R^2$ and the residual variance is the beginning of that story.