If I have three variables:
- $x$
- $y$
- $z$
...how would I go about calculating an "adjusted $z$" measure that has the variation in $z$ that is explained by $x$ and $y$ removed?
It depends on what exactly you mean by the variance of $Z$ after you account for the variance explained by $X$ and $Y$. However, a pretty reasonable interpretation of this relates to the $R^2$ of a linear regression. Use your $X$ and $Y$ to predict $Z$, as in the following model.
$$ \mathbb E[Z_i] = \beta_0 + \beta_1X_i + \beta_2 Y_i $$
Then use ordinary least squares to estimate the $\beta$ coefficients.
Next, calculate the $R^2$ of this model. Under these conditions, with a linear model estimated by ordinary least squares, $R^2$ is interpreted as the proportion of the variance of $Z$ that is explained by the regression, hence by your predictor variables $X$ and $Y$. This is covered in a standard regression book like Agresti (2015). Consequently, $\left(1-R^2\right)$ is the proportion of variance in $Z$ that is not explained by the regression. You have $Z$, so you can calculate $\text{var}(Z)$. Now multiply to get $\left(1-R^2\right)\text{var}(Z)$, the variance of $Z$ that remains unexplained.
This can be performed in software. Using the R code below, I get that the unexplained variance of $Z$ is $0.9792228$.
set.seed(2023)
N <- 100 # Sample size
x <- rnorm(N) # Variable X
y <- rnorm(N) # Variable Y
z <- x - y + rnorm(N) # Variable Z
L <- lm(z ~ x + y) # Fit the linear model, estimated via
# ordinary least squares
r2 <- summary(L)$r.squared # Calculate R-squared of the regression
var(z) * (1 - r2) # Variance of Z that is not explained by X and Y
However, this is equal to the variance of the residuals, calculated via var(resid(L)). This is because $R^2$ involves the variance of the residual term (the estimated variance of $Z$, conditional on $X$ and $Y$) divided by the unconditional (marginal or pooled) variance of $Z$.
$$ R^2 = 1-\dfrac{ \text{var}(Z\vert X, Y) }{ \text{var}(Z) } $$
Since the fraction has the same units in the numerator and denominator, the units cancel and give a unitless $R^2$. The math of what I described and simulated in my code is then:
$$
\begin{aligned}
(1-R^2)\,\text{var}(Z) &= \left[ 1 - \left( 1 - \dfrac{\text{var}(Z\mid X, Y)}{\text{var}(Z)} \right) \right]\text{var}(Z) \\
&= \dfrac{\text{var}(Z\mid X, Y)}{\text{var}(Z)}\,\text{var}(Z) \\
&= \text{var}(Z\mid X, Y)
\end{aligned}
$$
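As a quick check of this identity, continuing the R code above:

# The two quantities agree (all.equal only allows for floating-point rounding).
# They match because OLS residuals from a model with an intercept have mean zero.
all.equal(var(z) * (1 - r2), var(resid(L)))  # TRUE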
REFERENCE
Agresti, Alan. Foundations of linear and generalized linear models. John Wiley & Sons, 2015.
EDIT
Some issues to consider:
What if the true relationship between $Z$ and $X$ or $Y$ (or both) is not linear, say $Z = X^2 + Y$, but you only fit the regression on $X$ and $Y$? You are getting at the variance of $Z$ that is or is not explained by $X$ and $Y$ as they enter the regression, yes, but it could be argued (I think strongly) that you aren't really accounting for the entire way in which $X$ explains the variance of $Z$. Dealing with this sort of situation is what leads to regression strategies like spline basis functions; a minimal sketch follows below.
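Here is that sketch (the data-generating step and the choice of df = 4 are illustrative assumptions, not part of the example above). Fitting $x$ through a natural cubic spline basis captures the curvature that a purely linear term misses, so less of the variance of $Z$ is left "unexplained."

library(splines)

set.seed(2023)
N <- 100
x <- rnorm(N)
y <- rnorm(N)
z <- x^2 + y + rnorm(N)                   # true relationship is nonlinear in x

fit_linear <- lm(z ~ x + y)               # linear term misses the curvature
fit_spline <- lm(z ~ ns(x, df = 4) + y)   # natural cubic spline basis for x

var(z) * (1 - summary(fit_linear)$r.squared)  # unexplained variance, linear fit
var(z) * (1 - summary(fit_spline)$r.squared)  # unexplained variance, spline fit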
If you want to draw inferences to something greater than your observed $Z$, such as a population, there's more to the story, even if fiddling with $R^2$ and the residual variance is the beginning of that story.