
If we have X and Y that are mathematically dependent: X = Y + Z, is it 'forbidden' to use Y as a predictor of X in linear regression? I'm trying to find a concise explanation for why it is, or isn't.

One explanation I've found is that you shouldn't use simple linear regression in this case, but multiple linear regression, with Y and Z as predictors, but this doesn't seem right. The mathematical relationship is exact, and the multiple regression would return coefficients 0, 1, and 1 for X = 0 + 1Y + 1Z. So, there is something more fundamental I'm missing.

The purpose of linear regression is predicting the value of one variable from another. But we already know that X can be calculated as the sum of Y and Z, so effectively, we are calculating the regression between Y and Y + Z. Is this 'forbidden'?

But, if we can only measure Y, does the regression give us a tool to predict X? Isn't it better to find the regression between Y and Z? I'm confused.

asked by amc____
  • I'm not sure I understand what you mean by "forbidden". Why would linear regression forbid the usage of some variables? – gunes Jun 09 '22 at 14:21
  • Of course, you can do anything in mathematics, but maybe the interpretation is problematic, and that is why people don't do it - it is "forbidden". Like using linear regression when the relationship is very nonlinear; or when the variables are not homoscedastic. – amc____ Jun 09 '22 at 14:30
  • Regression models do their best to view the response variable (here, $X$) as a linear combination of other variables (such as $Y+Z$). You are, in effect, asking if it's forbidden to use linear regression in precisely the situation it's designed for! – whuber Jun 09 '22 at 14:38
  • In this situation it is given that X=0 + 1Y+1Z. For example, gross pay X is the sum of net pay Y and all deductions Z. The question is, Is it ever useful to find a relationship, using linear regression, between gross pay and net pay? Or is this a conceptual error? – amc____ Jun 09 '22 at 14:50
  • Are Y and Z independent? – Firebug Jun 09 '22 at 15:37
  • Y and Z are mathematically independent, but there could be a relationship in the data. – amc____ Jun 09 '22 at 15:49
  • If Z is zero-centered noise with an unknown value, it certainly makes sense to use Y as a predictor for X, in my opinion. – HelloGoodbye Jun 10 '22 at 13:24
  • @HelloGoodbye What if Z is not zero centered noise, but expected to be related to Y? – amc____ Jun 10 '22 at 15:38
  • @HelloGoodbye then ordinary linear regression will not give unbiased estimates. A very common example for this is running regressions to estimate AR(1) coefficients for time series: $X_t = \beta X_{t-1} + \epsilon_t$. – rubikscube09 Jun 10 '22 at 17:45
  • @amc____ If there is a relationship between Y and Z in the data, then what do you mean when you say that Y and Z are mathematically independent? That seems contradictory to me. – HelloGoodbye Jun 11 '22 at 16:15
  • @HelloGoodbye I mean that there isn't a formula relating Y and Z known before examining the data, like there is one relating X, Y and Z. – amc____ Jun 12 '22 at 06:09
  • @amc____ Aha, since statistics is part of mathematics, I thought you meant statistical dependence when you said mathematical dependence. But I understand what you mean. – HelloGoodbye Jun 14 '22 at 11:11
  • I would still argue that using Y as a predictor for X makes sense. If you know the joint probability distribution for X, Y and Z, i.e. $P(X, Y, Z)$, you can calculate the distribution of X conditioned on Y, i.e. $P(X|Y)$, and from there you can calculate all the statistics you need for X, like expected value, median, etc. And since $P(X|Y)$ depends on $Y$, it makes sense to use $Y$ as a predictor for $X$ (i.e., the expected value for X and other statistical measurements will depend on Y). – HelloGoodbye Jun 14 '22 at 11:16

3 Answers


If you know $X = Y + Z$ and you have $Y$ and $Z$ measured, why would you need to run a regression of $X$ on $Y$ and $Z$? It provides no additional information and does not allow you to make "better" predictions about $X$ (since you know $X$ exactly from $Y$ and $Z$).

But if you don't have $Z$, don't know the exact relationship between $Y$ and $X$, and want to predict $X$ from $Y$, then you absolutely can (and should!) regress $X$ on $Y$. The true marginal relationship between $X$ and $Y$ depends on the correlation between $Y$ and $Z$, which is not stated in the problem, so you won't automatically know what the marginal relationship between $X$ and $Y$ is from the formula for $X$ alone.

Indeed, this is the usual case in which we do regression: we assume a (possibly) deterministic relationship between the outcome and some other variables, but many of those variables are unmeasured, so their influence is captured in the error term. In the absence of measured $Z$, the effect of $Z$ will be captured in the error term, and the marginal effect of $Y$ will be estimated by the model.
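A minimal simulation sketch of this point (synthetic normal data; the distributions and correlations are assumed for illustration): regressing $X$ on $Y$ alone gives slope $1 + \operatorname{Cov}(Y,Z)/\operatorname{Var}(Y)$, so the marginal relationship moves with the unmeasured $Z$.

```python
# Sketch with made-up data: X = Y + Z exactly, but Z is "unmeasured",
# so we regress X on Y alone. The fitted slope is
#   Cov(Y, X) / Var(Y) = 1 + Cov(Y, Z) / Var(Y),
# which depends on how Y and Z happen to be related.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y = rng.normal(size=n)

for rho in (0.0, 0.5, -0.5):                      # assumed corr(Y, Z)
    z = rho * y + np.sqrt(1 - rho**2) * rng.normal(size=n)
    x = y + z                                     # exact relationship
    slope = np.cov(y, x)[0, 1] / np.var(y, ddof=1)  # OLS slope of X on Y
    print(f"corr(Y,Z) = {rho:+.1f}  ->  slope of X on Y ~ {slope:.2f}")
```

With unit variances the theoretical slope is $1 + \rho$: the formula $X = Y + Z$ alone does not pin it down.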

Noah
  • "But if you don't have Z and want to predict X from Y, then you absolutely can (and should!) regress X on Y." – But couldn't you just calculate Z as X - Y, instead of correlating X and Y? – amc____ Jun 09 '22 at 15:01
  • To clarify - the important information here is the correlation between Z and Y. If you have it, and you measure Y, you can find Z from the statistical relationship, and X as the sum of Y and Z. Alternatively, if you try first regressing Y on X, you might get a very high correlation, because you are effectively doing regression of Y on Y + Z - this is usually going to be very high - for independent Y and Z with equal variance, r will tend to $1/\sqrt{2} \approx 0.7$. – amc____ Jun 09 '22 at 15:11
  • What you're saying makes sense. I just mean that even if there is a deterministic relationship between {Y and Z} and X, if you don't know what that relationship is and you don't have Z, then a regression of X on Y is informative. If you know everything, you don't need to fit statistical models. That's the most confusing part about your question; the whole point of doing statistics is to estimate an unknown truth from sample data; in your case you have the data-generating parameters, so there is no reason to do statistics. It's not "forbidden", just useless. – Noah Jun 09 '22 at 16:17
  • Thanks, I appreciate the answers; it does seem like it would be useless if we know all the data. I'm having trouble with the subtle issue of correlating Y with the sum Y + Z. Would you say that correlating Y and Y + Z is a statistical or conceptual error (or both)? Would it not give spurious correlations, except for special values of Z? – amc____ Jun 09 '22 at 17:06
  • $\operatorname{Cov}(Y, Y + Z) = \operatorname{Var}(Y) + \operatorname{Cov}(Y, Z)$. The latter term can take almost any value, so there isn't much you can say about the correlation between Y and Y + Z except that it will be positive when Y and Z are positively correlated. – Noah Jun 09 '22 at 17:27
  • True, it can take any value. It seems to be dominated by the variance of Y - the larger the Var(Y), the larger the result. Are there some examples where this is practically useful? – amc____ Jun 09 '22 at 17:41
  • Where what is practically useful? It's not useful to think about purely deterministic relationships in statistics. Understanding the size of bias in the case of omitted variables is a major application and essentially the focus of the entire field of causal inference and a major part of econometrics. Look into omitted variable bias and unmeasured confounding to see this in action and the formulas used to quantify the bias. The covariance between the observed variable and omitted variable is a major component. – Noah Jun 09 '22 at 19:44
  • "you absolutely can (and should) regress X on Y" - they can, yes, but they should not necessarily, as this depends on what the aim of the practical analysis is and how results will be interpreted. Also, Z and its relation to X and Y may be such that model assumptions of the linear regression are critically violated. So proceed with caution! @amc____ – Christian Hennig Jun 10 '22 at 09:10
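The covariance identity quoted in the comments can be checked numerically. A quick sketch on simulated data, for the special case of independent $Y$ and $Z$ with equal variance (the case behind the "r tends to 0.7" remark, since then $\operatorname{corr}(Y, Y+Z) = 1/\sqrt{2}$):

```python
# Sketch: verify Cov(Y, Y+Z) = Var(Y) + Cov(Y, Z) on simulated data,
# and check corr(Y, Y+Z) for independent Y, Z with equal variance,
# where it should be 1/sqrt(2) ~ 0.707.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
y = rng.normal(size=n)
z = rng.normal(size=n)                 # independent of y, same variance

lhs = np.cov(y, y + z)[0, 1]
rhs = np.var(y, ddof=1) + np.cov(y, z)[0, 1]
r = np.corrcoef(y, y + z)[0, 1]

print(f"Cov(Y, Y+Z) = {lhs:.4f},  Var(Y) + Cov(Y, Z) = {rhs:.4f}")
print(f"corr(Y, Y+Z) = {r:.3f}  (1/sqrt(2) = {1 / np.sqrt(2):.3f})")
```

The identity holds exactly for the sample statistics (it is just bilinearity of covariance), while the correlation depends on the variances of both variables.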

Linear regression is a tool that is used to achieve a goal, so any answer will depend on the goal to be achieved. As @Noah's answer already said, if you already know $X$, $Y$, and $Z$, I can't see any worthwhile goal that this regression achieves. Why would you want imprecise predictions of $X$ if you can have precise ones?

If however you don't know $Z$, linear regression may work well for predicting $X$ from $Y$. There is nothing that formally forbids you from trying that out (in fact if $Z$ is independent of $Y$ and distributed according to standard regression model assumptions, the standard regression model is just fulfilled). But then, depending on the exact nature of the data (particularly the distribution of $Z$ and how it is related to $Y$), it may not work that well, and/or other techniques may work better. This can be explored for example using bootstrap or cross-validation.
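The "try it out" advice could look like the following minimal holdout sketch (my construction, not from the answer; the way $Z$ depends on $Y$ here is an invented assumption purely for illustration):

```python
# Sketch: fit X ~ Y on a training half and measure prediction error
# on a holdout half, with Z unobserved at prediction time.
# The dependence of Z on Y below is an invented assumption.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
y = rng.normal(size=n)
z = 0.3 * y + 0.5 * rng.normal(size=n)   # hidden, mildly related to y
x = y + z

train, test = slice(0, n // 2), slice(n // 2, n)
slope, intercept = np.polyfit(y[train], x[train], 1)
pred = slope * y[test] + intercept
rmse = np.sqrt(np.mean((x[test] - pred) ** 2))
print(f"fitted slope ~ {slope:.2f}, holdout RMSE ~ {rmse:.2f}")
```

Here the $Y$-only model absorbs the part of $Z$ that tracks $Y$ (slope near 1.3 rather than 1), and the holdout error reflects only the part of $Z$ it cannot see.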


Two people step on a scale. The scale outputs only the total weight of the couple.
If you know the weight of the first person ($Y$) can you guess what the scale will output ($X$) without knowing anything about the weight of the second person ($Z$)?

Intuitively, there has to be some connection. If $Y$ is very heavy, it is likely that the total weight $X$ will also be high.
Regressing $X$ on $Y$ here is just trying to put a number to that connection by looking at a bunch of real-life examples of total weight $X$ given first-person weight $Y$.
It is not forbidden, as you say. It is simply trying to make the best guess given the information you have.

Of course, if you know the weight of both people $Y$ and $Z$ and you know how scales work, you don't need observations or regression.
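A toy simulation of the scale story (all weights below are invented numbers): when the second person's weight is independent of the first, the regression recovers roughly "first person's weight plus the average second-person weight", which is the best guess without knowing $Z$.

```python
# Sketch: two people on a scale. We see the first person's weight (y)
# and the total (x); the second person's weight (z) stays hidden.
# All distributions below are invented for illustration.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000
y = rng.normal(75, 12, size=n)           # first person's weight, kg
z = rng.normal(70, 10, size=n)           # second person's weight, unseen
x = y + z                                # what the scale shows

slope, intercept = np.polyfit(y, x, 1)   # fit X ~ Y by least squares
print(f"best guess: X ~ {slope:.2f} * Y + {intercept:.1f}")
```

The slope comes out near 1 and the intercept near the mean hidden weight, so the fitted line is approximately $X \approx Y + \bar{Z}$.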

CarrKnight