
If we do linear regression, we have the following decomposition of the sum of squares. I call $y_i, x_i, \hat{y}_i$ the observed values, the predictors, and the linear predictions respectively. The residuals are then $e_i = y_i - \hat{y}_i$. We also indicate with an overline the mean over all samples. Then:

$$\sum_i (y_i-\overline{y})^2=\sum_i \left(\hat{y}_i-\overline{\hat{y}}\right)^2+\sum_i (e_i-\overline{e})^2 \tag{1}$$

There are a few other properties as well ($\overline{y}=\overline{\hat{y}}$ and $\overline{e}=0$).
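To make (1) concrete, here is a minimal NumPy sketch (with made-up data and arbitrary coefficients, purely for illustration) that checks the decomposition and the two side properties numerically:

```python
# Minimal numerical check of decomposition (1) on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
e = y - y_hat

tss = np.sum((y - y.mean()) ** 2)          # total sum of squares
ess = np.sum((y_hat - y_hat.mean()) ** 2)  # explained sum of squares
rss = np.sum((e - e.mean()) ** 2)          # residual sum of squares

print(np.isclose(tss, ess + rss))                                     # True
print(np.isclose(y.mean(), y_hat.mean()), np.isclose(e.mean(), 0.0))  # True True
```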

A similar formula comes from the theory of conditional expectation. Given two r.v. $X,Y$:

$$Var(Y)=Var(E[Y|X])+E[Var(Y|X)] \tag{2}$$
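For what it's worth, (2) can also be checked by simulation; the sketch below assumes a toy hierarchical model (chosen only for illustration) where both conditional moments are known in closed form:

```python
# Monte Carlo check of (2) for X ~ N(0, 1), Y | X ~ N(2X, 1),
# so E[Y|X] = 2X, Var(Y|X) = 1, and hence Var(Y) = 4 + 1 = 5.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
X = rng.normal(size=n)
Y = 2.0 * X + rng.normal(size=n)

var_y = Y.var()                  # left-hand side of (2)
var_cond_mean = (2.0 * X).var()  # Var(E[Y|X]) = Var(2X)
mean_cond_var = 1.0              # E[Var(Y|X)] = 1 by construction

print(var_y, var_cond_mean + mean_cond_var)  # both approximately 5
```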

I see some similarities between these two formulas if we consider the $\hat{y}_i$ as related to $E[Y|X]$.

My question is this: can the first formula be derived from the second? How deep is their connection (if indeed there is one)?


1 Answer


Regression models in general model conditional expectations. What they estimate is not an arbitrary expectation, but a conditional expectation of a specific form

$$ E[y|X] = f(X) $$

where $f$ is the regression function, which can take different forms: a linear function as in linear regression, or many other functional forms, including non-linear ones as in non-linear regression, regression forests, neural networks, etc.
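For instance, here is a minimal sketch (with assumed toy data, not taken from the question) of two different choices of $f$ estimating the same conditional expectation $E[y|X]$: a linear function and a crude nonparametric one built from binned means.

```python
# Two choices of the regression function f targeting the same E[y|X].
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)  # true E[y|X=x] = sin(x)

# f as a linear function (linear regression with intercept)
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
f_linear = X @ beta

# f as a piecewise-constant function (binned conditional means)
bins = np.linspace(-3, 3, 31)
idx = np.digitize(x, bins)
bin_means = {i: y[idx == i].mean() for i in np.unique(idx)}
f_binned = np.array([bin_means[i] for i in idx])
```

Both `f_linear` and `f_binned` are estimates of the same conditional expectation; only the assumed functional form of $f$ differs.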

It's unclear what kind of special relationship with the law of total variance you are looking to find. The law applies to any conditional variances, so there is nothing special about linear regression here. Additionally, variance is defined in terms of squared deviations, while the mean and linear regression (like many other models) both minimize squared error, so the similarities come from the use of squared deviations.

  • Thanks. One common derivation of regression models is to start from the relation $y_i=f_{\theta}(x_i)+\epsilon_i$, where the $\epsilon_i$ are i.i.d. with zero mean (as you say, according to the choice of $f$ we have linear models, NNs, etc.). From this point of view $E[Y_i|X_i]=f_{\theta}(X_i)$ is formally an expectation under the modelling distributional assumptions. – Thomas Oct 16 '22 at 18:25
  • But, apart from this discussion, which could be just formal, I still think that there could be a deep relation between [1] and [2], even if I understand that there are some "jumps": in [2] we have two r.v.s and in [1] we have some sample estimators resulting from some modelling assumptions... but the interpretation of the formulas looks so similar to me that I thought there must be something in common behind them... – Thomas Oct 16 '22 at 18:41
  • @Thomas if regression models the conditional expectation, it would be strange if they had nothing in common. – Tim Oct 16 '22 at 19:01
  • Hi Tim. Referring to your update. As I wrote in the original question, I am asking if the first formula can be derived from the second, applied to some common setup. Of course I understand this is not straightforward. – Thomas Oct 17 '22 at 08:53
  • @Thomas $Var(Y) = \tfrac{1}{n} \sum_i (y_i - \bar y)^2$, so there is no reason why they wouldn't be the same... – Tim Oct 17 '22 at 08:55
  • Sorry Tim, but I do not see what you mean. – Thomas Oct 17 '22 at 09:00
  • @Thomas then I'm afraid I either don't understand your question or can't explain it better. The second equation is the law of total variance, which is applicable to any conditional variances, so also to the conditional variance of (any) predictions given the data; it is not tied to linear regression in any sense. – Tim Oct 17 '22 at 09:06