Problem
This is more of a theoretical question than a practical one...
Let's say I have three random variables, $Y$, $X_1$, and $X_2$, with the following properties:
- Cor($Y$,$X_1$) > 0
- Cor($Y$,$X_2$) = 0
- Cor($X_1$, $X_2$) > 0
You run two regressions:
$$ Y = \alpha X_1 + \epsilon $$
and:
$$ Y = \beta_1 X_1 + \beta_2 X_2 + \epsilon $$
Which is larger: $\alpha$ or $\beta_1$?
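Before diving in, the setup can be simulated to see which way the comparison goes empirically. This is only a sketch with my own hypothetical construction (the latent variable `z`, the noise terms, and the sample size are all assumptions, chosen just to satisfy the three correlation conditions):

```python
import numpy as np

# Hypothetical construction (my own, not from the problem statement):
# z drives Y, X1 = z + X2 mixes the signal with X2, and X2 carries no
# signal about Y. In population: Cor(Y,X1) > 0, Cor(Y,X2) = 0, Cor(X1,X2) > 0.
rng = np.random.default_rng(0)
n = 50_000
z = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x1 = z + x2
y = z + rng.standard_normal(n)

# Simple regression Y ~ X1 (no intercept; everything is mean-zero)
alpha = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)

# Multiple regression Y ~ X1 + X2 via least squares
beta1, beta2 = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)[0]

print(alpha, beta1)  # for this construction: alpha ≈ 0.5, beta1 ≈ 1.0
```

For this particular construction $\beta_1 > \alpha$, though a single simulated example of course doesn't settle the general question.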
Approach 1: Dumb One
My gut instinct says the following. The formula for the coefficient in a simple OLS regression of $Y$ on $X_1$, $\alpha$, is:
$$\tag{1} \alpha = \frac{Cov(X_1,Y)}{Var(X_1)}$$
We now have a multiple regression beta when we include $X_2$:
$$\tag{2} \hat{\beta}_{X_1+X_2} = \frac{Cov(X_1+X_2,Y)}{Var(X_1+X_2)}$$
Then the numerator doesn't change, since $Cov(X_2,Y) = 0$: $Cov(X_1+X_2,Y) = Cov(X_1,Y) + Cov(X_2,Y) = Cov(X_1, Y)$
The denominator changes because the variance now becomes: $Var(X_1) + Var(X_2) + 2 Cov(X_1,X_2)$
Because $Cov(X_1, X_2) > 0$, we get this inequality:
$$Var(X_1) + Var(X_2) + 2 Cov(X_1,X_2) > Var(X_1)$$
Intuitively, the denominator grows while the numerator stays the same, which would suggest $\beta_1 < \alpha$.
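Equation (2) can be sanity-checked numerically by comparing that ratio against the coefficient an actual multiple regression produces. The data-generating construction below is my own assumption, chosen only to satisfy the three correlation conditions:

```python
import numpy as np

# Does Cov(X1+X2, Y) / Var(X1+X2) equal the multiple-regression
# coefficient on X1? (Hypothetical data construction, my own.)
rng = np.random.default_rng(1)
n = 50_000
z = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x1 = z + x2                      # Cor(X1,X2) > 0, Cor(Y,X1) > 0
y = z + rng.standard_normal(n)   # Cor(Y,X2) = 0 in population

ratio = np.cov(x1 + x2, y)[0, 1] / np.var(x1 + x2, ddof=1)
beta1 = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)[0][0]

print(ratio, beta1)  # the two do not agree: Equation (2) is not beta_1
```

The two quantities come out very different, which hints that the coefficient on a sum of regressors is not the same object as the multiple-regression coefficient on $X_1$.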
Approach 2: Google It.
I thought I'd check my answer and found a formula for $\beta_i$, along the lines of Equation (1), that takes into account the correlation between the regressors. It's from this website (written once in their notation, and again in mine):
$$ \tag{3} \beta_1 = \frac{r_{1y} - r_{2y}r_{12}}{1 - r_{12}^2} $$
$$ \beta_1 = \frac{Cor(X_1,Y) - Cor(X_2,Y) \, Cor(X_1,X_2)}{1-Cor(X_1,X_2)^2} $$
Their notation is cleaner, so I'll stick with it: $Cov(X_1,Y) = s_{1y}$, $Cor(X_1,Y)=r_{1y}$, $SD(X) = s_x$, and $Var(X) = s_x^2$.
Now I rewrite Equation (1) in terms of correlations, in their notation (with $s_x = SD(X_1)$ here):
$$ \tag{1b} \alpha = \frac{s_{1y}}{s_x^2} = r_{1y} \frac{s_y}{s_x}$$
So comparing Equation (3) with Equation (1b), and using $r_{2y} = 0$ from the setup, we have:
$$ \alpha > \beta_1 $$
$$ r_{1y} \frac{s_y}{s_x} > \frac{r_{1y}}{1 - r_{12}^2} $$
if and only if (dividing through by $r_{1y} > 0$):
$$ \frac{s_y}{s_x} > \frac{1}{1-r_{12}^2} $$
$$ \tag{4} 1 - r_{12}^2 > \frac{s_x}{s_y} $$
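Equation (3) itself can be verified numerically. One caveat worth keeping in mind: as usually stated, that formula gives the coefficient when $Y$, $X_1$, and $X_2$ are all standardized. A sketch with my own hypothetical data:

```python
import numpy as np

# Check Equation (3): with Y, X1, X2 standardized, the coefficient on X1
# should equal (r1y - r2y*r12) / (1 - r12^2). (Hypothetical data, my own.)
rng = np.random.default_rng(2)
n = 50_000
z = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x1 = z + x2
y = z + rng.standard_normal(n)

def standardize(v):
    return (v - v.mean()) / v.std(ddof=1)

ys, x1s, x2s = standardize(y), standardize(x1), standardize(x2)

r1y = np.corrcoef(x1, y)[0, 1]
r2y = np.corrcoef(x2, y)[0, 1]
r12 = np.corrcoef(x1, x2)[0, 1]

formula = (r1y - r2y * r12) / (1 - r12 ** 2)
beta1_std = np.linalg.lstsq(np.column_stack([x1s, x2s]), ys, rcond=None)[0][0]

print(formula, beta1_std)  # agree up to floating-point error
```

The agreement is exact (not just approximate), because Equation (3) is an algebraic identity for the OLS solution on standardized variables, whatever the sample correlations happen to be.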
Questions
1 - Are Approach 2 and Equation (4) correct? Is the answer really "it depends"? I kinda thought it was cut and dried and the answer would be one or the other.
2 - I'm not sure Equation (2) makes sense... what does the coefficient on a sum of two random variables even mean? Am I writing that correctly?
3 - Most importantly: how do I derive something like Equation (3)? I don't want to resort to Google every time I'm stuck on a problem like this (this is for personal practice).
Edit: Approach 3 - Frisch-Waugh-Lovell Theorem
Can't I also use the FWL theorem to do:
$$ resid_{Y \sim X_2} = Y - \alpha_2 X_2 = Y - \frac{Cov(X_2,Y)}{Var(X_2)} X_2 = Y - 0 \cdot X_2 = Y $$
And then run the regression of:
$$ resid_{Y \sim X_2} \sim X_1 \quad = \quad Y \sim X_1 $$
...which just means that $\beta_1 = \alpha$?
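For reference, the FWL theorem in its full form residualizes both $Y$ *and* $X_1$ on $X_2$. A numerical sketch of the identity it actually states (data construction is my own assumption, chosen to satisfy the three correlation conditions):

```python
import numpy as np

# Frisch-Waugh-Lovell check: the coefficient on X1 in the full regression
# equals the coefficient from regressing (Y residualized on X2) on
# (X1 residualized on X2). (Hypothetical data, my own construction.)
rng = np.random.default_rng(3)
n = 50_000
z = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x1 = z + x2
y = z + rng.standard_normal(n)

def resid(v, w):
    """Residual of a no-intercept regression of v on w."""
    return v - (v @ w / (w @ w)) * w

y_perp = resid(y, x2)    # close to y here, since Cor(Y, X2) = 0
x1_perp = resid(x1, x2)  # NOT close to x1: X1 is correlated with X2

beta1_full = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)[0][0]
beta1_fwl = (y_perp @ x1_perp) / (x1_perp @ x1_perp)

print(beta1_full, beta1_fwl)  # identical up to floating-point error
```

Note that only the $Y$ residualization step trivializes when $Cor(Y,X_2)=0$; the $X_1$ residualization does not, since $Cor(X_1,X_2)>0$.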
I'm so lost now. I feel like I've answered the question three different ways, and they all sound reasonable enough...