8

I have a linear OLS regression $Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + e_i$ where $e_i \sim N(0,\sigma^2)$. I have estimated the regression from the data and obtained estimates for my coefficients as well as the corresponding covariance matrix. Assume my dataset is large ($n > 500$).

Now I would like to construct the 95% confidence interval for $Z = a_1 \beta_1 + a_2 \beta_2$. My first thought was that I could assume that $Z$ follows a normal distribution, because my regression coefficients are approximately normal and a linear combination of normal variables is also normal. I did find some sources that treat $Z$ as a normal random variable.

But now I am in doubt, because I found that a linear combination of normal random variables is only normal if they are independent. So is it still safe to assume $Z$ is random?

mdewey
    Your coefficients are constants, so I wonder what you mean by saying that they "are approximately normal". At any rate, your coefficients are a function of your data (viz. your $Y$ values), which are random. Thus, they are random, & functions of them ($Z$) are random in turn. I think you meant to ask if it is safe to assume they are independent. This will turn out to be true if $X_1$ & $X_2$ are independent. However, a linear combination of normal random variables is still normal if the variables are not independent; you just have to use a more complicated formula. – gung - Reinstate Monica Jan 13 '15 at 22:50
  • I actually meant the sampling distribution of the estimates. I want to construct the 95% CI of $Z = \beta_1 + \beta_2$.

    My approach was to take the expected value of $Z$ as the sum of the estimates, and its variance from the formula for the variance of a sum of correlated variables: $\operatorname{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \sum_{j=1}^n \operatorname{Cov}(X_i, X_j) = \sum_{i=1}^n \operatorname{Var}(X_i) + 2\sum_{1\le i<j\le n}\operatorname{Cov}(X_i,X_j)$ (see the sketch after these comments).

    In order to construct a CI, I need to know if the distribution of $Z$ is normal.

    – user58571 Jan 13 '15 at 23:03
  • Related: http://stats.stackexchange.com/questions/16724/how-to-find-a-confidence-interval-for-a-contrast – Andrew M Jan 13 '15 at 23:13
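
In matrix form, the variance formula in the comment above is just the quadratic form $a' \Sigma a$, with $\Sigma$ the covariance matrix of the coefficient estimates. A minimal numpy sketch, with made-up estimates and covariance values purely for illustration:

```python
import numpy as np

# Hypothetical coefficient estimates and their covariance matrix
# (made-up numbers, standing in for actual regression output)
beta_hat = np.array([2.1, -0.9])          # (beta1_hat, beta2_hat)
Sigma = np.array([[0.040, -0.012],
                  [-0.012, 0.090]])       # Cov of (beta1_hat, beta2_hat)

a = np.array([1.0, 1.0])                  # Z = beta1 + beta2
z_hat = a @ beta_hat
var_z = a @ Sigma @ a                     # = Var(b1) + Var(b2) + 2 Cov(b1, b2)
print(z_hat, var_z)                       # 1.2  0.106
```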

2 Answers

14

It would be better to explain this in matrix notation. Consider the general Gauss-Markov linear model $$\mathbf{y} = \mathbf{X} \boldsymbol \beta + \boldsymbol \epsilon.$$ For your case, $\mathbf{X} = (\mathbf{1}, \mathbf{x}_1, \mathbf{x}_2)$ and $\boldsymbol \beta = (\alpha, \beta_1, \beta_2)'$. The OLS estimator is $$\hat{\boldsymbol \beta} = \left(\mathbf{X}'\mathbf{X} \right)^{-1}\mathbf{X}'\mathbf{y}.$$ Since $\mathbf{y} \sim N(\mathbf{X \boldsymbol \beta}, \sigma^2 \mathbf{I})$, it follows that $$\hat{\boldsymbol \beta} \ \sim \ N \left(\boldsymbol \beta, \sigma^2(\mathbf{X}'\mathbf{X})^{-1} \right).$$

Based on this result, you can derive the distribution of $Z = a_1 \beta_1 + a_2 \beta_2$ by writing $Z = \boldsymbol \lambda' \boldsymbol \beta$, where $\boldsymbol \lambda' = (0, a_1, a_2)$. Denoting correspondingly $$\hat{Z} = a_1 \hat{\beta}_1 + a_2 \hat{\beta}_2 = \boldsymbol \lambda' \hat{\boldsymbol \beta},$$ you easily get $$\frac{\hat{Z} - Z}{\sigma\sqrt{\boldsymbol \lambda' (\mathbf{X}' \mathbf{X})^{-1} \boldsymbol \lambda}} \ \sim \ N(0, 1).$$

Note that $\sigma$ is unknown, so you cannot construct the confidence interval directly from the normal distribution. The rigorous approach is to base the interval on a $t$ statistic (I omit the derivation and just give the result; you can find the details in any linear models book). Specifically, $$\frac{\hat{Z} - Z}{\hat{\sigma} \sqrt{\boldsymbol \lambda' (\mathbf{X}' \mathbf{X})^{-1} \boldsymbol \lambda}} \ \sim \ t(n-p),$$ where $\hat{\sigma}^2 = \frac{\mathbf{y}'(\mathbf{I - P_X}) \mathbf{y}}{n-p}$, $\mathbf{P_X} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is the projection matrix onto the column space of $\mathbf{X}$, and $p = \operatorname{rank}(\mathbf{X})$. The 95% confidence interval is therefore $$\hat{Z} \pm t_{0.975}(n-p)\, \hat{\sigma} \sqrt{\boldsymbol \lambda' (\mathbf{X}' \mathbf{X})^{-1} \boldsymbol \lambda}.$$
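
For concreteness, here is a minimal numpy/statsmodels sketch of this construction on simulated data; the data-generating values and the contrast $\boldsymbol \lambda' = (0, 1, 1)$ are made up for illustration:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)            # deliberately correlated regressors
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

lam = np.array([0.0, 1.0, 1.0])               # lambda' = (0, a1, a2) with a1 = a2 = 1
z_hat = lam @ fit.params
se_z = np.sqrt(lam @ fit.cov_params() @ lam)  # hat-sigma^2 * lambda'(X'X)^{-1} lambda inside the sqrt
t_crit = stats.t.ppf(0.975, df=fit.df_resid)
print(z_hat - t_crit * se_z, z_hat + t_crit * se_z)

# statsmodels can also do this in one step via a t-test on the contrast:
print(fit.t_test(lam))
```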

SixSigma
5

It's actually not true that your variables need to be independent for their sum to be normal.

If $X$ and $Y$ are jointly normally distributed with means $\mu_{1}$ and $\mu_{2}$, variances $\sigma_{1}^{2}$ and $\sigma_{2}^{2}$, and correlation $\rho$, then $Z = X + Y$ is still normally distributed, with mean $\mu_{1}+\mu_{2}$ and variance $\sigma_{1}^{2} + \sigma_{2}^{2} + 2 \rho \sigma_{1} \sigma_{2}$. Hopefully that gives you what you need.
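
As a quick sanity check, a minimal numpy simulation; the means, variances, and correlation are made-up values:

```python
import numpy as np

rng = np.random.default_rng(1)
mu1, mu2 = 1.0, -0.5                      # made-up parameters for illustration
s1, s2, rho = 2.0, 1.5, 0.6

cov = np.array([[s1**2,     rho*s1*s2],
                [rho*s1*s2, s2**2    ]])
xy = rng.multivariate_normal([mu1, mu2], cov, size=200_000)
z = xy.sum(axis=1)

print(z.mean(), mu1 + mu2)                      # both ~ 0.5
print(z.var(), s1**2 + s2**2 + 2*rho*s1*s2)     # both ~ 9.85
```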

Nick Thieme
  • This is not true. Let X be N(0,1) and Y be exactly -X. Clearly Y is also N(0,1), but X+Y = 0 almost surely, which is NOT normal. – Christopher Aden Jan 13 '15 at 22:51
  • Sorry, I should have included that they should be jointly normal. – Nick Thieme Jan 13 '15 at 22:51
  • Given joint normality, a sufficient condition for every non-trivial linear combination to be (non-degenerate) normal is that the covariance matrix is positive definite. That rules out my counterexample and supports your claim. – Christopher Aden Jan 13 '15 at 22:54
  • OK, but then the question remains whether one can assume that the regression coefficient estimates of an OLS regression are jointly normally distributed.

    If so, any pointers to a source that derives/explains this would be great.

    – user58571 Jan 13 '15 at 23:09
  • @ChristopherAden A necessary condition of OLS is that the covariance matrix is positive definite; if this were not true, you could not even estimate the coefficients. Thus, under the assumptions of OLS where the errors are i.i.d. normal, any linear combination of the estimated coefficients is also normal, with mean and variance calculated as suggested above. – Zachary Blumenfeld Jan 13 '15 at 23:10
  • @user58571 You'll be guaranteed the conditions for the above in OLS – Nick Thieme Jan 13 '15 at 23:13
  • @Zachary Thanks for the answer.

    Should I now mark the answer of Nick Thieme as accepted? Or is there a way to mark this comment as accepted?

    – user58571 Jan 13 '15 at 23:14
  • @user58571, if your residuals (more technically your errors) are normally distributed, then the sampling distributions of your coefficients will be normally distributed. The formula Nick provides will be fine if they aren't independent (note the correlation $\rho$ in the formula). The only thing you have to worry about is perfect multicollinearity, but in that case you wouldn't have gotten any regression output (e.g., coefficient estimates). – gung - Reinstate Monica Jan 13 '15 at 23:22
  • The problem is that if you want to construct a hypothesis test for a linear combination of coefficient estimates, you use an F-distribution as opposed to a t. This has to do with the asymptotic properties of the covariance matrix of the $\beta$ estimates. Btw, it can be shown through the delta method that even non-linear combinations of the estimates are asymptotically normally distributed (see the sketch after this thread). I will post an answer later if I have time. – Zachary Blumenfeld Jan 13 '15 at 23:28
  • @Zachary I might be misunderstanding your statement. Are you saying that to conduct a hypothesis test for a single linear combination of regression coefficients we should use a F statistic? That should only be true if we want tests on sets of linear combinations and/or joint CI's on different parameter estimates. – Nick Thieme Jan 13 '15 at 23:58
  • @NickThieme You're right that for a single linear combination of variables a t and an F test would be equivalent. Sorry if that was confusing. – Zachary Blumenfeld Jan 14 '15 at 00:30
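
Following up on the delta-method remark in the comments above, here is a minimal sketch for a non-linear combination; the choice $g(\boldsymbol \beta) = \beta_1 \beta_2$ and all data-generating values are made up for illustration:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

b = fit.params                       # (alpha_hat, beta1_hat, beta2_hat)
V = fit.cov_params()                 # estimated covariance of the estimates

# g(beta) = beta1 * beta2; gradient (0, beta2, beta1), evaluated at the estimates
g_hat = b[1] * b[2]
grad = np.array([0.0, b[2], b[1]])
se_g = np.sqrt(grad @ V @ grad)      # delta-method standard error

z975 = stats.norm.ppf(0.975)         # asymptotic result, so a normal quantile
print(g_hat - z975 * se_g, g_hat + z975 * se_g)
```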