
Let's say I have a linear regression: $$y \sim 1 + x_1 + x_2$$

where the range of $x_2$ is $[0,10]$. I fit this model using lm or rlm with regression weights in R. When I collect the residuals and plot them against $x_1$, I find that the residuals show a pattern with respect to the variable $x_1$. The $R^2$ of regressing the residuals onto $x_1$ is $20\%$. Is that possible? What could be the causes?
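Roughly, this is what I am doing in R (here `dat` and `w` are just placeholder names for my data frame and weight vector):

```r
# Fit the weighted regression y ~ 1 + x1 + x2 (could also be MASS::rlm)
fit <- lm(y ~ x1 + x2, data = dat, weights = w)

# Residuals vs. x1
res <- resid(fit)
plot(dat$x1, res, xlab = "x1", ylab = "residual")

# R^2 of regressing the residuals on x1
summary(lm(res ~ dat$x1))$r.squared
```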


After the same linear regression as above, suppose I take a smaller portion of the data, say all the observations with $x_2 < 6$. I then collect the residuals and $x_1$ values for this subset and plot the subsetted residuals against the subsetted $x_1$. The residuals still show a pattern with respect to the variable $x_1$, and the $R^2$ of regressing these subsetted residuals onto the subsetted $x_1$ is again around $20\%$.
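And the subset check, continuing the sketch above (same placeholder names):

```r
# Repeat the residual check on the subset with x2 < 6
keep <- dat$x2 < 6
plot(dat$x1[keep], res[keep], xlab = "x1 (subset)", ylab = "residual (subset)")
summary(lm(res[keep] ~ dat$x1[keep]))$r.squared
```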

(The two $20\%$ figures above are just examples; they are not related, and maybe there is a theory saying that one should definitely be larger than the other, etc.)

Is that possible? What could be the causes?


Edit: Let me try to describe the shape of the pattern.

Let's say the range of $x_1$ is $[0, 100]$.

  • At around $x_1=1$, the residuals are in a vertical band of $[-0.1, 0.1]$.

  • At around $x_1=10$, the residuals are in a vertical band of $[-1, 1]$.

  • ...

  • At around $x_1=100$, the residuals are in a vertical band of $[-10, 10]$.

I intentionally chose these numbers so you can see that the upper band and lower band grow roughly linearly as $x_1$ increases. I know this is heteroskedasticity. But since I am concerned about "bias", not inference, I guess I don't need to worry about the heteroskedasticity...
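Something like the following toy simulation (made-up numbers, not my real data) produces the kind of fan-shaped residual plot I am describing:

```r
# Toy simulation (made-up numbers): error spread grows linearly with x1
set.seed(1)
n  <- 500
x1 <- runif(n, 0, 100)
x2 <- runif(n, 0, 10)
y  <- 1 + 0.5 * x1 + 2 * x2 + rnorm(n, sd = 0.05 * x1)

fit <- lm(y ~ x1 + x2)
plot(x1, resid(fit), xlab = "x1", ylab = "residual")  # fan/horn shape
```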

  • What does the pattern look like? Usually a pattern in $x_i$ vs. $\varepsilon_i$ indicates that there are non-linear effects of $x_i$ not subsumed by the fitted model. Sometimes this can be remedied by inserting polynomial terms in $x_i$ or some other transformed version of $x_i$. Also, why did you use the term bias in the title? – Macro Jul 25 '12 at 21:14
  • @Macro I suppose that Luna called it bias because if there is truly a remaining systematic effect, like the polynomial term in $x_i$ that you posited, the model estimates of $y$ could be biased estimates of the "true" $y$, at least at some points in the $x$ space. – Michael R. Chernick Jul 25 '12 at 21:25
  • Note that my answer to your question here shows that the residuals must be uncorrelated with your predictor variables if you've fit a least squares regression (with the intercept). Therefore, the $R^2$ can't be $20\%$ - it must be $0$. There could be a non-linear relationship between $x_i$ and $\varepsilon_i$ though, so I'll wait to hear what that relationship looks like. – Macro Jul 25 '12 at 21:26
  • @Luna, it sounds like you're describing a "horn" shape, indicating heteroskedasticity. If this is a linear model, then I don't think the heteroskedasticity will bias your estimates but it will affect your inference (i.e. $p$-values and confidence intervals) so you'll want to take care of it if you plan to do any inference. You may want to consider generalized least squares (the gls function in R), which is a common remedy for heteroskedasticity. – Macro Jul 25 '12 at 21:48
  • @Luna, can you make a plot of the residuals vs. $x_1$ (or whichever exactly is the actual problem) & post it in your question? The 6th button from the left (that looks like a picture of a blue sky) when you're editing will open a wizard & let you upload a png file from your machine. That will help us understand the problem you're having. – gung - Reinstate Monica Jul 25 '12 at 22:59

1 Answer


What you've described are heteroscedastic errors. Regarding your question about bias:

Heteroscedasticity does not bias least squares estimators of regression coefficients

Suppose you have a response variable $Y_i$ and a $p$-length vector of predictors ${\bf X}_{i}$ such that

$$ Y_i = {\bf X}_i {\boldsymbol \beta} + \varepsilon_i $$

where ${\boldsymbol \beta} = \{ \beta_0, ..., \beta_p \}$ is the vector of regression coefficients and the errors $\varepsilon_i$ are such that $E(\varepsilon_i)=0$, with no restrictions on the variance except that it is finite for each $i$. Then the least squares estimator of ${\boldsymbol \beta}$ is

$$ \hat {\boldsymbol \beta} = ( {\bf X}^{{\rm T}} {\bf X} )^{-1} {\bf X}^{{\rm T}} {\bf Y} $$

Where $$ {\bf X} = \left( \begin{array}{c} {\bf X}_1 \\ {\bf X}_2 \\ \vdots \\ {\bf X}_n \\ \end{array} \right) $$

is a matrix whose rows are the predictor vectors for each individual (including $1$s for the intercept), and ${\bf Y}$ and ${\boldsymbol \varepsilon}$ are similarly defined as the vectors of response values and errors, respectively.
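As a quick numerical sanity check (just a sketch with simulated data and made-up names, not part of the argument), this matrix formula reproduces what lm computes:

```r
# Sanity check with simulated data: the matrix formula matches coef(lm(...))
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

X        <- cbind(1, x1, x2)               # rows are (1, x1_i, x2_i)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'Y
cbind(beta_hat, coef(lm(y ~ x1 + x2)))     # the two columns agree
```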

Regarding the expected value of $\hat {\boldsymbol \beta}$, it helps to replace ${\bf Y}$ with $({\bf X} {\boldsymbol \beta} + {\boldsymbol \varepsilon})$ to get that

$$ \hat {\boldsymbol \beta} = ( {\bf X}^{{\rm T}} {\bf X} )^{-1} {\bf X}^{{\rm T}} ({\bf X} {\boldsymbol \beta} + {\boldsymbol \varepsilon}) = \underbrace{( {\bf X}^{{\rm T}} {\bf X} )^{-1} {\bf X}^{{\rm T}} {\bf X} {\boldsymbol \beta}}_{= {\boldsymbol \beta}} + ( {\bf X}^{{\rm T}} {\bf X} )^{-1} {\bf X}^{{\rm T}} {\boldsymbol \varepsilon} $$

Therefore, $E(\hat {\boldsymbol \beta}) = {\boldsymbol \beta} + E \left( ( {\bf X}^{{\rm T}} {\bf X} )^{-1} {\bf X}^{{\rm T}} {\boldsymbol \varepsilon} \right ) $, so we just need the right-hand term to be ${\bf 0}$. We can derive this by conditioning on ${\bf X}$ and averaging over ${\bf X}$ using the law of total expectation:

\begin{align*} E \left( ( {\bf X}^{{\rm T}} {\bf X} )^{-1} {\bf X}^{{\rm T}} {\boldsymbol \varepsilon} \right) &= E_{ {\bf X} } \left( E \left( ( {\bf X}^{{\rm T}} {\bf X} )^{-1} {\bf X}^{{\rm T}} {\boldsymbol \varepsilon} \mid {\bf X} \right) \right) \\ &= E_{ {\bf X} } \left( ( {\bf X}^{{\rm T}} {\bf X} )^{-1} {\bf X}^{{\rm T}} E ( {\boldsymbol \varepsilon} \mid {\bf X} ) \right) \\ &= {\bf 0} \end{align*}

where the final line follows from the fact that $E( {\boldsymbol \varepsilon} | {\bf X} )=0$, the so-called strict exogeneity assumption of linear regression. Nothing here has relied on homoscedastic errors.
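Here is a small simulation sketch of this point (the data and coefficient values are made up for illustration): even with strongly heteroscedastic errors, the average of $\hat {\boldsymbol \beta}$ across many replications stays close to the true ${\boldsymbol \beta}$.

```r
# Simulate many data sets with heteroscedastic errors and average beta-hat
set.seed(1)
n    <- 200
beta <- c(1, 2, -3)                         # true coefficients
x1   <- runif(n, 0, 10)
x2   <- runif(n, 0, 10)
X    <- cbind(1, x1, x2)

est <- replicate(2000, {
  eps <- rnorm(n, sd = 0.5 * x1)            # error SD grows with x1
  y   <- drop(X %*% beta) + eps
  coef(lm(y ~ x1 + x2))
})
rowMeans(est)                               # close to (1, 2, -3): no bias
```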

Note: While heteroscedasticity does not bias the parameter estimates, useful results, including the Gauss-Markov theorem and the fact that the covariance matrix of $\hat {\boldsymbol \beta}$ is given by $\sigma^2 ({\bf X}^{\rm T} {\bf X})^{-1}$, do require homoscedasticity.
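To sketch what does go wrong (again with made-up simulated data): under heteroscedasticity, the naive $\sigma^2 ({\bf X}^{\rm T} {\bf X})^{-1}$ standard errors reported by lm can differ noticeably from a heteroscedasticity-consistent (HC0 sandwich) estimate computed by hand:

```r
# Heteroscedastic errors: naive vs. sandwich (HC0) standard errors
set.seed(1)
n  <- 500
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.5 * x1)

fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)
e   <- resid(fit)

XtX_inv  <- solve(t(X) %*% X)
V_naive  <- sum(e^2) / df.residual(fit) * XtX_inv             # sigma^2 (X'X)^{-1}
V_robust <- XtX_inv %*% t(X) %*% diag(e^2) %*% X %*% XtX_inv  # HC0 sandwich

cbind(naive = sqrt(diag(V_naive)), robust = sqrt(diag(V_robust)))
```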

  • Thanks Macro. But (1) I am not sure we are talking about the same thing. My question was about "The $R^2$ of regressing the residuals onto $x_1$ is $20\%$." (2) In my question, I had weights in the regression. – Luna Jul 27 '12 at 15:00
  • Hi @Luna, two things: (1) As I showed you in my answer here, and commented above, each predictor will have $0$ correlation (and therefore $R^2=0$) with the residuals. (2) The residual plot you described is a textbook description of heteroskedasticity, not correlation. If that's not the case, then perhaps you can include a plot, as I and others have asked for. Perhaps you can tell me what your real question is or edit your question appropriately. – Macro Jul 27 '12 at 15:14
  • Thanks Macro. (1) But with weights, the $R^2=0$ won't hold anymore, right? (2) I won't be able to show a plot; but it has both the heteroskedasticity and the linear slope, which leads to $R^2=20\%$... – Luna Jul 27 '12 at 19:05
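(A quick illustrative sketch of the point in these last comments, with made-up data and an arbitrary weight vector: with regression weights, the weighted residuals are orthogonal to the predictors, but the plain $R^2$ of regressing the raw residuals on $x_1$ is no longer forced to be exactly $0$.)

```r
# Placeholder simulation: weighted fit; weighted residuals are orthogonal to X,
# but the plain R^2 of residuals on x1 need not be exactly 0
set.seed(1)
n  <- 300
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.5 * x1)
w  <- 1 / (1 + x1)                          # arbitrary regression weights

fit <- lm(y ~ x1 + x2, weights = w)
e   <- resid(fit)

crossprod(cbind(1, x1, x2), w * e)          # weighted orthogonality: ~ 0
summary(lm(e ~ x1))$r.squared               # not forced to be exactly 0
```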