
In a probability and statistics class, we are often told that the residuals should be uncorrelated when fitting a regression model.

We are often told that the reason for this requirement is that correlated residuals usually suggest low-quality data.

But is there a mathematical reason why the residuals need to be uncorrelated? What happens if the residuals are in fact correlated?

Can someone please comment on this?

stats_noob
  • One reason: the usual hypothesis tests will be incorrect, since they assume uncorrelated errors. – user51547 Oct 31 '22 at 05:59
  • Thank you! But do you know why this is? Why do hypothesis tests assume uncorrelated errors? Is there some reason for this? – stats_noob Oct 31 '22 at 10:59
  • Thank you! I also like this link here https://math.stackexchange.com/questions/2957686/explain-about-the-correlation-of-error-terms-in-linear-regression-models. I wonder if this also addresses my problem? – stats_noob Nov 01 '22 at 14:22
  • Hi: I didn't read the links above but the negative thing about having correlated residuals is that their existence implies that the current model is inadequate, because there is explanatory power that can be sucked out of the residuals (since they are correlated). For example, a regressor might be getting left out, which is causing the residuals to be correlated. Uncorrelated residuals is essentially equivalent to saying that the explanatory power of the model cannot be improved upon. – mlofton Nov 09 '22 at 05:01
  • @mlofton Can you explain what you mean by "Uncorrelated residuals is essentially equivalent to saying that the explanatory power of the model cannot be improved upon."? This statement seems to say that if there are no "patterns" in the residuals, we've arrived at the best model that we can possibly formulate. Is this actually true? – dipetkov Nov 13 '22 at 14:51
  • What you said is exactly the case. The true model is $Y = \beta X + \epsilon$ where the $\epsilon$ are independent. So, if you have correlated (not independent) residuals, then the model that resulted is not the target (true) model. Therefore, something is wrong with the regressors as far as them being insufficient for the goal of obtaining the true model. – mlofton Nov 14 '22 at 16:20

4 Answers

5

The requirement of residuals being uncorrelated is typically a heuristic, which is based on the following chain of assumptions:

  • If the regression model itself can be assumed to be approximately linear (actually affine, due to the constant term), and the errors of deviation from this model are uncorrelated with zero mean and equal variance, then by the Gauss-Markov theorem the ordinary least squares (OLS) estimator produces an unbiased vector of regression coefficients whose covariance matrix is smallest, in the positive semi-definite ordering, within the class of linear unbiased estimators.

  • Since the model is assumed to be approximately linear (affine), the residuals computed from the estimated vector of regression coefficients are a linear transformation of the vector of error terms, so these residuals should also be small and nearly uncorrelated.

Please note that the residuals computed from an OLS estimate in a linear (affine) model will not be identically distributed, due to the effect of Leverage.
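To spell out the leverage effect: writing $H = X(X^\top X)^{-1}X^\top$ for the hat matrix, the OLS residual vector is $e = (I - H)\varepsilon$, so even when the error terms $\varepsilon$ are i.i.d. with variance $\sigma^2$,

$$\operatorname{Var}(e) = \sigma^2 (I - H), \qquad \operatorname{Var}(e_i) = \sigma^2 (1 - h_{ii}),$$

meaning the residuals have unequal variances (determined by their leverages $h_{ii}$) and are slightly correlated among themselves (covariance $-\sigma^2 h_{ij}$).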

Without the "nice" linearity and OLS assumptions mentioned above, i.e., for a general, possibly nonlinear, model and/or a nonlinear estimation procedure, the requirement that residuals be uncorrelated is typically a heuristic based on the notion of having extracted the most information out of the data. Uncorrelatedness is a heuristic proxy for this notion, not a rigorous mathematical criterion.

User1865345
Number
  • What do you mean with 'it is a heuristic'? – Sextus Empiricus Nov 13 '22 at 19:27
  • @SextusEmpiricus When the regression model is known to be linear, then obviously uncorrelatedness is not a heuristic but exact. But the OP did not explicitly state a linear model, so in the general case, the residuals from some regression fit will not necessarily be uncorrelated. Nonetheless, standard recommendations are to plot the residuals or assess their correlation (or lack of it), to determine if the fit is OK or not. Of course, a better approach is to fit any model in the Maximum Likelihood (or Bayesian) sense, but this cannot be done in situations with unknown data-generating process. – Number Nov 20 '22 at 15:13
  • I am not sure I follow this explanation. Why does it matter that the model is linear for the zero correlation to be a heuristic or not? And what do you mean by the term 'heuristic' or by the phrase 'it is a heuristic'? – Sextus Empiricus Nov 20 '22 at 19:39
5

One exercise I give to students is to take a small data set where there is an insignificant relationship between Y and X. Then, simply duplicate the data set so that there are twice as many observations. You will notice that the coefficient estimates and R-squared value are identical to the original, but the p-value is smaller. Repeating this process, you can make the p-value as small as you want.

Now, this is obviously silly statistical practice, but what is specifically violated here? The answer is the uncorrelated errors assumption. All the other assumptions (linearity, homoscedasticity, normality), if valid for the process that produced the original small sample, remain valid for the process that produced the duplicated sample.
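A quick sketch of this duplication exercise (the data here are made up for illustration; any small sample with a weak relationship works):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# A small sample with a weak relationship between x and y.
x = np.arange(10.0)
y = 0.1 * x + rng.standard_normal(10)

orig = stats.linregress(x, y)
# Duplicate every observation: same fit, but "twice the evidence".
dup = stats.linregress(np.tile(x, 2), np.tile(y, 2))

print(orig.slope, dup.slope)    # identical slopes
print(orig.rvalue, dup.rvalue)  # identical R
print(orig.pvalue, dup.pvalue)  # p-value shrinks
```

Each duplication leaves the means, slope, and R-squared unchanged but doubles the apparent sample size, so the t-statistic grows and the p-value shrinks, exactly as described above.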

The point is that the big bad wolf indeed has teeth: gross violations of the uncorrelated errors assumption can have enormous effects on the validity of the inferences (intervals as well as tests).

BigBendRegion
3

Uncorrelated or independent residuals/observations/errors* are often assumed (but not required) for statistical inference such as computing confidence intervals or p-values. If that assumption is wrong then the computed values are wrong (and methods that account for the correlation would have been better).

See for instance the following time series that are randomly generated.

  • First, 49 simulations of two autocorrelated series each, with $\operatorname{cov}(y_i,y_j) = 0.9^{|i-j|}$.

    Among these first 49 cases there are 20 cases with $p<0.05$ (shown in red text) and the p-values are underestimated (and significance is overestimated).

  • Second, 49 simulations of two series each, without auto-correlation.

    Here there is only one case with $p<0.05$ which is closer to the expected number.

[Figure: the 49 + 49 simulated series, with the cases where $p < 0.05$ shown in red]


* It is not the residuals, but the error terms or observations that are assumed to be independent. The residuals will be correlated. See for instance the illustration in Why are the residuals in $\mathbb{R}^{n-p}$?
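A minimal sketch of this kind of simulation (here regressing an autocorrelated series on time, where the true slope is zero; this is an illustration, not the exact code behind the figure):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
t = np.arange(n)
# AR(1)-style covariance: cov(y_i, y_j) = 0.9^|i-j|
Sigma = 0.9 ** np.abs(t[:, None] - t[None, :])
L = np.linalg.cholesky(Sigma)

def rejection_rate(correlated, sims=500):
    """Fraction of simulations where the slope test rejects at the
    5% level, even though the true slope is zero."""
    hits = 0
    for _ in range(sims):
        noise = rng.standard_normal(n)
        y = (L @ noise) if correlated else noise
        hits += stats.linregress(t, y).pvalue < 0.05
    return hits / sims

rate_autocorr = rejection_rate(True)   # far above the nominal 5%
rate_iid = rejection_rate(False)       # close to the nominal 5%
print(rate_autocorr, rate_iid)
```

With autocorrelated errors the nominal 5% test rejects far too often; with independent errors the rejection rate stays near 5%, matching the two panels described above.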

3

Correlated errors are not the big bad wolf --- they will not blow your house down

Strictly speaking, it is the error terms that are assumed to be uncorrelated in a standard regression model. The residuals are actually slightly correlated owing to their mutual dependence on the estimated parameters in the model. If the error terms in a regression analysis are correlated then it is possible to build this into the model, which is essentially what is done in a linear mixed model. Moreover, it is simple to test the assumption that the errors are uncorrelated using diagnostic tests; usually this would involve looking at correlation or auto-correlation estimates for the residuals.
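One such diagnostic can be sketched in a few lines, using the Durbin-Watson statistic on the residuals of a fitted trend (the data and parameter values here are made up for illustration):

```python
import numpy as np

def durbin_watson(resid):
    # Roughly 2 for uncorrelated residuals; toward 0 under positive
    # autocorrelation; toward 4 under negative autocorrelation.
    d = np.diff(resid)
    return (d @ d) / (resid @ resid)

rng = np.random.default_rng(1)
n = 200
t = np.arange(n)

# AR(1) errors with rho = 0.8 (positively autocorrelated).
e = np.zeros(n)
for i in range(1, n):
    e[i] = 0.8 * e[i - 1] + rng.standard_normal()

y = 1.0 + 0.5 * t + e
slope, intercept = np.polyfit(t, y, 1)
resid = y - (slope * t + intercept)

dw_corr = durbin_watson(resid)                  # well below 2
dw_iid = durbin_watson(rng.standard_normal(n))  # near 2
print(dw_corr, dw_iid)
```

For AR(1) errors the statistic is approximately $2(1-\rho)$, so strongly positively correlated errors push it well below 2 and are easy to spot.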

If you don't account for correlated errors in a regression model (i.e., if they are actually correlated but you fit a model that assumes they are uncorrelated) then your estimated variances for the coefficient estimators and residuals will systematically under- or over-estimate the true values. For example, if the error terms are positively correlated (which is the more common case) and you treat them as uncorrelated, you will tend to underestimate the true variance of the coefficient estimators in the model.
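The under-estimation can be seen directly from the OLS covariance formulas: under errors with covariance $\sigma^2\Sigma$, the true covariance of $\hat\beta$ is the "sandwich" $(X^\top X)^{-1} X^\top \Sigma X (X^\top X)^{-1}$, whereas the usual formula assumes $\Sigma = I$. A numerical sketch with AR(1)-type positive correlation (the design matrix and $\rho$ are illustrative choices):

```python
import numpy as np

n = 50
t = np.arange(n)
X = np.column_stack([np.ones(n), t])  # intercept + trend
rho = 0.8
# Var(eps) = Sigma, taking sigma^2 = 1
Sigma = rho ** np.abs(t[:, None] - t[None, :])

XtX_inv = np.linalg.inv(X.T @ X)
naive = XtX_inv                                 # assumes uncorrelated errors
sandwich = XtX_inv @ X.T @ Sigma @ X @ XtX_inv  # true covariance of beta-hat

# With positive autocorrelation the naive slope variance is far too small:
print(naive[1, 1], sandwich[1, 1])
```

Here the naive slope variance understates the true one by a large factor, which is exactly why the usual intervals and tests become too optimistic.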

Correlated error terms are certainly not the big bad wolf of statistics --- they will not blow your house down and there are much nastier problems that can be encountered. The main thing that keeps them in check is the fact that diagnostic tests in regression can generally estimate residual correlation well, so they can be used to check the plausibility of underlying assumptions about the correlation of the error terms. If you find positive correlation in your error terms you can model this using a linear mixed model, where the additional random effects term in the model induces positive correlation in the overall error term and thereby generalises to a case that allows such correlation.
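Related to that remedy: when the error covariance structure is known (or well estimated), generalized least squares builds the correlation into the fit directly, a simpler cousin of the mixed-model approach. A minimal NumPy sketch with a known AR(1) covariance (the true coefficients and $\rho$ are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
t = np.arange(n)
X = np.column_stack([np.ones(n), t])
beta_true = np.array([1.0, 0.5])

# Positively correlated AR(1)-type errors with known covariance Sigma.
rho = 0.8
Sigma = rho ** np.abs(t[:, None] - t[None, :])
y = X @ beta_true + np.linalg.cholesky(Sigma) @ rng.standard_normal(n)

# OLS ignores the correlation; GLS whitens with Sigma^{-1}.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)

# By the Aitken theorem, GLS has the smaller true slope variance:
var_ols_slope = (np.linalg.inv(X.T @ X) @ X.T @ Sigma @ X
                 @ np.linalg.inv(X.T @ X))[1, 1]
var_gls_slope = np.linalg.inv(X.T @ Si @ X)[1, 1]
print(var_ols_slope, var_gls_slope)
```

Both estimators remain unbiased here; the gain from modelling the correlation shows up in the (correctly computed) variance of the estimator, and hence in honest intervals and tests.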

Ben