
I need to run several tests in R to make sure that the basic linear model assumptions hold for this time series OLS model. I'm very new to this, so I'm unsure in some cases of how to test for these, in others how to correct for them. Here's what I have so far:

  • Normality of residuals can be tested with a normal probability plot and a histogram with the normal curve superimposed; the correction, I believe, is removing outliers from the data.

  • Independence (lack of autocorrelation) can be tested for with a Durbin-Watson test or by examining a sample ACF plot. The correction is Newey-West or robust standard errors.

  • Heteroskedasticity can be tested for with a scatterplot of residuals (y-axis) against fitted values (x-axis), on which I should be able to observe changing variance. The correction, I believe, is again robust standard errors (i.e., the above correction should fix this too, right?).

  • Endogeneity: I'm not sure how to test for this, and the only correction I know of is using instrumental variables and two-stage least squares. Is there another way to correct for this?

  • Multicollinearity: VIF test, with a cut-off point somewhere between 5 and 10 (I've been advised to use 10). The correction is removing variables with high VIFs and rerunning until the remaining VIFs are satisfactory.

  • Stationarity: unit-root test. I'm still struggling to implement this in R, but I am aware of the "urca" package and am attempting to test it out. The correction is differencing the series and rerunning the test, most likely throwing out variables that aren't stationary after a second difference.
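For what it's worth, here is a sketch of the diagnostics above in R. The simulated data frame is invented and stands in for the real series; the formal package tests (`lmtest::dwtest()`, `lmtest::bptest()`, `car::vif()`, `urca::ur.df()`) are noted in comments rather than executed, so the block runs in base R alone:

```r
# Sketch of the diagnostics above on simulated (invented) data.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)                   # mildly collinear with x1
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
e   <- resid(fit)

# Normality: normal probability (QQ) plot, histogram with normal curve
qqnorm(e); qqline(e)
hist(e, freq = FALSE)
curve(dnorm(x, mean(e), sd(e)), add = TRUE)

# Autocorrelation: sample ACF, and the Durbin-Watson statistic by hand
# (lmtest::dwtest(fit) gives the test with a p-value)
acf(e)
dw <- sum(diff(e)^2) / sum(e^2)             # near 2 => little AR(1) structure

# Heteroskedasticity: residuals vs fitted values
# (lmtest::bptest(fit) is the formal Breusch-Pagan test)
plot(fitted(fit), e)

# Multicollinearity: VIF by hand, 1 / (1 - R^2 of x_j on the other regressors)
# (car::vif(fit) computes these directly)
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)

# Stationarity: urca::ur.df(y, type = "drift", selectlags = "AIC") runs an
# augmented Dickey-Fuller unit-root test (not executed here).
```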

My questions, then, are:

  1. How would I test for endogeneity, hopefully in R, and how could I fix it without using 2-stage least squares?

  2. Is there another way to correct for heteroskedasticity, or will robust standard errors take care of it? To that point, is there another way to test for this other than a scatterplot?

  3. Are there any errors in what I've written above?

Brad G.

1 Answer


Some general remarks:

  1. Model assumptions are never perfectly fulfilled in reality. The idea that one can "make sure that model assumptions hold" is wrong. They will never hold, and nothing can make them hold.

  2. The fact that model assumptions are not fulfilled doesn't mean that model-based methods cannot or should not be used. But unfortunately, some violations of model assumptions may mislead the conclusions from the analysis. The relevant issue is not whether assumptions are fulfilled or not (they never are), but rather if violations are of a kind that will mislead conclusions.

  3. Formal tests are not always a good way to decide whether assumption violations are problematic, because they may reject even in cases where the violation is harmless.

  4. A further problem with checking the model assumptions based on the data is that technically all (standard) theory assumes that analyses are carried out on the data as they are, and that no data-dependent decisions or changes have been made beforehand. If you are willing to change your data or analyses conditionally on a misspecification test, this in itself violates the model assumptions. (In most but unfortunately not all cases this is pretty harmless and not such a big problem after having accepted that model assumptions are never fulfilled anyway; however, it can occasionally cause trouble. See https://arxiv.org/abs/1908.02218 for a literature review and more thoughts on this.)

More specifically (but far from completely):

  1. Removal of outliers from data is only a good idea if there is good reason to believe that the data are actually erroneous. Being branded as "outlier" by some kind of diagnostic procedure is not a good reason. Your data are information, and if you throw away information, you will harm your analysis even if model assumptions look better. Note that the model has no authority whatsoever telling you that your data are wrong. Better use methods like robust regression, models that allow for heavier tails of error distributions, or even clustering if there are bigger groups of outliers.
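As a hedged illustration of the "robust regression" alternative (the data and the two planted outliers are invented; MASS is a recommended package shipped with R):

```r
# Sketch: robust regression downweights suspicious points instead of
# deleting them, so no information is thrown away.
library(MASS)
set.seed(1)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50)
y[c(10, 40)] <- y[c(10, 40)] + 25           # two gross outliers
fit_ols <- lm(y ~ x)                        # pulled toward the outliers
fit_rob <- rlm(y ~ x)                       # Huber M-estimation by default
fit_rob$w[c(10, 40)]                        # tiny weights; points kept, not deleted
```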

  2. Multicollinearity is a problem if the $X^tX$ matrix is not invertible, but not a violation of model assumptions as long as it is. It is true that strong multicollinearity may cause instabilities even if $X^tX$ is invertible, and dimension reduction techniques may help, particularly in very high-dimensional problems, but my experience is that many users get far more concerned about this than they'd need to be. Quite a bit of multicollinearity can be tolerated, and variable selection or dimension reduction is for sure not mandatory and often not good (once more throwing away information and biasing the resulting analysis).
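One way to see why quite a bit of multicollinearity can be tolerated (a sketch on invented data): a coefficient's standard error inflates only by the square root of its VIF, so even a fairly strong correlation between regressors costs less precision than one might fear.

```r
# Sketch: with population corr(x1, x2) = 0.8, the VIF is 1/(1 - 0.8^2) ~ 2.8,
# so the slope's standard error grows only by sqrt(2.8) ~ 1.7 relative to
# orthogonal predictors -- often perfectly tolerable.
set.seed(7)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + 0.6 * rnorm(n)             # population corr(x1, x2) = 0.8
y  <- 1 + x1 + x2 + rnorm(n)
fit  <- lm(y ~ x1 + x2)
vif1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
```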

  3. Heteroscedasticity is in my view also often overrated as a problem. As long as it's mild, it doesn't do much harm. With large datasets, homoscedasticity null hypotheses are rejected too easily even when there isn't much trouble. If you use robust techniques anyway, they will help at least a bit with heteroscedasticity. If it is very strong, a transformation may help, but for the reasons stated above I advise against being overeager to manipulate the data to "fulfill" model assumptions, as this will not work anyway.
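For reference, the heteroskedasticity-consistent (White/HC0) standard errors mentioned in the question can be sketched by hand in base R; in practice `sandwich::vcovHC()` together with `lmtest::coeftest()` is the usual route (the simulated data below are invented):

```r
# Sketch: White/HC0 robust standard errors computed by hand.
set.seed(3)
n <- 300
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.2 + x)     # error variance grows with x
fit <- lm(y ~ x)
X   <- model.matrix(fit)
e   <- resid(fit)
bread    <- solve(crossprod(X))             # (X'X)^{-1}
meat     <- crossprod(X * e)                # X' diag(e^2) X
se_hc0   <- sqrt(diag(bread %*% meat %*% bread))
se_class <- sqrt(diag(vcov(fit)))           # classical SEs, for comparison
```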

  4. Issues with dependence and stationarity can be serious, and I have no objections to the diagnostics and tests as listed; however, it is usually wiser to use a more complex model that captures the issues you found, rather than throwing away data (at least if there is otherwise good reason, given your research objective, for the data/variables to be there in the first place). The same may be a good idea for some other issues, such as strong heteroscedasticity.
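A hedged sketch of "use a more complex model" for autocorrelated errors: base R's `stats::arima()` fits a regression with AR errors directly, so the dependence is modelled rather than the data discarded (the data below are simulated for illustration):

```r
# Sketch: regression with AR(1) errors via stats::arima(), instead of
# throwing out an autocorrelated series.
set.seed(5)
n  <- 200
x  <- rnorm(n)
u  <- as.numeric(arima.sim(list(ar = 0.7), n))   # AR(1) disturbances
y  <- 1 + 2 * x + u
fit_ar <- arima(y, order = c(1, 0, 0), xreg = cbind(x = x))
coef(fit_ar)                                # ar1, intercept, and slope on x
```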

  5. Think hard about the meaning of the data and how they were collected, and if from this knowledge there are reasons to doubt some of your assumptions. Particularly reasons for non-assumed kinds of dependence can often be better nailed down from knowledge of the situation rather than just looking at the data.

  • Big +1: nice discussion. One might consider adding points about checking for missing variables and diagnosing departures from linearity. I would like to suggest that your discussion of transformations, while excellent, overlooks the main reason to examine heteroscedasticity: it is an opportunity to identify characteristics of the data that will reveal substantial improvements in the model. A spread-vs-level plot of residuals, for instance, can indicate a Box-Cox transformation that--although it introduces one more parameter--can sometimes achieve a far better description of the data. – whuber Jun 29 '22 at 16:17
  • @whuber I agree in principle but sometimes struggle to edit my answers to accommodate points that I didn't have on my radar when writing originally. Can you make your own answer out of this (or maybe edit into mine if you think that's better - which I doubt)? – Christian Hennig Jun 29 '22 at 16:27
  • "the model has no authority whatsoever telling you that your data are wrong". I think that the information flows the other way and the data is (perhaps) telling you that the model is wrong. – dipetkov Jun 30 '22 at 12:11
  • I'm not insisting on including anything; I just wanted to suggest additional points for readers to be aware of. And I'm sure the list is still incomplete... . – whuber Jun 30 '22 at 12:46