
The four assumptions for bivariate regression are:
    • (L)inearity
    • (I)ndependent observations
    • (N)ormal errors
    • (E)qual variance
And for multiple regression we add a fifth assumption:
    • no multicollinearity between predictors.

From a purely mathematical perspective, true multicollinearity makes the problem unsolvable in the sense that you cannot obtain a unique solution. But to be clear, this means there is a perfect linear relationship between two or more predictors (or, more generally, some subset of the predictors lies in a lower-dimensional subspace). This, of course, is bad.
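To make the "no unique solution" point concrete, here is a minimal sketch in Python (numpy, with made-up data; the variable names are purely illustrative) showing that a perfectly collinear column makes the design matrix rank deficient, so the normal equations have no unique solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = 2 * x1 + 3                            # x2 is an exact linear function of x1
X = np.column_stack([np.ones(n), x1, x2])  # intercept + two perfectly collinear predictors
y = 1 + 0.5 * x1 + rng.normal(size=n)

# The cross-product matrix X'X is singular, so (X'X)^(-1) does not exist and the
# normal equations X'X b = X'y have infinitely many solutions.
print(np.linalg.matrix_rank(X))      # 2, not 3: the design matrix is rank deficient
print(np.linalg.det(X.T @ X))        # ~0 (up to floating-point noise)
# np.linalg.solve(X.T @ X, X.T @ y)  # raises LinAlgError or returns meaningless numbers
```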

But, in practice, the assessment of this assumption results in the removal of nearly multicollinear predictors. To clarify, I understand one pragmatic reason for this: even if your predictor variables are not perfectly correlated, if they are very highly correlated, the resulting design matrix may be computationally singular (i.e., not truly singular, but so ill-conditioned that it cannot be reliably inverted within machine-precision limits). This, too, is bad.
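As a rough illustration of that numerical point (again with simulated data and a deliberately extreme correlation), the design matrix below is full rank in principle, but X'X is so ill-conditioned that coefficient estimates become numerically unstable:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-6, size=n)   # nearly, but not perfectly, collinear with x1
X = np.column_stack([np.ones(n), x1, x2])

# Full rank in exact arithmetic, but the condition number of X'X is enormous, so small
# perturbations in y (or mere rounding error) can swing the estimated coefficients wildly.
print(np.linalg.matrix_rank(X))    # 3: technically full rank
print(np.linalg.cond(X.T @ X))     # extremely large condition number
```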

Unfortunately, it seems to me that many textbook authors (particularly in the social sciences) have morphed the fifth assumption into "no NEARLY multicollinear predictors" (e.g., remove predictors if the VIF is 10 or more, even though the predictor variables may not be perfectly multicollinear).
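For reference, the rule of thumb is usually operationalized roughly as below; this is a sketch of the standard VIF calculation on simulated data (the cutoff of 10 and the data are illustrative, not an endorsement of the practice):

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, no intercept column).

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on the remaining columns (plus an intercept).
    """
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1.0 - resid.var() / X[:, j].var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=200)  # highly, but not perfectly, correlated with x1
x3 = rng.normal(size=200)
print(vif(np.column_stack([x1, x2, x3])))  # x1 and x2 land near the conventional cutoff of 10
```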

My question is this: is this "revised" assumption a valid assumption, or is it a hopeful imposition for convenience's sake?

Note: This question is motivated by pedagogical concerns in introductory-level statistics with an applied focus (e.g., an applied statistics course for the social sciences).

Gregg H
  • Who adds that fifth assumption? It's unnecessary and doesn't appear in many good textbooks on multiple regression. – whuber Mar 19 '22 at 23:58
  • The revised assumption is nonsense. An assumption of no perfect multicollinearity is not. – Richard Hardy Mar 20 '22 at 08:00
  • A more correct assumption is that the design matrix is full rank. It's possible to have a non-full-rank design matrix without perfect multicollinearity (e.g., including all levels of a factor as dummies when an intercept is present). Also, normality of errors is not required for unbiasedness of the OLS coefficients; it's only required for the t-statistics of the coefficients computed using the usual standard error to have a t-distribution under the null. – Noah Mar 20 '22 at 08:20
  • That is not a standard assumption, but if it bothers you, do a PCA and take all of the principal components. Then you retain all of the information in all of the original variables, but you have uncorrelated features. While I am sympathetic to the idea that we might be able to capture almost as much information in the predictors by dropping one that is correlated with others, using fewer parameters and reducing the risk of overfitting, that seems not to pan out in practice, as I discuss here and link to additional reading. – Dave Mar 21 '22 at 00:50
  • Some similar posts: https://stats.stackexchange.com/questions/220214/is-multicolinearity-problem-ignorable-under-this-situation, https://stats.stackexchange.com/questions/289424/correlation-and-multicolinearity, – kjetil b halvorsen Mar 21 '22 at 01:25
  • The suggestion that this is not a "standard" assumption or that this does not appear in "good" textbooks does not align with my experience or my education. With regards to MR models, Lomax & Hahs-Vaughn (3rd) state "this assumption is known as collinearity where there is a very strong linear relationship between two or more of the predictors. The presence of severe collinearity is problematic in several respects." (p 675) The point here is that they have relaxed the assumption from perfect collinearity to extreme (near) collinearity. – Gregg H Mar 21 '22 at 12:46
  • Here is another textbook reference: Cohen, Cohen, West & Aiken (3rd). §10.5 & §10.5.1 address this issue, and the exact assumption is essentially replaced with the form I presented in my question. – Gregg H Mar 21 '22 at 12:57
  • Nobody is saying "severe collinearity" is not "problematic:" the issues with this are well known and well discussed in the literature. I think just about everyone would deny that this circumstance is an assumption of OLS regression (or indeed of almost any regression procedure). Indeed, in the better textbooks full rank is not even an assumption: like multicollinearity, rank deficiency is something to be dealt with rather than assumed away. Ponder which of these strategies (solve a problem or assume it doesn't exist) is generally more useful. – whuber Mar 21 '22 at 15:31

0 Answers