
I have a dataset with a high degree of multicollinearity: all of the variables correlate positively with one another and with the dependent variable. However, in some of the models I run I get a couple of significant negative coefficients. Essentially, there are two coefficients whose signs I can flip depending on which variables I include in the model.

My understanding is that if the variance-covariance matrix contains only positive values, then all of the coefficients should also be positive. Is this correct?

1 Answer


Because the question appears to ask about data whereas the comments talk about random variables, a data-based answer seems worth presenting.

Let's generate a small dataset. (Later, you can change this to a huge dataset if you wish, just to confirm that the phenomena shown below do not depend on the size of the dataset.) To get going, let one independent variable $x_1$ be a simple sequence $1,2,\ldots,n$. To obtain another independent variable $x_2$ with strong positive correlation, just perturb the values of $x_1$ up and down a little. Here, I alternately subtract and add $1$. It helps to rescale $x_2$, so let's just halve it. Finally, let's see what happens when we create a dependent variable $y$ that is a perfect linear combination of $x_1$ and $x_2$ (without error) but with one positive and one negative sign.

The following commands in R make examples like this using $n$ data:

n <- 6                  # (Later, try (say) n=10000 to see what happens.)
x1 <- 1:n               # E.g., 1   2 3   4 5   6
x2 <- (x1 + c(-1,1))/2  # E.g., 0 3/2 1 5/2 2 7/2
y <- x1 - x2            # E.g., 1 1/2 2 3/2 3 5/2
data <- cbind(x1,x2,y)

Here's a picture:

[Figure: a scatterplot matrix of $x_1$, $x_2$, and $y$; the lower panels are decorated with linear fits and vertical residual segments.]
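
The exact code behind the picture isn't reproduced here, but a rough approximation of it can be drawn with pairs (a sketch only; the panel function below is my own, not the original):

panel.fit <- function(x, y, ...) {
  points(x, y, ...)
  fit <- lm(y ~ x)
  abline(fit)                     # the linear fit
  segments(x, fitted(fit), x, y)  # vertical segments showing the residuals
}
pairs(data, lower.panel = panel.fit)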

First notice the strong, consistent positive correlations among the variables: in each panel, the points trend from lower left to upper right.

Correlations, however, are not regression coefficients. A good way to understand the multiple regression of $y$ on $x_1$ and $x_2$ is first to regress both $y$ and $x_2$ (separately) on $x_1$ (to remove the effects of $x_1$ from both $y$ and $x_2$) and then to regress the $y$ residuals on the $x_2$ residuals: the slope in that univariate regression will be the $x_2$ coefficient in the multivariate regression of $y$ on $x_1$ and $x_2$.

The lower triangle of this scatterplot matrix has been decorated with linear fits (the diagonal lines) and their residuals (the vertical line segments). Take a close look at the left column of plots, depicting the residuals of regressions against $x_1$. Scanning from left to right, notice how each time the upper panel ($x_2$ vs $x_1$) shows a negative residual, the lower panel ($y$ vs $x_1$) shows a positive residual: these residuals are negatively correlated.
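
That recipe is easy to check numerically (a small sketch, not part of the original answer): remove the effect of $x_1$ from both $y$ and $x_2$, look at the correlation of the residuals, and verify that the residual-on-residual slope matches the multiple-regression coefficient.

e.y  <- residuals(lm(y ~ x1))   # y with the effect of x1 removed
e.x2 <- residuals(lm(x2 ~ x1))  # x2 with the effect of x1 removed
cor(e.y, e.x2)                  # negative (here exactly -1, because y = x1 - x2 with no error)
coef(lm(e.y ~ e.x2))[2]         # reproduces the x2 coefficient in lm(y ~ x1 + x2), namely -1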

That's the key insight: multiple regression peels away relationships that may otherwise be hidden by mutual associations among the independent variables.

For the doubtful, we can confirm the graphical analysis with calculations. First, the covariance matrix (scaled to simplify the presentation):

> cov(data) * 40
    x1 x2  y
x1 140 82 58
x2  82 59 23
y   58 23 35
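
Correlations tell the same story; a quick supplementary check (not shown in the original answer):

> round(cor(data), 2)
     x1   x2    y
x1 1.00 0.90 0.83
x2 0.90 1.00 0.51
y  0.83 0.51 1.00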

The positive entries confirm the impression of positive correlation in the scatterplot matrix. Now, the multivariate regression:

> summary(lm(y ~ x1+x2))
...
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -7.252e-16  2.571e-16 -2.821e+00   0.0667 .  
x1           1.000e+00  1.476e-16  6.776e+15   <2e-16 ***
x2          -1.000e+00  2.273e-16 -4.399e+15   <2e-16 ***

One slope is +1 and the other is -1. Both are significant.

(Of course the slopes are significant: $y$ is a linear function of $x_1$ and $x_2$ with no error. For a more realistic example, just add a little bit of random error to $y$. Provided the error is small, it can change neither the signs of the covariances nor the signs of the regression coefficients, nor can it make them "insignificant.")
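
For example, a sketch of that check (the seed and noise level are arbitrary choices of mine, not from the original answer):

set.seed(17)                       # arbitrary seed for reproducibility
y.noisy <- y + rnorm(n, sd = 0.1)  # add a little random error to y
summary(lm(y.noisy ~ x1 + x2))     # slopes stay near +1 and -1 and remain significant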

whuber
  • (+1) (Inside joke: Well that's one way to handle a flag...) :) – cardinal Jul 13 '12 at 14:50
  • In this example, it seems like multiple regression reveals the 'true' relationship between y and x2 (as y = x1 - x2), if I understand correctly. Could a sign-flip of a coefficient be misleading in another instance with two highly collinear covariates? – Blain Waan Feb 25 '20 at 06:17
  • @Blain Interesting question. My sense is that if there aren't numerical issues in the solution (which can be checked), then the result will not be misleading. The collinearity causes the standard errors of estimate to be higher than you might expect, but all that will do is require you to use more data to achieve a desirable level of confidence in the coefficient estimates. Another way to put it is if the results are statistically significant, they won't be any more or less misleading than in any other circumstance. – whuber Feb 25 '20 at 15:30
  • There is a previous post where it said: "if your variables are positively correlated, then the coefficients will be negatively correlated, which can lead to a wrong sign on one of the coefficients" and that ridge regression can help get the correct signs. I have also seen that if I choose a $\lambda$ high enough, such sign flips reverse and align with what would be found in the bivariate regressions. I'm confused about whether we should do that at all. The link to the post I am talking about: https://stats.stackexchange.com/questions/1580/regression-coefficients-that-flip-sign-after-including-other-predictors – Blain Waan Feb 25 '20 at 17:12
  • "Can lead" is merely a possibility. Assuming the model is close to correct, then when the coefficients are precisely estimated and significantly different from zero, it's unlikely they have the wrong signs, regardless of collinearity. The claims in that particular answer are vague, general, and not specifically supported with examples or analysis, so you will need to be cautious about applying them. – whuber Feb 25 '20 at 17:38
  • Does this also apply to ordinal regressions? – Ian.T Nov 10 '22 at 16:10
  • @Ian.T To what are you referring by "this"? – whuber Nov 10 '22 at 17:26
  • @whuber Can the same explanation, "multiple regression peels away relationships that may otherwise be hidden by mutual associations among the independent variables," be applied to ordinal regressions? – Ian.T Nov 14 '22 at 23:35
  • @Ian.T Yes, that's the case with any linear multiple regression, which includes GLMs and standard ordinal regression models. However, the analysis of the estimates in terms of a sequence of univariate regressions no longer applies: you have to carry out the multiple regression all at once (often using a numerical search for a maximum likelihood solution or an approximation thereof). – whuber Nov 15 '22 at 14:16