I'm experimenting with some multiple linear regression models in R. After fitting a regression, I use vif() to check for multicollinearity among my predictors. For the model with country fixed effects (factor(countryname)), vif() returns extremely high values for some of the predictors. Why does this happen?
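The pattern described above can be reproduced with a minimal sketch on simulated data. Everything here is an assumption for illustration: the data are made up, the variables `population` and `gdp` are hypothetical, and `vif_manual()` is a helper written for this example, not a function from the car package. It uses the textbook definition VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others; for a single-coefficient term this should match what car's vif() reports.

```r
# Simulated panel: 20 hypothetical countries observed over 10 years.
set.seed(1)
n_country <- 20
n_year <- 10
d <- data.frame(
  countryname = rep(paste0("c", seq_len(n_country)), each = n_year),
  year = rep(seq_len(n_year), times = n_country)
)

# Population differs a lot between countries but changes little within one.
base_pop <- rep(rnorm(n_country, mean = 50, sd = 20), each = n_year)
d$population <- base_pop + 0.2 * d$year + rnorm(nrow(d), sd = 0.5)
d$gdp <- rnorm(nrow(d))

# VIF of a predictor: 1 / (1 - R^2) from regressing it on the other predictors.
vif_manual <- function(formula, data) {
  r2 <- summary(lm(formula, data = data))$r.squared
  1 / (1 - r2)
}

# Without country dummies, population is barely related to gdp.
vif_no_fe <- vif_manual(population ~ gdp, d)

# With country dummies, the dummies explain almost all of population.
vif_fe <- vif_manual(population ~ gdp + factor(countryname), d)

c(without_fe = vif_no_fe, with_fe = vif_fe)
```

In this sketch the country dummies absorb nearly all of the variation in population, so only the small within-country variation is left to identify its coefficient, which is what a large VIF signals.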

Thomas Bilach
Ken Lee
  • I don’t disagree with your choice, but shouldn’t we see the country effects? Unless you omitted them. – Thomas Bilach May 24 '21 at 18:27
  • @ThomasBilach At the very bottom, you can see the following: factor(countryname) 101363392.344484 58 1.172239. Factor(countryname) is fixed effects for countries. This is what I get when I use vif() for a model with fixed effects. – Ken Lee May 24 '21 at 18:39
  • So you're using vif() from the car package. The answer here should help. – Thomas Bilach May 24 '21 at 20:39
  • Your code executes ordinary least squares regression. What makes it an "MLR model"? – whuber May 25 '21 at 11:28
  • @whuber MLR stands for multiple linear regression. My model is MLR in that it includes more than one independent variable. And yes, the method of estimation is OLS, as you've mentioned. – Ken Lee May 25 '21 at 12:45
  • Can you give more details about the variables you're looking at? Just from the names, I'm assuming that these are variables describing various aspects of the countries (e.g., their annual GDP, population size, etc.). If that's the case, then including indicators for each country is going to increase multicollinearity, because each country already has data that is unique to it entered in the model. – Billy May 25 '21 at 12:51
  • Yes, you have a correct impression about these variables, @Billy. That's what I thought as well. But this makes me wonder: other scholars have included fixed effects for countries, even though their models also include variables like GDP, population, etc. Does this mean that their analyses were flawed because there was multicollinearity between their predictors? I should still look into the post suggested by Thomas; perhaps the answer is there. – Ken Lee May 25 '21 at 13:01
  • Every field has its standards and routine practices. That's not a defense of practice as usual, but I've found that there is often some methodological lineage within fields that arose from a specific need but then crept into more and more unhelpful applications. I can't speak for your field, but I know that in psychology it can be hard to find people who actually check the assumptions of regression. We've been taught it's "robust" and seemingly are happy to abuse it. Your description makes me wonder whether mixed models are better with countries as a random factor. – Billy May 25 '21 at 13:07
  • @KenLee In regard to your other concern, it isn't "flawed" to adjust for covariates. So long as GDP and/or population size vary over time, it is permissible to include them. – Thomas Bilach May 25 '21 at 19:23
  • @ThomasBilach Even if this results in high multicollinearity as shown in my post? For example, population takes the value of 50 for GVIF^(1/(2*Df)). – Ken Lee May 25 '21 at 19:57
  • @ThomasBilach, I'm not sure that I agree with the requirement that those variables vary over time unless the model is longitudinal in nature. Cross-sectionally, the worst-case scenario is that including countries results in complete redundancy, since each country corresponds to a unique vector of predictor values. The model doesn't know that the variables are changing over time unless it's specified to know that. If this is a longitudinal model, then I think the case for multilevel modeling is even stronger, since you'd likely want to account for the non-independence of your observations. – Billy May 25 '21 at 20:11
  • @Billy Yes. I was assuming the OP is working with panel or repeated cross-sectional data. – Thomas Bilach May 25 '21 at 21:02
  • @ThomasBilach, @Billy. Indeed, I am working with panel data. Sorry, but I am getting a little lost in this discussion. Thomas, do you think that I can use country fixed effects together with my population variable (which indeed varies over time) in my model? My worry is that vif() shows extremely high values for population when I include it, as you can see in my example. Hence, I'm afraid that I can't include population due to multicollinearity, even though I've seen papers in political science/economics that include both a population variable and country fixed effects. – Ken Lee May 25 '21 at 21:08
  • Whether multicollinearity is an issue comes down to what you're analyzing. Multicollinearity primarily affects the coefficient in question, so if you need to interpret a variable with a high VIF, then multicollinearity is definitely an issue. It seems to me that country is a reasonable random effect, and that is the intended reason for including it in the regression in the first place. A multilevel model also addresses the dependency in your data. – Billy May 25 '21 at 23:00
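As a small arithmetic check on the figures quoted in the comments, the sketch below recomputes the adjusted value GVIF^(1/(2*Df)) for factor(countryname) from the reported GVIF and Df. It also squares the value of 50 reported for population to put it on the ordinary VIF scale, under the assumption (not confirmed in the thread) that population enters the model as a single coefficient with Df = 1.

```r
# Values quoted in the thread for factor(countryname).
gvif_country <- 101363392.344484
df_country <- 58

# Adjusted value reported by vif() for multi-df terms: GVIF^(1 / (2 * Df)).
adj_country <- gvif_country^(1 / (2 * df_country))
adj_country  # about 1.17, matching the third number quoted in the thread

# Squaring the adjusted value puts it on the usual VIF scale.
adj_country^2

# Population: adjusted value of 50 quoted in the thread.
# Assumption: population has Df = 1, so its ordinary VIF is 50^2.
adj_population <- 50
vif_population <- adj_population^2
vif_population
```

Read this way, the huge raw GVIF for the country factor mostly reflects its 58 degrees of freedom, while the value for population is the one that signals strong overlap with the country indicators.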