
I want to use R to estimate a fixed effects model using different estimation approaches. Note that I am using an unbalanced panel.

The easiest way to do this is to use the function lm().

Example:

# load packages and create data
library(dplyr)
set.seed(123)
x <- rnorm(4000)
x2 <- rnorm(length(x))
id <- factor(sample(500,length(x),replace=TRUE))
firm <- data.frame(id = id) %>%
  group_by(id) %>%
  mutate(firm = 1:n()) %>%
  pull(firm)
id.eff <- rlnorm(nlevels(id))
firm.eff <- rexp(length(unique(firm)))
y <- x + 0.25*x2 + id.eff[id] + firm.eff[firm] + rnorm(length(x))
db = data.frame(y = y, x = x, id = id, firm = firm)
# identify ids that appear only once and drop them
rm <- db %>%
  group_by(id) %>%
  summarise(firm = max(firm)) %>%
  filter(firm == 1) %>%
  pull(id)
db = db[-which(db$id %in% rm), ]
# Run regression
test <- lm(y ~ x + id, data = db)

A more efficient approach is to demean the variables included in the model specification. In this way, one can exclude the fixed effects from the model. Of course, the point estimates will still be correct, but the standard errors will not be (because we are not accounting for the degrees of freedom used up by the demeaning).

# demean data
dbm <- as_tibble(db) %>%
  group_by(id) %>%
  mutate(y = y - mean(y),
         x = x - mean(x)) %>%
  ungroup()
# run regression
test2 <- lm(y ~ x, data = dbm)
# compare results
summary(test)$coefficients[2,1]
> 0.9753364
summary(test2)$coefficients[2,1]
> 0.9753364
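
Incidentally, the naive standard errors from the demeaned regression can be rescaled by hand to recover the correct ones (a minimal sketch, assuming homoskedastic errors; it reuses the objects created above):

# naive SE of x from the demeaned regression
se_naive <- summary(test2)$coefficients[2, 2]
# residual degrees of freedom: demeaned model vs. full dummy-variable model
df_naive   <- nrow(dbm) - 2                           # intercept + slope
df_correct <- nrow(dbm) - length(unique(dbm$id)) - 1  # slope + absorbed id means
# the two models share the same residual sum of squares, so rescaling works
se_corrected <- se_naive * sqrt(df_naive / df_correct)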

Another way to do this is to demean the variables and then add back their grand average.

# create data
n <- length(unique(db$id))
dbh <- dbm %>%
  mutate(yh = y + (sum(db$y)/n),
         xh = x + (sum(db$x)/n))
# run regression
test3 <- lm(yh ~ xh, data = dbh)
# compare results
summary(test)$coefficients[2,1]
> 0.9753364
summary(test2)$coefficients[2,1]
> 0.9753364
summary(test3)$coefficients[2,1]
> 0.9753364

As one can see, the three approaches report the same point estimates (again, the standard errors will differ).

When I include an additional set of fixed effects in the model specification, the three approaches no longer return the same point estimates. However, the differences seem negligible and could be due to rounding.

db$firm <- as.factor(db$firm)
dbm$firm <- as.factor(dbm$firm)
dbh$firm <- as.factor(dbh$firm)
testB <- lm(y ~ x + id + firm, data = db)
testB2 <- lm(y ~ x + firm, data = dbm)
testB3 <- lm(yh ~ xh + firm, data = dbh)
summary(testB)$coefficients[2,1]
> 0.9834414
summary(testB2)$coefficients[2,1]
> 0.9833334
summary(testB3)$coefficients[2,1]
> 0.9833334
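
As a cross-check, packages dedicated to fixed-effects estimation absorb both sets of effects at once by iterating the demeaning; a minimal sketch with the fixest package (assuming it is installed):

library(fixest)
# same two-way fixed effects model, with id and firm both absorbed
testB4 <- feols(y ~ x | id + firm, data = db)
coef(testB4)["x"]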

A similar behavior occurs if I transform my variable of interest x from a continuous variable into a dummy.

# create data
x3 <- ifelse(db$x > 0, 1, 0)
db <- db %>% mutate(x3 = x3)
dbm <- dbm %>%
  mutate(x3 = x3) %>%
  group_by(id) %>%
  mutate(x3 = x3 - mean(x3)) %>%
  ungroup()
dbh <- dbh %>%
  mutate(x3 = dbm$x3) %>%
  mutate(x3 = x3 + (sum(db$x3)/n)) %>%
  ungroup()
# Run regressions
testC <- lm(y ~ x3 + id + firm, data = db)
testC2 <- lm(y ~ x3 + firm, data = dbm)
testC3 <- lm(yh ~ x3 + firm, data = dbh)
summary(testC)$coefficients[2, 1]
> 1.579883
summary(testC2)$coefficients[2, 1]
> 1.579159
summary(testC3)$coefficients[2, 1]
> 1.579159

Now, I want to estimate both the impact of x when it is greater than 0 (i.e., x3) and when it is lower than or equal to zero (call it x4). To do that, I exclude the intercept from my model. Specifically, I do the following:

# create data
x4 <- ifelse(db$x <= 0, 1, 0)
db <- db %>% mutate(x4 = x4)
dbm <- dbm %>%
  mutate(x4 = x4) %>%
  group_by(id) %>%
  mutate(x4 = x4 - mean(x4)) %>%
  ungroup()
dbh <- dbh %>%
  mutate(x4 = dbm$x4) %>%
  mutate(x4 = x4 + (sum(db$x4)/n)) %>%
  ungroup()
testD <- lm(y ~ x3 + x4 + id + firm - 1, data = db)
testD2 <- lm(y ~ x3 + x4 + firm - 1, data = dbm)
testD3 <- lm(yh ~ x3 + x4 + firm - 1, data = dbh)
summary(testD)$coefficients[1:2, 1]
> 1.2372816 -0.3426011
summary(testD2)
> Call:
> lm(formula = y ~ x3 + x4 + firm - 1, data = dbm)
>
> Residuals:
>     Min      1Q  Median      3Q     Max
> -3.8794 -0.7497  0.0010  0.7442  3.8486
>
> Coefficients: (1 not defined because of singularities)
>    Estimate Std. Error t value Pr(>|t|)
> x3  1.57916    0.03779  41.788   < 2e-16 ***
> x4       NA         NA      NA       NA
> ... (rest of the summary output redacted)
summary(testD3)$coefficients[1:2]
> 3.254654 1.675495

As you can see, the second approach is not able to estimate the impact of x4 on y. At the same time, the first and third approaches return very different point estimates.

Is anyone able to explain to me why I cannot obtain the same point estimates in this last exercise?

Is there anything wrong with the way I include the second set of fixed effects? Or with the way I include the variables x3 and x4?

shazz

1 Answer


There are situations where centering variables can help, or where omitting the intercept can be considered OK. But there is usually no need to do so, and those procedures can lead to confusing results, as you show. There is seldom a situation where it makes sense to categorize a continuous variable.

Instead of worrying about why your approaches aren't working, try modeling the data as close to the original observations as possible. In practice, those approaches don't end up being "more efficient."

Don't center; that will simplify later predictions. Don't omit intercepts except in very specific and simple cases; omitting the intercept can make remaining coefficients difficult to interpret. Don't categorize your continuous predictor; if you think that it has a nonlinear association with outcome, as you imply toward the end of the question, use flexible modeling like regression splines.
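
For instance, a minimal sketch of the spline suggestion with the base splines package (the choice df = 4 is arbitrary and only for illustration):

library(splines)
# model x flexibly instead of dichotomizing it at zero
fit_spline <- lm(y ~ ns(x, df = 4) + id + firm, data = db)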

Then interrogate the model with respect to your hypotheses. The summary() function for a linear model is only a start. Additional software tools, like those in the R emmeans, car, or rms packages, can display and evaluate predictions for any scenario compatible with your model.
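
For example, a minimal sketch with emmeans (assuming the package is installed); emtrends() reports the fitted slope of x, averaged over the model's factor levels, together with a standard error:

library(emmeans)
# average slope of x from your full dummy-variable model testB
emtrends(testB, ~ 1, var = "x")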

EdM
  • Thank you for your reply. However, I think this discussion is off topic: when one should center variables is not the point of the question. I just want to understand how demeaning works. This procedure is common and indeed more efficient; for this reason it is used by much statistical software (e.g., Stata) and by R packages (e.g., fixest). As such, it is reasonable to try to understand how it works. Also note that I categorized the variable only for the purpose of the example; one can simply consider x3 a dummy variable in the raw data, not a transformation. – shazz Sep 28 '22 at 18:12