I want to estimate a fixed effects model in R using different estimation approaches. Note that I am working with an unbalanced panel.
The easiest way to do this is to use the function lm and include the fixed effects as factor dummies.
Example:
# load packages and create data
library(dplyr)
set.seed(123)
x <- rnorm(4000)
x2 <- rnorm(length(x))
id <- factor(sample(500,length(x),replace=TRUE))
firm <- data.frame(id = id) %>%
  group_by(id) %>%
  mutate(firm = 1:n()) %>%
  pull(firm)
id.eff <- rlnorm(nlevels(id))
firm.eff <- rexp(length(unique(firm)))
y <- x + 0.25*x2 + id.eff[id] + firm.eff[firm] + rnorm(length(x))
db = data.frame(y = y, x = x, id = id, firm = firm)
rm <- db %>% group_by(id) %>% summarise(firm = max(firm)) %>% filter(firm == 1) %>% pull(id)
db = db[-which(db$id %in% rm), ]
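As a quick check that the remaining panel is indeed unbalanced, one can look at how many observations each id has (a base R snippet purely for inspection; it does not affect anything below):
# distribution of group sizes per id after dropping the singleton groups
table(as.integer(table(droplevels(db$id))))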
# Run regression
test <- lm(y ~ x + id, data = db)
A more efficient approach is to demean the variables included in the model specification; this way, the fixed effects can be excluded from the model. Of course, the point estimates will still be correct, but the standard errors will not (because we are not accounting for the degrees of freedom used up in the demeaning).
# demean data
dbm <- as_tibble(db) %>%
  group_by(id) %>%
  mutate(y = y - mean(y),
         x = x - mean(x)) %>%
  ungroup()
# run regression
test2 <- lm(y ~ x, data = dbm)
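As an aside on the standard errors mentioned above, I think the demeaned regression could be corrected by hand for the degrees of freedom absorbed by the id means, along these lines (a rough sketch using only base R and the objects test and test2 defined above; I do not rely on it in what follows):
# number of id groups actually present after dropping the singleton groups
G <- nlevels(droplevels(db$id))
# rescale the residual degrees of freedom from N - 2 to N - G - 1
dof_fix <- sqrt(df.residual(test2) / (df.residual(test2) - (G - 1)))
# corrected standard error of x, which should line up with summary(test)$coefficients["x", 2]
summary(test2)$coefficients["x", 2] * dof_fix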
# compare results
summary(test)$coefficients[2,1]
> 0.9753364
summary(test2)$coefficients[2,1]
> 0.9753364
Another way to do this is to demean the variables and then add their grand average back:
# create data
n = length(unique(db$id))
dbh <- dbm %>%
  mutate(yh = y + (sum(db$y)/n),
         xh = x + (sum(db$x)/n))
# run regression
test3 <- lm(yh ~ xh, dbh)
# compare results
summary(test)$coefficients[2,1]
> 0.9753364
summary(test2)$coefficients[2,1]
> 0.9753364
summary(test3)$coefficients[2,1]
> 0.9753364
As one can see, the three approaches report the same point estimate (the standard errors, again, will differ).
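To make sure this is not just a matter of how many digits are printed, one could also compare the estimates directly, e.g.:
# compare the three estimates beyond the printed digits
all.equal(coef(test)[["x"]], coef(test2)[["x"]])
all.equal(coef(test2)[["x"]], coef(test3)[["xh"]])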
When I include an additional set of fixed effects in the model specification, the three approaches no longer return the same point estimate. However, the differences seem negligible and could simply be due to rounding.
db$firm <- as.factor(db$firm)
dbm$firm <- as.factor(dbm$firm)
dbh$firm <- as.factor(dbh$firm)
testB <- lm(y ~ x + id + firm, data = db)
testB2 <- lm(y ~ x + firm, data = dbm)
testB3 <- lm(yh ~ xh + firm, data = dbh)
summary(testB)$coefficients[2,1]
> 0.9834414
summary(testB2)$coefficients[2,1]
> 0.9833334
summary(testB3)$coefficients[2,1]
> 0.9833334
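To see how large the gap between the dummy-variable estimate and the demeaned estimate really is (rather than relying on the rounded printout), one could look at the raw difference:
# raw gap between the LSDV estimate of x and the id-demeaned estimate
coef(testB)[["x"]] - coef(testB2)[["x"]]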
A similar behavior occurs if I transform my variable of interest x from a continuous variable into a dummy.
# create data
x3 <- ifelse(db$x > 0, 1, 0)
db <- db %>% mutate(x3 = x3)
dbm <- dbm %>%
  mutate(x3 = x3) %>%
  group_by(id) %>%
  mutate(x3 = x3 - mean(x3)) %>%
  ungroup()
dbh <- dbh %>%
  mutate(x3 = dbm$x3) %>%
  mutate(x3 = x3 + (sum(db$x3)/n)) %>%
  ungroup()
# Run regressions
testC <- lm(y ~ x3 + id + firm, data = db)
testC2 <- lm(y ~ x3 + firm, data = dbm)
testC3 <- lm(yh ~ x3 + firm, data = dbh)
summary(testC)$coefficients[2, 1]
> 1.579883
summary(testC2)$coefficients[2, 1]
> 1.579159
summary(testC3)$coefficients[2, 1]
> 1.579159
Now, I want to estimate both the impact of x when it is greater than 0 (i.e., x3) and when it is lower than or equal to zero (call it x4). In order to do that, I exclude the intercept from my model. Specifically, I do the following:
# create data
x4 <- ifelse(db$x <= 0, 1, 0)
db <- db %>% mutate(x4 = x4)
dbm <- dbm %>%
  mutate(x4 = x4) %>%
  group_by(id) %>%
  mutate(x4 = x4 - mean(x4)) %>%
  ungroup()
dbh <- dbh %>%
  mutate(x4 = dbm$x4) %>%
  mutate(x4 = x4 + (sum(db$x4)/n)) %>%
  ungroup()
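For reference, x3 and x4 are complementary dummies in the raw data (every observation has exactly one of them equal to 1), which can be checked with:
# x3 and x4 partition the observations by the sign of x
all(db$x3 + db$x4 == 1)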
testD <- lm(y ~ x3 + x4 + id + firm - 1, data = db)
testD2 <- lm(y ~ x3 + x4 + firm - 1, data = dbm)
testD3 <- lm(yh ~ x3 + x4 + firm - 1, data = dbh)
summary(testD)$coefficients[1:2, 1]
> 1.2372816 -0.3426011
summary(testD2)
> Call:
lm(formula = y ~ x3 + x4 + firm - 1, data = dbm)
Residuals:
    Min      1Q  Median      3Q     Max
-3.8794 -0.7497  0.0010  0.7442  3.8486

Coefficients: (1 not defined because of singularities)
   Estimate Std. Error t value Pr(>|t|)
x3  1.57916    0.03779  41.788   <2e-16 ***
x4       NA         NA      NA       NA
... redacted
summary(testD3)$coefficients[1:2, 1]
> 3.254654 1.675495
As you can see, the second approach is not able to estimate the impact of x4 on y (its coefficient is dropped because of singularities). At the same time, the first and the third approach return very different point estimates.
Can anyone explain to me why I cannot obtain the same point estimates in this last exercise?
Is there anything wrong with the way I include the second set of fixed effects? Or with the way I include the variables x3 and x4?